Modern Software Experience

2011-01-07

XML-based GEDCOM

GEDCOM 5.5.x

date        specification                  note
1995-12-11  GEDCOM 5.5                     official standard
1998-04-20  GedML                          Michael Kay
1998-05-01  GEDCOM Future Direction Draft
1999-07-07  GEDCOM Future Direction Draft
1999-10-02  GEDCOM 5.5.1                   de facto standard
2000-12-18  GEDCOM 5.6 Draft               early GEDCOM XML
2001-12-28  GEDCOM XML Draft
2002-12-02  GEDCOM XML Beta

FamilySearch introduced GEDCOM 5.5 late in 1995 and GEDCOM 5.5.1 late in 1999. FamilySearch never made the GEDCOM 5.5.1 specification an official standard, so its official status remains that of a draft standard, but it is the de facto standard. Even FamilySearch's own PAF uses GEDCOM 5.5.1 features.

GEDCOM XML

GEDCOM 5.5.1 is the last version of GEDCOM that FamilySearch released publicly. It was preceded by the GEDCOM Future Direction document and followed by the GEDCOM XML Draft. The GEDCOM Future Direction document is the precursor to GEDCOM XML, a.k.a. GEDCOM 6.0, an ill-named would-be GEDCOM replacement that is based on another data model and another syntax; GEDCOM XML abandons the GEDCOM syntax for the XML syntax. The GEDCOM XML specification also included the idea that text could use a limited number of HTML tags for mark-up.

GEDCOM 5.6

In between the GEDCOM 5.5.1 Draft and GEDCOM XML Draft, FamilySearch released the GEDCOM 5.6 Draft. However, the GEDCOM 5.6 Draft was not released publicly. The GEDCOM 5.6 Draft was released by private communication to a few individuals. Who received it or what criteria FamilySearch used in selecting the recipients is unknown. The GEDCOM 5.6 Draft used to be hard to come by - until today. Tom Wetmore has posted the GEDCOM 5.6 draft on his DeadEnds Software domain.

XML

The first difference with the GEDCOM 5.5.1 specification is the title. The title of the GEDCOM 5.5.1 specification is THE GEDCOM STANDARD; the title of the GEDCOM 5.6 specification is THE GEDCOM SPECIFICATION, Includes GEDXML Format. GEDXML is obviously an abbreviation of GEDCOM XML, but GEDXML and GEDCOM XML are two different things. The GEDXML in the GEDCOM 5.6 specification is not like the GEDCOM XML that FamilySearch introduced later.

The GEDCOM 5.6 specification includes both the lineage-linked GEDCOM form familiar from previous GEDCOM versions and an XML-based format of the same. You can think of GEDCOM 5.6's XML-based format as GedML 5.6; FamilySearch adopted Michael Kay's idea that expressing the existing lineage-linked form in XML instead of the GEDCOM format would be a step forward already. FamilySearch does not acknowledge his influence and calls the resulting format GEDXML.

why

The GEDCOM 5.6 specification describes version 5.6 as a refinement of the GEDCOM 5.5 specification that was modified to allow inclusion of GEDXML; tag names and some structures were changed in the GEDXML form so as to better accommodate XML DTD rules.

The specification does not state that GEDXML is expected to replace traditional GEDCOM. On the contrary, it states that GEDXML is not expected to replace normal GEDCOM. The stated reason for introducing the XML-based variant of GEDCOM is to make it easier to include GEDCOM data in web pages:

At this time it is not expected that the XML version in Chapter 3 would replace the normal GEDCOM format being be used for data exchange. It is presented for those that want to use the GEDCOM formatted data in an XML form for WEB purposes.

The GEDCOM 5.6 specification is dated 1999 Oct 2 and this statement should be understood in the context of other developments of that time. Not only is GEDXML similar in syntax to HTML, but at the time, the World-Wide Web Consortium (W3C) was already creating XHTML, a well-formed, XML-based reformulation of HTML. XHTML 1.0 became a W3C Recommendation on 2000 Jan 26. The tag-based format of GEDXML also allows styling GEDXML through Cascading Style Sheets (CSS). CSS was a standard already; Cascading Style Sheets Level 1 became a W3C Recommendation on 1996 Dec 17, and CSS Level 2 became a W3C Recommendation on 1998 May 12.
Thus, the XML-based variant of GEDCOM was included in the GEDCOM 5.6 specification to make it relatively easy to integrate GEDCOM into web pages.

changes

5.5  5.5.1  5.6  GEDCOM tag  brief description
Y    N      N    BLOB        Binary Large Object.
N    N      Y    CLNDR       Calendar.
N    Y      Y    EMAIL       email address.
N    Y      Y    FACT        Fact.
N    Y      Y    FAX         fax (phone) number.
N    Y      Y    FONE        Phonetic spelling.
N    Y      Y    LATI        Latitude.
Y    N      N    LEGA        Legatee (LGTE).
N    Y      Y    LONG        Longitude.
N    Y      Y    MAP         Map coordinates.
N    N      Y    ORDL        LDS Ordination.
N    Y      Y    ROMN        Romanisation.
N    N      Y    URL         World wide web address (URL).
N    N      Y    WAC         Washing and Clothing.
N    Y      N    WWW         World wide web address (URL).

The above statements can be found in the Modifications in 5.6 section of the GEDCOM 5.6 specification. The modifications listed are with respect to GEDCOM 5.5, so it includes all the changes between GEDCOM 5.5 and GEDCOM 5.5.1 as well.

The GEDCOM 5.5.1 specification notes that it eliminated the tag BLOB and added the tags EMAIL, FACT, FAX, FONE, LATI, LONG, MAP, ROMN and WWW. The GEDCOM 5.6 specification notes that it eliminated two tags, BLOB and LEGA, and added the tags CLNDR, EMAIL, FACT, FAX, FONE, LATI, LONG, MAP, ROMN, URL and WAC.

You'd think that this means that the LEGA tag was removed in 5.6. Truth is, the LEGA tag does not appear in the GEDCOM 5.5.1 specification either. The Modifications in 5.5.1 section of the GEDCOM 5.5.1 specification omits to mention it, but the GEDCOM 5.5.1 specification eliminated the LEGA tag already.

The CLNDR tag is new in GEDCOM 5.6. That does not mean that GEDCOM 5.6 is the first version of GEDCOM to support multiple calendars. GEDCOM 5.6 supports exactly the same calendars as GEDCOM 5.5 and 5.5.1; it just specifies calendars in a different way. GEDCOM 5.5.x uses an escape sequence to specify the calendar for a DATE tag, while GEDCOM 5.6 uses a subordinate CLNDR tag.
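The difference between the two notations can be sketched in a few lines of Python. The @#D...@ escape is the GEDCOM 5.5.x notation. The value written after CLNDR (the bare calendar name) is an assumption made for illustration, not a quotation from the 5.6 draft, and split_calendar_escape is a hypothetical helper name.

```python
import re

# GEDCOM 5.5.x calendar escape, e.g. "@#DJULIAN@ 4 MAR 1700".
CALENDAR_ESCAPE = re.compile(r"^@#D([A-Z ]+)@\s*(.*)$")


def split_calendar_escape(level, date_value):
    """Rewrite a 5.5.x DATE value as a DATE line plus a subordinate CLNDR line.

    ASSUMPTION: the CLNDR payload is the bare calendar name taken from the
    escape; the 5.6 draft may spell calendar names differently.
    """
    text = date_value.strip()
    match = CALENDAR_ESCAPE.match(text)
    if match is None:
        # No escape sequence present: emit the DATE line unchanged.
        return [f"{level} DATE {text}"]
    calendar, date_text = match.groups()
    return [
        f"{level} DATE {date_text}",
        f"{level + 1} CLNDR {calendar}",
    ]


for gedcom_line in split_calendar_escape(2, "@#DJULIAN@ 4 MAR 1700"):
    print(gedcom_line)
```

A date without an escape sequence passes through as a single DATE line, so the sketch only adds a CLNDR line where the 5.5.x file named a calendar explicitly.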

The WAC tag introduced in GEDCOM 5.6 isn't a genealogical tag, but an LDS tag; WAC is an abbreviation of Washing and Clothing, and is an LDS ordinance type.

GEDCOM 5.5.1 introduces the tag WWW and GEDCOM 5.6 introduces the tag URL. It is the same tag under a different name. It is a tag used within the ADDR tag. I consider URL the better tag name. The name change between 5.5.1 and 5.6 is unfortunate, but probably happened because of GEDXML; HTML uses url (but back then, many people still upper-cased their HTML tags, so it was common to write URL instead), and FamilySearch probably wanted to avoid having two different tags for the same thing.

not changed

Remarkable are some of the things that haven't changed, despite repeated complaints from vendors that the specification is plain wrong. GEDCOM 5.6 still includes the oxymoron 8-bit ASCII. There is no such thing. ASCII is a 7-bit character set.

new

New in GEDCOM 5.6 is an entire chapter on GEDXML. Chapter 2, Lineage-Linked Grammar, is followed by chapter 3, GEDXML for Lineage-Linked Grammar. The first sentence calls the chapter an informal Document Type Definition (DTD) for a reading audience. The text claims that a formal DTD is available, but it is not printed or otherwise included with the GEDCOM 5.6 specification itself.
The notation used in chapter 3 takes some getting used to, but is readable. Anyone familiar with GEDCOM will easily recognise GEDXML as an XML-based GEDCOM.

The mapping from GEDCOM to GEDXML is pretty straightforward. A GEDXML file starts with <GEDXML> and ends with </GEDXML>. The first tag after <GEDXML> is <HEAD> and the last tag before </GEDXML> is <TRLR/>. It isn't necessary to have all three tags, but FamilySearch clearly decided that a clean and logical design of GEDXML is less important than a straightforward translation between GEDCOM and GEDXML. The GEDCOM 5.6 specification itself notes that "The empty tag trailer tag, <TRLR/>, is not needed or required, but is included for compatibility with traditional GEDCOM."
The additional note that "The <TRLR/> tag will also, theoretically, allow multiple GEDCOM files to be included with their own header and trailer tags." is a bit odd, because FamilySearch considered GEDXML a temporary solution. It also ignores the fact that the start of a new header record already allows recognising multiple GEDCOM files within a single GEDXML file.
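To illustrate how mechanical such a translation is, here is a minimal Python sketch that nests GEDCOM lines by their level numbers under a <GEDXML> root. It is not the GEDXML grammar from chapter 3: the specification says that tag names and some structures were changed, and this sketch keeps the GEDCOM tag names as they are. The function name gedcom_to_xml and the ID attribute used for record identifiers are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(lines):
    """Nest GEDCOM lines by level number under a <GEDXML> root element."""
    root = ET.Element("GEDXML")
    stack = [root]  # stack[n] is the parent element for a line at level n
    for line in lines:
        parts = line.split(" ", 2)
        level = int(parts[0])
        xref = None
        if parts[1].startswith("@"):
            # Record line such as "0 @I1@ INDI": the identifier precedes the tag.
            xref = parts[1].strip("@")
            tag, _, value = parts[2].partition(" ")
        else:
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        element = ET.SubElement(stack[level], tag)
        if xref:
            element.set("ID", xref)  # ASSUMPTION: attribute name is illustrative
        if value:
            element.text = value
        # Drop deeper levels, then make this element the parent for level + 1.
        del stack[level + 1:]
        stack.append(element)
    return root


sample = [
    "0 HEAD",
    "1 GEDC",
    "2 VERS 5.6",
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "0 TRLR",
]
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

The header, the records and the trailer all become children of the root element, and the trailer comes out as an empty element. That shows why it carries no information the closing </GEDXML> tag does not already provide.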

A separate, undated FamilySearch document released around the same time, Will GEDCOM Be Replaced By XML?, notes that FamilySearch considers GEDXML a temporary solution, a way to provide XML now, without having to wait for GEDCOM XML (GEDCOM 6.0), the real successor to GEDCOM 5.5. FamilySearch eventually abandoned GEDCOM XML, but when you know that FamilySearch did not consider GEDXML a lasting format, FamilySearch's preference for a straight conversion over a clean design makes sense.

what to use

The obvious question, now that the GEDCOM 5.6 specification is public, is whether applications should support it. That is really two questions in one: whether to support GEDCOM 5.6 and whether to support GEDXML 5.6.

GEDXML

FamilySearch itself essentially considered GEDXML obsolete before they introduced it, but the reason for that was that they were about to introduce GEDCOM XML. Now that they have abandoned GEDCOM XML, that reasoning does not seem valid anymore. The reasoning that GEDXML is a relatively easy move to XML for applications that already support GEDCOM continues to be valid.
In fact, all the reasons that Michael Kay put forward for switching from GEDCOM to GedML are just as valid now as they were back then. However, although Michael Kay provided a logical approach, a specification, some samples and source code, GedML did not catch on. Now that FamilySearch has garnered itself a name for flat out abandoning products and standards without providing alternatives, a file format proposed by FamilySearch is not likely to catch on either.
That said, it is not unlikely that a few applications will start supporting it.

GEDCOM 5.6

The difference between GEDCOM 5.5.1 and GEDCOM 5.6 is so minimal that the question which one to support seems almost irrelevant; debating which of the two specifications to support may take more time than simply supporting both. However, there are a few other arguments to consider.
One is that a vendor who wants to support GEDXML really should support GEDCOM 5.6 first. FamilySearch never defined GEDXML 5.5.1, and supporting GEDXML 5.6 without supporting GEDCOM 5.6 would be rather odd.
Another argument is that practically all current genealogy applications use GEDCOM 5.5 and GEDCOM 5.5.1 already. As GEDCOM 5.6 does not add anything new, but does contain a tag (URL) that current applications do not recognise, while they do recognise the equivalent GEDCOM 5.5.1 tag (WWW), GEDCOM 5.5.1 should remain the default output version. Because it is likely that some vendors will, for varying reasons, start supporting GEDCOM 5.6 output, all vendors should upgrade their GEDCOM readers to recognise GEDCOM 5.6 files.
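For a reader that already handles GEDCOM 5.5.1, recognising the renamed tag can be as small as an alias table applied while parsing. The following Python sketch covers only the URL/WWW rename; the other 5.6 additions (CLNDR, ORDL, WAC) would need handling of their own. The names TAG_ALIASES, normalise_tag and normalise_line are hypothetical and not taken from any existing library.

```python
# Tags that differ only in name between GEDCOM 5.6 and GEDCOM 5.5.1.
TAG_ALIASES = {"URL": "WWW"}


def normalise_tag(tag):
    """Map a GEDCOM 5.6 tag name to its 5.5.1 equivalent, if it has one."""
    return TAG_ALIASES.get(tag, tag)


def normalise_line(line):
    """Rewrite the tag of a single GEDCOM line, leaving everything else alone."""
    parts = line.split(" ")
    # The tag follows the level number, or the @XREF@ on a record line.
    index = 2 if len(parts) > 2 and parts[1].startswith("@") else 1
    if index < len(parts):
        parts[index] = normalise_tag(parts[index])
    return " ".join(parts)


print(normalise_line("2 URL http://www.example.com/"))
```

Normalising on input keeps the rest of the application working with a single tag name, and the writer can continue to emit WWW for GEDCOM 5.5.1 output.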

updates

2011-05-22: WAC

The WAC tag introduced in GEDCOM 5.6 existed in GEDCOM 5.3 already. It is absent from GEDCOM 5.4, 5.5 and 5.5.1, and then 5.6 reintroduces it.

The ORDL tag introduced in GEDCOM 5.6 is mentioned in GEDCOM 5.4 already. It is absent from GEDCOM 5.5 and 5.5.1, and then 5.6 reintroduces it. In GEDCOM 5.4, ORDL occurs in the Appendix without appearing in the GEDCOM form.
