Modern Software Experience

2014-09-28

Not as easy as it should be

GEDCOM specifications

Ideally, a method for detecting the GEDCOM version is based on the GEDCOM specifications, with perhaps some practical consideration for real-world deviations from the specifications.
Alas, early versions of the FamilySearch GEDCOM specification are not readily available. In fact, when asked, even FamilySearch themselves were unable to provide them. The information in this article is almost entirely based on the study of old GEDCOM files.

0HH
1HI 5 
1HN Phillip E. Brown              
1HS FHS
1HD FHS
...
...
0ND 

Family History System 1 (1986)

GEDCOM 1.0

The history of GEDCOM starts with GEDCOM 1.0, which was only ever implemented by Phillip Brown's Family History System (FHS). GEDCOM 1.0 files look quite different from GEDCOM 2.x and later.

GEDCOM 1.0 files are extremely rare, but because they are so different, they are easy to recognise. The GEDCOM 1.0 article provides a GEDCOM 1.0 file you can use to test your GEDCOM reader. GEDCOM 1.0 files do not start with 0 HEAD, but with 0HH.

GEDCOM 1.0 files do not specify the character set or encoding used. Family History System is an MS-DOS application. All GEDCOM 1.0 files use an MS-DOS code page, most likely code page 850 (MS-DOS Latin 1).
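
A minimal sketch of how a reader might recognise a GEDCOM 1.0 file from its first line and decode it. The function names are hypothetical, and decoding with Python's cp850 codec is an assumption: code page 850 is only the most likely code page.

```python
def looks_like_gedcom_1(data: bytes) -> bool:
    """Return True if the data starts with the GEDCOM 1.0 header line 0HH."""
    # GEDCOM 2.x and later start with "0 HEAD"; GEDCOM 1.0 starts with "0HH".
    first_line = data.splitlines()[0] if data else b""
    return first_line.strip() == b"0HH"


def decode_gedcom_1(data: bytes) -> str:
    """Decode GEDCOM 1.0 bytes, assuming MS-DOS code page 850."""
    return data.decode("cp850")
```

Because the check only needs the first line, a reader can run it before deciding how to parse the rest of the file.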

GEDCOM 2.0 and 2.1

GEDCOM version 2.0 and 2.1 files start with 0 HEAD as the first line. GEDCOM 2.0 and 2.1 are quite different; a typical GEDCOM 5.5.x reader can make some sense of a GEDCOM 2.1 file, but can't handle a GEDCOM 2.0 file. GEDCOM 2.1 uses FAMC, FAMS and CHIL records just like GEDCOM 5.5.x; GEDCOM 2.0 does not. GEDCOM 2.0 files use FAMI, CHIL, SIBL YOUNG and OTHE.
As GEDCOM 2.0 and 2.1 are quite different, it is important to distinguish between GEDCOM 2.0 and 2.1, but GEDCOM 2.x files do not contain a HEAD.GEDC.VERS record that identifies the GEDCOM version. The GEDCOM version has to be identified through other means.

0 HEAD
1 SOUR PAF 2.2
1 DEST PAF
...
...
...
0 TRLR

PAF 2.2 (1989)

0 HEAD
1 SOUR PAF 2.1
1 DEST PAF
...
...
...
0 TRLR

PAF 2.1 (1987)

0 HEAD
1 SYST
2 SOUR PAF
2 DEST PAF
...
...
0 EOF

PAF 2.0 (1986)

EOF and TRLR

The most notable difference between GEDCOM 2.0 files and GEDCOM 2.1 files is that GEDCOM 2.0 files end with 0 EOF, while GEDCOM 2.1 files end with 0 TRLR, just like GEDCOM 5.5.x files.
This is a surefire way to distinguish between GEDCOM 2.0 and 2.1, but taking advantage of this difference to detect the GEDCOM version requires reading the end of the file. Although GEDCOM 2.x files tend to be relatively small, having to examine the end of the file to determine the type of file isn't desirable.

Having to examine the end of the file to figure out the file type or version is undesirable for any file format, but it is particularly undesirable for GEDCOM files. A file that starts with a GEDCOM header does not need to end with a 0 TRLR line; it may also be the first file in a multi-volume GEDCOM file, and only the last file in that sequence ends with a 0 TRLR line (see Multi-volume GEDCOM Files).

PAF 2.0 and 2.1

A notable difference between GEDCOM 2.0 files created by PAF 2.0 and GEDCOM 2.1 files created by PAF 2.1 is that the PAF 2.0 GEDCOM files identify the source with SOUR PAF (without a PAF version number anywhere), while PAF 2.1 GEDCOM files identify the source as SOUR PAF 2.1.
This is another surefire way to detect GEDCOM 2.0 and GEDCOM 2.1, but only for GEDCOM files created by PAF.

HEAD.SYST

There is a simple, product-independent way to distinguish between GEDCOM 2.0 and GEDCOM 2.1: GEDCOM 2.1 files contain a HEAD.SOUR record, but GEDCOM 2.0 files do not; GEDCOM 2.0 files contain a HEAD.SYST.SOUR record instead.
GEDCOM 2.0 is the only version of GEDCOM to use the HEAD.SYST record.

Most GEDCOM 2.0 files still in existence were created by PAF 2. In PAF 2 GEDCOM files, the 1 SYST line is the second line of the GEDCOM file, following immediately after the 0 HEAD line.
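
The HEAD.SYST check can be sketched as a scan of the header lines only, so the end of the file never has to be read. The function name is hypothetical, and the sketch assumes the text has already been decoded and that lines consist of a level number, a tag and an optional value.

```python
def is_gedcom_2_0(text: str) -> bool:
    """Return True if the header record contains a level-1 SYST line."""
    in_header = False
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines in this sketch
        level, tag = int(parts[0]), parts[1]
        if level == 0:
            if in_header:
                break  # the next level-0 record ends the header
            in_header = tag == "HEAD"
            continue
        if in_header and level == 1 and tag == "SYST":
            return True  # HEAD.SYST only occurs in GEDCOM 2.0
    return False
```

The scan stops at the first record after the header, so it works just as well on the first file of a multi-volume GEDCOM file.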

By the way, some GEDCOM 2.0 files may specify the character set, e.g. 1 CHAR ANSEL, but not all do. GEDCOM 2.0 files that do not specify the character set should be assumed to be using CHAR MS-DOS.

0 HEAD
1 SOUR PAF 2.3
2 NAME Personal Ancestral File (R)
1 DEST PAF
...

PAF 2.3 (1994)

0 HEAD
1 SOUR PAF
2 VERS 3.0
2 NAME Personal Ancestral File (R)
1 DEST PAF
...

PAF 3.0 (1997)

GEDCOM 3.0

GEDCOM 3.0 files look much like GEDCOM 4.0, GEDCOM 5.0 and GEDCOM 5.5.x files. A major difference is that GEDCOM 3.0 files do not contain a GEDCOM version number yet.

GEDCOM 3.0 did introduce a version number for the source system. In GEDCOM 2.x files, there was no version field, so vendors added a version number to the HEAD.SOUR value. In GEDCOM 3.0 and later, the HEAD.SOUR field contains only the system identifier (e.g. PAF for Personal Ancestral File), with the version number provided in HEAD.SOUR.VERS. That enables distinguishing between GEDCOM 2.x and GEDCOM 3.0 files; a GEDCOM header that does not contain a HEAD.SOUR.VERS record is GEDCOM 2.x.

A GEDCOM header that contains a HEAD.SOUR.VERS record is GEDCOM 3.0 or later. If a GEDCOM header with a HEAD.SOUR.VERS record does not contain a HEAD.GEDC.VERS record, it is a GEDCOM 3.0 file.
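
The rules so far can be combined into one small classifier. This is a sketch under stated assumptions, not a complete reader: the helper names header_paths and classify_header are hypothetical, the input is already-decoded text, and a header with neither HEAD.SOUR.VERS nor HEAD.SYST is simply reported as GEDCOM 2.1.

```python
def header_paths(text: str) -> set[str]:
    """Collect dotted tag paths (e.g. HEAD.SOUR.VERS) found in the header."""
    paths: set[str] = set()
    stack: list[str] = []
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or elided lines in this sketch
        level, tag = int(parts[0]), parts[1]
        if level == 0 and stack:
            break  # a second level-0 line ends the header record
        del stack[level:]  # drop tags at this level and deeper
        stack.append(tag)
        paths.add(".".join(stack))
    return paths


def classify_header(text: str) -> str:
    """Classify a header using only the presence of certain records."""
    paths = header_paths(text)
    if "HEAD.GEDC.VERS" in paths:
        return "4.0 or later"  # read the HEAD.GEDC.VERS value to know more
    if "HEAD.SOUR.VERS" in paths:
        return "3.0"
    if "HEAD.SYST" in paths:
        return "2.0"
    return "2.1"
```

Classifying on the presence of records rather than on product names keeps the check product-independent.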

0 HEAD
1 SOUR ANCESTRY
2 NAME Online Family Tree
2 VERS 1.0
2 CORP MyFamily.com, Inc.
...
1 GEDC
2 VERS 4.0
...

GEDCOM 4.0

GEDCOM 4.0

GEDCOM 4.0 and later all include the HEAD.GEDC.VERS record. A GEDCOM reader detects GEDCOM 4.0 the same way it detects GEDCOM 5.5.x.

early specification of GEDCOM 4.0

The confusing claim that PAF version 2.1 (1987) through 2.31 supported an early specification of GEDCOM 4.0 (1989) is nonsense that originates with the less than accurate GEDCOM page on Wikipedia. This nonsense can now be found all over the web. The edit history for Wikipedia's GEDCOM page shows that the early specification of GEDCOM 4.0 phrase is an unsourced claim added by Wikipedia user Gioto on 2009 Feb 16 5h25m.

There is no such thing as an early specification of GEDCOM 4.0 in the FamilySearch GEDCOM release history. GEDCOM 3.0 (1987) was followed by GEDCOM 4.0 (1989).
There is no GEDCOM 3.1, no GEDCOM 3.9, no GEDCOM 4.0 Beta. There is just GEDCOM 3.0 followed by GEDCOM 4.0.

Oddities

0 HEAD
1 SOUR PAF
2 VERS 2.1
1 DEST ANSTFILE
1 CHAR ANSEL
...

GEDCOM specification

FamilySearch specification PAF 2.1 GEDCOM 3.0 example

The FamilySearch GEDCOM 5.3, 5.5, 5.5.1 and 5.6 specifications contain an example of a PAF 2.1 file in GEDCOM 3.0 format (PAF version number provided as the HEAD.SOUR.VERS line value). However, every PAF 2.1 GEDCOM file I've seen uses GEDCOM 2.1 (PAF version number included in the HEAD.SOUR line value). As far as I know, the example in the GEDCOM specification is fictional.
PAF 3.0 (1997) is the first version of PAF to provide the PAF version number as the HEAD.SOUR.VERS line value.

0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 4.0.4.18
...
1 DEST PAF
...
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR ANSEL

PAF 4.0 GEDCOM 5.5

0 HEAD
1 SOUR PAF 4.0
2 NAME Personal Ancestral File
2 VERS 4.0.4.18
...
1 DEST PAF
...
1 GEDC
2 VERS 4.0
2 FORM LINEAGE-LINKED
1 CHAR ANSEL

PAF 4.0 GEDCOM 4.0

PAF 4.0

FamilySearch PAF 4.0, an adaptation of Ancestral Quest 3.0, supports both GEDCOM 4.0 and GEDCOM 5.5.
There is a remarkable difference between PAF 4.0 GEDCOM 4.0 and PAF 4.0 GEDCOM 5.5 files; the PAF 4.0 GEDCOM 4.0 and 5.5 files both include the full PAF version number (4 numbers: 4.0.4.18) as the HEAD.SOUR.VERS line value, as they should, while the PAF 4.0 GEDCOM 4.0 files also include the abbreviated PAF version number (2 numbers: 4.0) as part of the HEAD.SOUR line value.

twice

As far as I know, PAF 4.0 is the only application that creates GEDCOM files that contain the application version number twice. This is wrong. Specifically, the HEAD.SOUR value is wrong; the line 1 SOUR PAF 4.0 should be 1 SOUR PAF.
The fact that PAF 4.0 only does this for GEDCOM 4.0 files and not for GEDCOM 5.5 files suggests that the error is deliberate, that this is done because of some backward compatibility issue.

problem and solution

PAF 2.1 and 2.2 contain the version number as part of the HEAD.SOUR line value, while PAF 2.0 does not. This makes it easy to tell those PAF versions apart, so that is probably how some third party applications do it. Those applications are confused by PAF 3.0 GEDCOM files, misrecognising them as PAF 2.0 GEDCOM files, and PAF 4.0 avoids the misrecognition by using PAF 4.0 instead of PAF as the HEAD.SOUR line value. PAF 4.0 only does this for GEDCOM 4.0 files, not for GEDCOM 5.5 files, because it is wrong and GEDCOM 5.5 files are sufficiently different from GEDCOM 2.x files anyway.
That explanation sounds plausible, but is just a hypothesis.

system identifier

Today, most GEDCOM readers only handle GEDCOM 5.5 and later. Confronted with any GEDCOM 4.0 file, such a GEDCOM reader should simply report that it does not support GEDCOM 4.0. That the system identifier inside PAF 4.0 GEDCOM 4.0 files is wrong does not matter if you don't process the file anyway.
A GEDCOM reader that does support GEDCOM 4.0 should recognise PAF 4.0 GEDCOM 4.0 files as a special case, and understand the system identifier PAF 4.0 as PAF.
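
A minimal sketch of that special case, assuming the reader has already extracted the HEAD.SOUR line value and the HEAD.GEDC.VERS value as strings; the function name is hypothetical.

```python
def normalise_system_id(sour_value: str, gedcom_version: str) -> str:
    """Map the PAF 4.0 GEDCOM 4.0 quirk 'PAF 4.0' back to plain 'PAF'."""
    value = sour_value.strip()
    # Only PAF 4.0 GEDCOM 4.0 files are known to do this, so the rule is
    # deliberately narrow; GEDCOM 2.x values such as 'PAF 2.1' are untouched.
    if gedcom_version.strip() == "4.0" and value == "PAF 4.0":
        return "PAF"
    return value
```

Keeping the rule this narrow avoids rewriting system identifiers that merely happen to contain a space.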

GEDCOM version 4+

GEDCOM files created by CommSoft Roots IV, Roots V, Palladium Family Gathering, and Palladium Ultimate Family Tree may list the GEDCOM version as 4+.

0 HEAD
1 SOUR ROOTSIV
2 VERS 1.1
1 CHAR ANSEL
...
1 GEDC
2 VERS 4+
...
 

Roots IV 1.1

0 HEAD
1 SOUR ROOTSV
2 VERS 5.01
1 DEST ANSTFILE
1 CHAR ANSEL
...
1 GEDC
2 VERS 4+
...

Roots V 5.01

0 HEAD 
1 SOUR FAMGATH 
2 VERS 1.1 
1 CHAR ANSEL 
1 DATE 22 APR 1998 
1 GEDC 
2 VERS 4+
...
 

Family Gathering 1.1

0 HEAD
1 SOUR UFTREE
2 VERS 2.5
1 CHAR ANSEL
1 DATE 5 MAR 1998
1 GEDC
2 VERS 4+
...
 

Ultimate Family Tree 2.5

Roots IV and Roots V were developed by CommSoft as part of the Roots line of genealogy applications. CommSoft partnered with Palladium Software to create Family Gathering. In 1997, Palladium purchased the entire Roots line of software from CommSoft, and then created Ultimate Family Tree (UFT) as the successor to both Family Gathering and Roots.
So, these four programs are all related to each other, and that explains why they all do the same thing.

Why these products identify the GEDCOM version as 4+ is not clear.
These are GEDCOM 4.0 files.

GEDCOM version 5.01

Several applications claim to create GEDCOM 5.01 files. Applications known to do so include Leister Production's Reunion, EasyTree, TribalPages, Généalogie and Ancestory.

0 HEAD 
1 SOUR ANCESTORY 
2 VERS V0.9
2 CORP Fore Word Press Ltd 
1 DEST USER 
...
1 GEDC 
2 VERS 5.01 
...
 

Ancestory 0.9

0 HEAD 
1 SOUR EasyTree
2 VERS V1.0
2 CORP DefCompany
1 DEST EasyTree
...
1 GEDC 
2 VERS 5.01
...

EasyTree 1.0

0 HEAD 
1 SOUR Généalogie
2 VERS V1.0
2 CORP DefCompany
1 DEST Généalogie
...
1 GEDC 
2 VERS 5.01
...
 

Généalogie 1.0

0 HEAD 
1 SOUR REUNION
2 VERS V4.0
2 CORP Leister Productions
1 DEST REUNION
...
1 GEDC 
2 VERS 5.01
...
 

Reunion 4.0

0 HEAD 
1 SOUR TribalPages
2 VERS V1.0
2 CORP TribalPages
1 DEST TribalPages
...
1 GEDC 
2 VERS 5.01
...
 

TribalPages 1.0

Incidentally, the use of the letter V in front of the product version number, while technically not illegal, is a mistake.

The problem with these files is that there is no GEDCOM version 5.01. There is GEDCOM 5.0, GEDCOM 5.1, GEDCOM 5.2, GEDCOM 5.3, GEDCOM 5.4, GEDCOM 5.5 and GEDCOM 5.5.1.

There is a GEDCOM version 5.1, and although most software developers would, if required to use two digits for the minor version number, append a zero and write that as 5.10, some software developers write that as 5.01.
The ostensible GEDCOM version 5.01 should be understood as GEDCOM version 5.1.
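
A reader can fold both ostensible versions discussed in this article, 4+ and 5.01, into the real versions before deciding whether it supports the file. A minimal sketch; the table and function name are hypothetical, and the table holds only the two aliases described here.

```python
# Ostensible HEAD.GEDC.VERS values mapped to the GEDCOM version they stand for.
VERSION_ALIASES = {
    "4+": "4.0",    # Roots IV, Roots V, Family Gathering, Ultimate Family Tree
    "5.01": "5.1",  # Reunion, EasyTree, TribalPages, Généalogie, Ancestory
}


def normalise_gedcom_version(vers_value: str) -> str:
    """Return the real GEDCOM version for a HEAD.GEDC.VERS line value."""
    value = vers_value.strip()  # several samples have trailing spaces
    return VERSION_ALIASES.get(value, value)
```

Any value not in the table is returned unchanged, so 5.5 and 5.5.1 pass through as they are.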

This mistake of misreporting the version number seems limited to GEDCOM version 5.1; I've never encountered ostensible GEDCOM 5.02, GEDCOM 5.03, GEDCOM 5.04 or GEDCOM 5.05 files.
Notice that most of the products here are version 1.0 products; the vendors of these products fixed the GEDCOM export in later versions of the product. Leister's Reunion is the only product that kept making this mistake for several versions. Reunion version 5.0 (1997) introduced support for GEDCOM 5.5.

PAF 2.2 GEDCOM 5.5

0 HEAD
1   SOUR PAF 2.2
2     VERS 2.2
1   DEST PAF
1   DATE Friday, 20th November 1992
1   FILE ROYALS.GED
1   CHAR ANSEL
1   GEDC
2     VERS 5.5
2     FORM LINEAGE-LINKED
1   SUBM @SUBM1@

tampered GEDCOM file

There is at least one ostensible PAF 2.2 file using GEDCOM 5.5, a 1992 Nov 20 release of the well-known royal.ged file.

To a human expert, it is obvious the GEDCOM header has been messed with. That the file name, royal.ged, does not match the ROYALS.GED in the header is not relevant. The PAF version number is not only provided as part of the HEAD.SOUR line value, but also as the HEAD.SOUR.VERS line, something PAF 2.2 does not do.
Notice that the lines are indented, and no version of PAF does that. The most obvious problem is that it is a supposed PAF 2.2 GEDCOM 5.5 file while PAF 2.2 doesn't support GEDCOM 5.5; in fact, PAF 2.2 (1989) predates GEDCOM 5.5 (1995) by half a dozen years. The first version of PAF to support GEDCOM 5.5 is PAF 4.0 (1999).

The ostensible PAF 2.2 GEDCOM 5.5 royal.ged file should result in a fatal error because of the indented lines, but some flexible GEDCOM readers will accept that.

PAF 5 GEDCOM 5.5.1

FamilySearch PAF 5 is infamous for using GEDCOM 5.5.1 and lying that it uses GEDCOM 5.5, yet a quick google will find PAF 5 GEDCOM files that correctly identify the GEDCOM version used as 5.5.1. There are several PAF 5 GEDCOM 5.5.1 files in the GitHub repository for gedcom4j that do include the correct GEDCOM version number. These repository samples aren't real PAF 5 GEDCOM 5.5.1 files, but hand-edited test files; someone took the trouble to correct PAF's mistake.

Ancestry.com's New Family Tree Maker

0 HEAD
1 SOUR FTM
2 VERS Family Tree Maker (21.0.0.723)
2 NAME Family Tree Maker for Windows
2 CORP Ancestry.com
...
1 CHAR UTF-8
1 FILE C:\Data\familytree.ged
1 SUBM @SUBM@
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
...

FTM 2012 Service Pack 7 (21.0.0.723)

That Ancestry.com's New Family Tree Maker writes the product name in the HEAD.SOUR.VERS record, and then follows that with the version number between parentheses, is completely wrong, not just because the product name belongs in the HEAD.SOUR.NAME record and the HEAD.SOUR.VERS record is for the version number only, but also because the resulting string exceeds the maximum length of the HEAD.SOUR.VERS line value.
This issue - the fact that New Family Tree Maker does not even get the GEDCOM header right - was first pointed out in An Early Look At FTM 2008 Beta, published before (!) the first General Availability release of New Family Tree Maker. That, more than seven years later, Ancestry.com still hasn't bothered to perform the incredibly easy fix is reprehensible.

impact

That New Family Tree Maker does not even bother to create a correct GEDCOM header impacts the ability of other applications to read the files it creates. GEDCOM readers that aren't ready to handle the illegally long HEAD.SOUR.VERS line value may reject New Family Tree Maker's files because they aren't GEDCOM files. Users can resolve the issue by fixing the Family Tree Maker GEDCOM header.
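
A tolerant reader could instead warn about the bad value and recover the version number from between the parentheses. The sketch below is one possible approach, not something any particular reader is known to do. The function name is hypothetical, and the 15-character limit is an assumption taken from the VERSION_NUMBER size in the GEDCOM 5.5 specification.

```python
import re

MAX_VERS_LENGTH = 15  # assumed limit, per VERSION_NUMBER in GEDCOM 5.5


def clean_sour_vers(value: str) -> tuple[str, list[str]]:
    """Return a cleaned HEAD.SOUR.VERS value plus any warnings raised."""
    warnings: list[str] = []
    value = value.strip()
    if len(value) > MAX_VERS_LENGTH:
        warnings.append("HEAD.SOUR.VERS exceeds the maximum length")
    # Look for a trailing "(digits.and.dots)" such as "(21.0.0.723)".
    match = re.fullmatch(r".*\(([0-9][0-9.]*)\)", value)
    if match:
        warnings.append("HEAD.SOUR.VERS contains more than a version number")
        value = match.group(1)
    return value, warnings
```

For the header shown above, this returns 21.0.0.723 together with two warnings; a well-formed value such as 4.0.4.18 is returned unchanged with no warnings.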

The messed up product version does not impact the detection of the GEDCOM version. What does impact the detection of the GEDCOM version is that New Family Tree Maker is one of many genealogy applications that uses GEDCOM version 5.5.1, but lies that it uses GEDCOM version 5.5.
Not all versions of New Family Tree Maker lie about it, some correctly identify the GEDCOM version used as GEDCOM 5.5.1.

detection

There is no sure-fire way for a GEDCOM reader to know whether a GEDCOM file has been tampered with by a user. GEDCOM does not feature any anti-tamper mechanism. On the contrary, GEDCOM was designed to be human-editable. Users will modify GEDCOM files, and they will foul up. The other side of that coin is that some applications foul up, and users can edit their files to fix them.
All a GEDCOM reader can do is read GEDCOM files as instructed by the GEDCOM header, and then issue errors and warnings for anything unexpected.

Detecting oddities such as FamilySearch's PAF 2.1 GEDCOM 3.0 example or an ostensible PAF 2.2 GEDCOM 5.5 file is technically possible. A GEDCOM reader could detect such mismatches by maintaining a database of matching application and GEDCOM versions. However, all such a GEDCOM reader can do when it detects a mismatch is report that it detected a mismatch, and that may be informative, but isn't very helpful.

A GEDCOM reader does not have to detect whether the product version and GEDCOM version match. A GEDCOM reader does not even have to detect whether a GEDCOM header is for a known product. A GEDCOM reader only has to figure out whether the GEDCOM header is valid, legal, consistent with the Byte Order Mark (BOM) if present, and last but not least, whether it is for a GEDCOM version it supports.
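
The BOM consistency check can be sketched as follows. The function names are hypothetical, and the mapping is an assumption: a UTF-8 BOM is taken to call for CHAR UTF-8 and a UTF-16 BOM for CHAR UNICODE. UTF-32 is not considered in this sketch.

```python
import codecs

# Assumed mapping from a BOM to the HEAD.CHAR value it is consistent with.
BOM_TO_CHAR = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UNICODE"),
    (codecs.BOM_UTF16_BE, "UNICODE"),
]


def bom_char_value(data: bytes):
    """Return the CHAR value implied by the BOM, or None if there is no BOM."""
    for bom, char in BOM_TO_CHAR:
        if data.startswith(bom):
            return char
    return None


def bom_matches_char(data: bytes, head_char: str) -> bool:
    """Return True if the BOM, when present, agrees with the HEAD.CHAR value."""
    expected = bom_char_value(data)
    return expected is None or expected == head_char.strip().upper()
```

A file without a BOM always passes this check, so the check only ever adds a warning or error for files whose BOM contradicts the header.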

A GEDCOM reader that specialises in reading a particular GEDCOM dialect can benefit from detecting and issuing warnings for unexpected combinations of values. All vendors should consider adding some product-specific consistency checks for GEDCOM files created by their own products.

Best Practice

GEDCOM reader

Make sure it is a GEDCOM 2+ file
Simple GEDCOM reader
Smart GEDCOM reader

GEDCOM writer

GEDCOM validator

updates

2014-09-30: Event GEDCOM

Event GEDCOM detection discusses handling of Event GEDCOM files.
This article does not incorporate the information and best practices provided there yet.

links