Modern Software Experience

2009-05-10

Recognising GEDCOM

magic value

A basic check an application should perform when trying to import a file is making sure that it is importing the right kind of file. This is generally done by recognising the file header. Many file formats deliberately include a so-called magic value in the header that identifies the file format. For example, all Windows bitmap images start with the ASCII values for the two letters B and M, in that order.
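As a minimal illustration of such a check, the sketch below tests for the two-byte bitmap signature just mentioned. The function name looks_like_bmp is made up for this example, and a real image reader would go on to validate the rest of the header.

```python
def looks_like_bmp(data: bytes) -> bool:
    """Return True if the data starts with the ASCII letters B and M."""
    return data[:2] == b"BM"
```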

GEDCOM

The first line of a valid GEDCOM file is 0 HEAD. As a result, many genealogy applications check that the file starts with 0 HEAD.

flexibility

Some slightly more flexible readers allow a few empty lines, spaces and tabs before the magic value. That flexibility is nice, but less important than getting the magic value right.

Unicode

All but seriously outdated GEDCOM readers will recognise 0 HEAD when the file is encoded in UTF-16 ("UNICODE" in confused GEDCOM terminology) instead of UTF-8, ANSEL or ASCII. Sadly, quite a few current GEDCOM readers are seriously outdated, but this text is not about Unicode support.
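To make the point concrete, here is a rough sketch of how a reader might decode the start of a file far enough to examine the first line, whatever the encoding. The function name decode_start is hypothetical, and the handling of UTF-16 files without a byte order mark is an assumption of this sketch, not something the GEDCOM specification prescribes.

```python
import codecs


def decode_start(raw: bytes) -> str:
    """Decode the start of a file just far enough to test the first line."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le", errors="replace")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be", errors="replace")
    # No byte order mark: in UTF-16 the leading digit 0 has a NUL byte
    # next to it, on one side or the other (assumption of this sketch).
    if raw[:2] == b"0\x00":
        return raw.decode("utf-16-le", errors="replace")
    if raw[:2] == b"\x000":
        return raw.decode("utf-16-be", errors="replace")
    # ASCII, ANSEL and UTF-8 all encode "0 HEAD" as the same bytes, so a
    # byte-for-byte decoding is enough for examining the first line.
    return raw.decode("latin-1")
```

The result is only meant for header recognition; the rest of the file still has to be decoded according to the character set the header declares.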

bad magic

The fact is that most GEDCOM readers get the magic value wrong. The magic value is not 0 HEAD, but a first line that contains nothing but 0 HEAD; in other words, 0 HEAD followed by a newline.
A reader that generously wants to be flexible beyond the GEDCOM specification could accept some additional whitespace on that line.
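The difference between the two tests is small in code. The following sketch contrasts them; it assumes the file has already been decoded to text, and the names naive_check and strict_check are invented for this example.

```python
import re

# "0 HEAD" immediately followed by a line terminator (CR LF, CR or LF).
_STRICT_MAGIC = re.compile(r"0 HEAD(?:\r\n|\r|\n)")


def naive_check(text: str) -> bool:
    """The common but erroneous test: the text merely starts with 0 HEAD."""
    return text.startswith("0 HEAD")


def strict_check(text: str) -> bool:
    """The correct test: the first line contains nothing but 0 HEAD."""
    return _STRICT_MAGIC.match(text) is not None
```

For the input "0 HEADLINE NEWS\n" the naive check returns True and the strict check returns False; both accept "0 HEAD\r\n".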

insignificant detail?

The newline may seem an insignificant detail. You are not likely to encounter many stories that start with "0 HEADING OUT INTO" or something similar. Nor is it likely to match a numbered list that starts with "0 HEADQUARTERS".

That GEDCOM files start with a zero and that tags are ALL-CAPS makes it very unlikely that any random text file will pass the erroneous magic value test.
Ordinary text rarely starts with a zero, most people start numbering their lists at one instead of zero, and most text intended for humans is not written in ALL-CAPS.

That the erroneous magic value could match some random text is not the issue. There are two issues: real errors and another format.

illegal content

One issue is that a carelessly coded test may fail to recognise real errors, such as some value following the HEAD tag, for example 0 HEAD 123. That is illegal. The HEAD tag should not be followed by any value. If there is a value anyway, that’s an error; perhaps it isn’t a GEDCOM file after all.
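A reader that splits the first line into level, tag and value can report exactly what is wrong with it. The sketch below does that under simple assumptions: the line is already decoded, fields are separated by single spaces, and the function name check_head_line is hypothetical. Because it is strict, a lone trailing space is reported as an (empty) value as well.

```python
def check_head_line(line: str) -> list[str]:
    """Return the problems found on the first line; an empty list means valid."""
    parts = line.rstrip("\r\n").split(" ", 2)
    problems = []
    if parts[0] != "0":
        problems.append(f"expected level 0, found {parts[0]!r}")
    if len(parts) < 2 or parts[1] != "HEAD":
        found = parts[1] if len(parts) > 1 else ""
        problems.append(f"expected tag HEAD, found {found!r}")
    if len(parts) > 2:
        problems.append(f"HEAD must not have a value, found {parts[2]!r}")
    return problems
```

Calling check_head_line("0 HEAD 123\n") returns a single problem, the one about the value 123, while check_head_line("0 HEAD\n") returns an empty list.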

current applications

Here is a table summarising how several current genealogy applications respond to a GEDCOM file that has a value following the HEAD tag.

application                version              action
Family Tree Maker Classic  16.0.350             warning message
New Family Tree Maker      "2009" 18.0.0.307    no error or warning
Personal Ancestral File    5.2.18.0             no error or warning
Aldfaer                    4.1                  no error or warning
Behold                     0.98.9.91 "alpha"    message
Brother’s Keeper           6.3.11               no error or warning
Cognatio                   1.4.1                no error or warning
GEDCOM Explorer            1.0.0.85             no error or warning
Legacy Family Tree         7.0.0.90             no error or warning
MyBlood                    1.0.600 "Alpha 4.1"  no error or warning
RootsMagic                 4.0.1.1              no error or warning

remarks

Family Tree Maker Classic

Family Tree Maker Classic produces the message "Line 1 : Tag: HEAD, cannot have a value field: 123, ignored." The log file does not clearly label the message as either a warning or an error message, but the dialog box that prompts you to view the log file states that the import generated one warning and zero errors.

Behold

Behold produces the following message: level 0 record(s) should not have additional text following the record type. GEDCOM only allow this on "0 NOTE" records. Behold will include and display the extra text. Behold does not identify its messages as either errors or warnings, merely as "Possible GEDCOM problems".

fatal error

The table shows that most genealogy applications fail this incredibly basic test; they report neither an error nor a warning. Behold reports the issue, but does not indicate that it is an error. Family Tree Maker Classic reports the issue, but miscategorises the error as a warning.

serious issue

The inclusion of illegal text in a GEDCOM header is not some minor issue, but a very serious one, that should be treated as the fatal error it is.
Even flexible GEDCOM readers that try to support as many GEDCOM dialects as possible have to draw a hard line somewhere. The GEDCOM header is the natural compatibility boundary.
If there is a valid GEDCOM header, the reader can use it to determine what extensions to support, what errors to compensate for. If there is not even a valid GEDCOM header, the file simply isn’t a GEDCOM file and should be rejected. It makes little sense to try and bother to read the rest if the application that created the file cannot even get the GEDCOM header right.

disappointing

That most current genealogy applications fail to report an error in the GEDCOM header, even when there is illegal content on the very first line, the one line the reader must examine to check whether it could be a GEDCOM file at all, is very disappointing.

FTW TEXT

The other issue is that a check that merely tests whether a file starts with 0 HEAD is not merely likely, but certain to match the magic value of another file format. A test that merely checks that a file begins with 0 HEAD will match not only 0 HEAD, but also 0 HEADER.

FTW TEXT files start with 0 HEADER, and FTW TEXT is not just some random file format, but a proprietary format of Family Tree Maker, a genealogy application. What’s worse, Family Tree Maker actively misleads its users into thinking that FTW TEXT is GEDCOM. Thus, a GEDCOM reader is quite likely to be presented with an FTW TEXT file for processing and should produce an appropriate response when that happens.

If the GEDCOM recognition is done right, the GEDCOM reader will, upon being presented with an FTW TEXT file, inform the user that the file is not a GEDCOM file, even if it does not know about FTW TEXT files.
Alas, as related in The FTW TEXT Problem, only a few of the already mentioned applications act correctly.

The following table summarises how these applications respond to an otherwise valid GEDCOM file that starts with 0 HEADER, like an FTW TEXT file, instead of 0 HEAD.

application                version              action
Family Tree Maker Classic  16.0.350             no error or warning
New Family Tree Maker      "2009" 18.0.0.307    no error or warning
Personal Ancestral File    5.2.18.0             no error or warning
Aldfaer                    4.1                  erroneous message
Behold                     0.98.9.91 "alpha"    message
Brother’s Keeper           6.3.11               no error or warning
Cognatio                   1.4.1                error message & abort
GEDCOM Explorer            1.0.0.85             error message & abort
Legacy Family Tree         7.0.0.90             no error or warning
MyBlood                    1.0.600 "Alpha 4.1"  no error or warning
RootsMagic                 4.0.1.1              no error or warning

remarks

Aldfaer

Aldfaer’s import dialog shows the progress and highlights the message "GEDCOM bestand heeft geen HEADER record" (GEDCOM file has no HEADER record) in red. The problem is that the message is as wrong as it gets. A HEADER record is exactly what this file does have instead of a HEAD record, so this message only adds to the confusion. Aldfaer does not abort, but continues the import.

Behold

Behold produces the following message: "The HEAD record is missing. This may not be a GEDCOM file. Behold will use what it can from it." Behold does not indicate whether this is an error or warning message.

Cognatio

Cognatio puts up a messagebox with the text "Serious errors were found when searching the GEDCOM file. The import was cancelled." The messagebox icon makes it clear this is an error, and the import is aborted. The import log additionally notes that HEADER is an unknown tag.

GEDCOM Explorer

GEDCOM Explorer puts up a messagebox with the text: Not a valid Gedcom file, Error processing "GEDCOMHEADER.GED", Processing aborted. Again, the messagebox icon used conveys this is an error, and the import is aborted.

what this shows

Cognatio did best; it detects the problem, provides an error message to the user, aborts the import, and clearly documents the errors in the import log.

cause and effect

If you compare these results with those in The FTW TEXT Problem, you may notice that both Legacy and RootsMagic are confused by an actual FTW TEXT file and that neither produced any message about the erroneous header. This immediately suggests why Legacy and RootsMagic are confused; neither supports FTW TEXT, but both allow themselves to get confused by not performing proper header checks before proceeding with the rest of the file.

Detecting GEDCOM

GEDCOM magic

Detecting GEDCOM is easy; just check for the magic value, but do get the magic value right. The magic value is not 0 HEAD but 0 HEAD␤; 0 HEAD followed by a newline. ␤ is Unicode character U+2424, the Symbol for NewLine. If a file does not start with that, it is not a GEDCOM file.

If a GEDCOM file is expected, processing should be aborted with an error message.

magic value    file format  action
0 HEAD         GEDCOM       process as GEDCOM
anything else  unknown      abort

A forgiving GEDCOM reader could allow whitespace in between 0 HEAD and the newline as a non-fatal error. For ease of discussion, the rest of this text assumes a regular GEDCOM reader.
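A forgiving reader of that kind might look like the sketch below, which accepts spaces and tabs between 0 HEAD and the newline but records them as a non-fatal problem. The function name check_magic_forgiving and the wording of the warning are assumptions made for this example.

```python
import re

# "0 HEAD", optional spaces or tabs, then a line terminator.
_HEAD_LINE = re.compile(r"0 HEAD([ \t]*)(?:\r\n|\r|\n)")


def check_magic_forgiving(text: str) -> tuple[bool, list[str]]:
    """Return (is_gedcom, warnings) for the start of a decoded file."""
    match = _HEAD_LINE.match(text)
    if match is None:
        return False, []
    warnings = []
    if match.group(1):
        warnings.append("whitespace after 0 HEAD on line 1 (non-fatal)")
    return True, warnings
```

Note that 0 HEAD 123 and 0 HEADER are still rejected; only whitespace is tolerated.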

FTW TEXT magic

Detecting FTW TEXT is just as easy; its magic value differs only slightly from that of GEDCOM. The magic value for FTW TEXT is 0 HEADER followed by a newline. If a file does not start with that, it is not an FTW TEXT file.

detecting both

Because Family Tree Maker misleads its users into thinking that its FTW TEXT files are GEDCOM files, these users are likely to be confused when their ostensible GEDCOM file is refused because it is not GEDCOM.

single check

The magic values for GEDCOM and FTW TEXT both start with 0 HEAD, so it is tempting to think that a reader that supports both could get away with checking just that, but that is a mistake. Not only does the reader need to make sure it is either GEDCOM or FTW TEXT and issue an error when it encounters a value such as 0 HEADING, which is neither, it also needs to distinguish between GEDCOM and FTW TEXT.

two checks

A GEDCOM reader that supports both GEDCOM and FTW TEXT should detect both and provide an informational message that it detected a GEDCOM header or an FTW TEXT header, as the case may be.

magic value    file format  action
0 HEAD         GEDCOM       process as GEDCOM
0 HEADER       FTW TEXT     process as FTW TEXT
anything else  unknown      abort
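The table translates almost directly into code. The following sketch assumes the file has already been decoded to text; the names FileFormat and detect_format are invented for this example.

```python
import re
from enum import Enum


class FileFormat(Enum):
    GEDCOM = "GEDCOM"
    FTW_TEXT = "FTW TEXT"
    UNKNOWN = "unknown"


# Everything up to the first line terminator; the terminator is required.
_FIRST_LINE = re.compile(r"([^\r\n]*)(?:\r\n|\r|\n)")


def detect_format(text: str) -> FileFormat:
    """Classify a decoded file by the complete contents of its first line."""
    match = _FIRST_LINE.match(text)
    first_line = match.group(1) if match else None
    if first_line == "0 HEAD":
        return FileFormat.GEDCOM
    if first_line == "0 HEADER":
        return FileFormat.FTW_TEXT
    return FileFormat.UNKNOWN
```

Because the whole first line is compared, 0 HEADING and 0 HEAD 123 both end up as unknown, and the reader aborts.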

always detect

I recommend detecting FTW TEXT even if it is not supported by the reader, just to provide the user with a really helpful message: do not just produce an error message telling the user that the file is not a GEDCOM file, but add an informational message that the file is an FTW TEXT file instead of a GEDCOM file. The application help file can explain that further.
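For a reader that only supports GEDCOM, this could be as simple as the sketch below. The exception class NotGedcomError, the function require_gedcom and the message wording are all hypothetical; the point is merely that the extra check costs one comparison.

```python
import re

# Everything up to the first line terminator; the terminator is required.
_FIRST_LINE = re.compile(r"([^\r\n]*)(?:\r\n|\r|\n)")


class NotGedcomError(ValueError):
    """Raised when a file expected to be GEDCOM turns out to be something else."""


def require_gedcom(text: str) -> None:
    """Return silently for a GEDCOM file; raise a helpful error otherwise."""
    match = _FIRST_LINE.match(text)
    first_line = match.group(1) if match else None
    if first_line == "0 HEAD":
        return
    message = "This file is not a GEDCOM file."
    if first_line == "0 HEADER":
        # Not supported, but recognised, so the user can be told what it is.
        message += " It appears to be an FTW TEXT file instead."
    raise NotGedcomError(message)
```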

FTW GEDCOM

Because FTW GEDCOM is a very problematic GEDCOM dialect, it even makes sense to recognise FTW GEDCOM and issue some warnings that FTW GEDCOM files are problematic. Recognising a GEDCOM file as produced by a particular application is merely a matter of examining the GEDCOM header’s SOUR field.
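Examining the SOUR field takes only a few lines. The sketch below assumes the file has already been split into lines and that the header runs from 0 HEAD to the next level 0 line; the function name header_source is made up for this example. The identifier that Family Tree Maker writes is not given in this text, so the sketch leaves the comparison to the caller.

```python
from typing import Optional


def header_source(lines: list[str]) -> Optional[str]:
    """Return the value of the header's level 1 SOUR line, or None if absent."""
    for line in lines[1:]:              # skip the 0 HEAD line itself
        level, _, rest = line.strip().partition(" ")
        if level == "0":                # next record: the header has ended
            break
        tag, _, value = rest.partition(" ")
        if level == "1" and tag == "SOUR":
            return value
    return None
```

A reader could compare the returned value against a list of system identifiers known to need special treatment and issue its warnings accordingly.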

source

There is no known need to examine an FTW TEXT header’s SOURCE field; all FTW TEXT files were produced by FTW. However, it is not a bad idea to examine that field anyway and issue a warning when it contains another value than expected.

updates

2011-04-08: GEDCOM Tags

GEDCOM Tags provides an overview of GEDCOM tags.

2012-02-27: Detecting GEDCOM 5.5EL

GEDCOM 5.5EL discusses detection of GEDCOM 5.5EL.

links

GEDCOM

FTW TEXT Primer

applications