Modern Software Experience

2014-07-04

The Art of FamilySearch GEDCOM interpretation

GEDCOM Dates

GEDCOM files without dates are rare. Even the fan files produced by GedFan contain a date - in the GEDCOM header.
That date might look like this: 4 Jul 2014, or this: 4 JUL 2014. The difference between these two date values is the casing style used. Some applications uppercase all three letters of the month abbreviation, other applications use an initial capital, followed by two lowercase letters.

Applications that use all-uppercase include:

Applications that use IniCaps include:

The division of genealogy applications into these two categories is a simplification. The actual situation is more complex than one application using uppercase and another using IniCaps.
Family Tree Maker for DOS (FTM) and early Family Tree Maker for Windows (FTW) versions used IniCaps, FTW versions 5 through 16 use all-uppercase, and New Family Tree Maker (FTM 2008 and later, a.k.a. version 17 and later) uses all-uppercase as well. Personal Ancestral File (PAF) changed in the other direction: the original PAF, PAF for DOS, used all-uppercase. So did PAF 4, the first PAF for Windows, but PAF version 5 uses IniCaps.

That Personal Ancestral File changed the abbreviation style between version 4 and 5 is remarkable. FamilySearch never created a Windows application. The Windows versions of Personal Ancestral File are adaptations of Incline Software's Ancestral Quest created for FamilySearch. So, you might guess that PAF changed from using all-uppercase to using IniCaps because Ancestral Quest did so, but that isn't the case. All versions of Ancestral Quest use all-uppercase, yet in PAF 5, FamilySearch switched to IniCaps anyway. This change from all-uppercase to IniCaps is a deliberate change in PAF, which FamilySearch had to request from Incline Software. That they did so suggests that FamilySearch felt strongly about this change.

RootsMagic GEDCOM files may even use both styles within the same GEDCOM file. That's particularly odd, because it breaks the expectation that computer-generated files use a consistent date format. However odd it may be, it is not a problem.

Because both styles are widely used by popular applications, a GEDCOM reader that treats dates as case-sensitive, accepting one style and rejecting the other, would upset many users trying to import a GEDCOM file.
As far as I know, every GEDCOM reader treats the month abbreviations as case-insensitive, and will accept any casing style for the month abbreviations. Most GEDCOM readers will happily accept not just Jul and JUL, but unusual cases like jUL and JuL as well.
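
A minimal sketch, in Python, of how a reader can accept any casing style for a month abbreviation. The names MONTHS and month_number are made up for this illustration and are not taken from any genealogy application.

```python
# Hypothetical helper for illustration only.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def month_number(abbreviation):
    """Return 1..12 for a month abbreviation in any casing style, else None."""
    try:
        # Uppercase first, so Jul, JUL, jUL and JuL all find the same entry.
        return MONTHS.index(abbreviation.upper()) + 1
    except ValueError:
        return None


if __name__ == "__main__":
    for text in ("Jul", "JUL", "jUL", "JuL", "Xyz"):
        print(text, month_number(text))
```

A reader built this way never needs to know which casing style the exporting application prefers.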

date format warnings

For the typical GEDCOM reader, it hardly seems to matter which casing style is right or best. It would be unwise for a genealogy application developer to create a GEDCOM reader that doesn't accept GEDCOM files from popular applications merely because it doesn't like the casing style of the month abbreviations. It would also be unwise to accept a casing style but issue warnings for it; not because the casing style is insignificant, not because there is nothing for the user to do about it, but because for typical files, it will lead to a veritable flood of date format warnings, and such a flood of warnings may discourage users from looking at warnings at all.

date format validation

GEDCOM validators are a different breed of GEDCOM readers. GEDCOM validators should issue errors or warnings for anything that deviates from the specification. If reporting a particular issue generates a flood of messages, so be it. GEDCOM validators are meant for use by developers, and developers react differently to a message flood than users do. A flood of messages may help convince a developer to fix their GEDCOM reader.

There are several actively maintained GEDCOM validators. Tim Forsythe's VGED 3.04, the Chronoplex GEDCOM Validator 2.0.2.0, and Nigel Munro Parker's GED-inline 1.06 all agree: the month abbreviations may use any casing style.

full disclosure

Over the years, I have crafted several GEDCOM files for testing purposes, and created the GedFan utility, which generates GEDCOM files for testing. All these GEDCOM files have a date in the GEDCOM header, and the month abbreviation is in IniCaps.

debate

All this has not stopped one of my correspondents from arguing that the month abbreviations must be all-uppercase, that any other casing style is wrong, and that all such cases deserve an error or warning message.
He read the GEDCOM specification carefully, and his argument is far from without merit.

At the centre of this debate is FamilySearch's notoriously sloppy GEDCOM specification. There is no debate over which version of the GEDCOM specification to use; GEDCOM 5.5.1 is the de facto standard.

GEDCOM date specification

The FamilySearch GEDCOM 5.5.1 specification defines MONTH like this:

MONTH:={Size=3}
[ JAN | FEB | MAR | APR | MAY | JUN |
JUL | AUG | SEP | OCT | NOV | DEC ]
Where:

JAN  January
FEB  February
MAR  March
APR  April
MAY  May
JUN  June
JUL  July
AUG  August
SEP  September
OCT  October
NOV  November
DEC  December

There are few example GEDCOM fragments within the GEDCOM specification, but every date within these examples is all-uppercase.

later specifications

Moreover, the little-known GEDCOM 5.6 specification as well as the misnamed GEDCOM XML 6.0 specification both use all-uppercase for the month abbreviations. Both these specifications were abandoned by FamilySearch before they were implemented. In fact, FamilySearch did not even release the GEDCOM 5.6 specification; it merely leaked more than a decade after its creation.
Still, both of these later specifications use all-uppercase month abbreviations.

case-insensitive

The GEDCOM 5.5.x MONTH definition and the exclusive use of all-uppercase month abbreviations in examples aren't the whole story. The GEDCOM 5.5 and 5.5.1 specifications also contain this statement about case-insensitivity:

  • All controlled line_value choices should be considered as case insensitive. This means that the values should be converted to all uppercase or all lowercase prior to comparing. The terms UPPERCASE and UpperCase are considered equal. TAGS are always UPPERCASE.

Controlled value is another term for what is also known as an enumeration. This statement makes it clear that just because some definition, like the definition of MONTH above, enumerates all possible values using all-uppercase strings, that still does not imply that GEDCOM files may only use uppercase values. On the contrary, for controlled line values, any casing style will do, and GEDCOM readers must take care to recognise all possible variations, by uppercasing or lowercasing the value before comparing it to a fixed string.
By the way, the statement about having to convert the value before comparing is not entirely correct; an application might also rely on a case-insensitive string compare, or use custom comparison code, but it has to allow any casing style.
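
To illustrate that converting before comparing is only one option, here is a small Python sketch of three ways to test a value against the fixed string JUL: uppercasing, lowercasing, and a case-insensitive match that converts nothing. The function names are hypothetical; for plain ASCII month abbreviations all three give the same answer.

```python
import re

FIXED = "JUL"  # the fixed string a reader compares against


def matches_by_uppercasing(value):
    # Convert the value to all uppercase prior to comparing.
    return value.upper() == FIXED


def matches_by_lowercasing(value):
    # Convert both sides to all lowercase prior to comparing.
    return value.lower() == FIXED.lower()


def matches_without_converting(value):
    # Case-insensitive comparison; the value itself is left as it is.
    return re.fullmatch(FIXED, value, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    for text in ("Jul", "JUL", "jUL", "jul", "Jun"):
        print(text,
              matches_by_uppercasing(text),
              matches_by_lowercasing(text),
              matches_without_converting(text))
```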

later specifications

The same case-insensitivity paragraph is present in the GEDCOM 5.6 specification, while the GEDCOM 6.0 XML specification unequivocally specifies the use of all-uppercase month abbreviations, and refers to the specified format as changed from the previous (GEDCOM 5.5.1) format:

Date
The date the record was changed in this format:
DD MMM YYYY.
(one or two digit day, 3 letter month abbreviation in upper case, and 4 digit year, using the Gregorian calendar. For example, 3 MAR 1842 or 14 JAN 1890.)

controlled line value

Practically all genealogy software developers treat month abbreviations in GEDCOM 5.5.x as case-insensitive, and do so because of the quoted case-insensitivity paragraph.

My correspondent argues that GEDCOM dates aren't controlled line values, and that because they aren't, the case-insensitivity paragraph does not apply, and he is more than a little bit right about that.

controlled line values?

The FamilySearch GEDCOM specification does not define what is or isn't a controlled line value, so whether GEDCOM dates are a controlled line value may seem debatable, but it isn't.

I do agree with my correspondent that GEDCOM date values aren't controlled line values, but not because I agree with his argument, which is not about GEDCOM date values in general. I agree because this is the GEDCOM 5.5.1 definition of DATE_VALUE:

DATE_VALUE:={Size=1:35}
[
<DATE> |
<DATE_PERIOD> |
<DATE_RANGE> |
<DATE_APPROXIMATED> |
INT <DATE> (<DATE_PHRASE>) |
(<DATE_PHRASE>)
]

That isn't the complete DATE_VALUE definition; there is some explanatory text as well, but none of that matters here. The clincher in the above definition is the inclusion of DATE_PHRASE; a date phrase is free-form text and free-form text isn't a controlled line value.
I've stated before that I consider the inclusion of date phrases into GEDCOM a design error, but they are a part of the FamilySearch GEDCOM 5.5.1 specification, they are a part of the DATE_VALUE definition, and because of that, GEDCOM date values definitely aren't controlled line values.
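
The practical consequence for a reader can be sketched in a few lines of Python: it may uppercase the structured part of a date value before comparing keywords and month abbreviations, but should leave a parenthesised date phrase untouched, because that is free-form text. The function name and the sample values with a date phrase are made up for this illustration, and the sketch assumes the date phrase starts at the first opening parenthesis.

```python
def fold_outside_phrase(date_value):
    """Uppercase everything before the first '(' and keep the rest as written.

    Illustration only: assumes a date phrase, if present, starts at the
    first opening parenthesis and runs to the end of the line value.
    """
    head, parenthesis, phrase = date_value.partition("(")
    return head.upper() + parenthesis + phrase


if __name__ == "__main__":
    print(fold_outside_phrase("INT 4 Jul 2014 (Independence Day)"))
    print(fold_outside_phrase("(about harvest time)"))
    print(fold_outside_phrase("from Jan 1820 to DEC 1825"))
```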

Not all DATE tags have a value of type DATE_VALUE. One other possible value type is DATE_EXACT, and you can certainly debate whether line values of type DATE_EXACT are controlled line values or not.

My correspondent's argument isn't about date values in general, but about DATE_EXACT values. His position is that even the much more strictly defined line values of type DATE_EXACT still aren't controlled line values. I disagree; I think they are. However, I will not argue that point; I will argue that it does not matter instead. Before I do so, let's look at why developers choose to use either all-uppercase or IniCaps.
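
The two positions can be put side by side in a short Python sketch. It assumes a DATE_EXACT value has the shape day, month abbreviation, year, like the header date 4 Jul 2014; that shape is an assumption of this sketch and is not quoted from the specification here. The function parse_date_exact and its strict flag are hypothetical, and the sketch does not check whether the day exists in the given month.

```python
import re

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Assumed shape: one or two digit day, three letters, one to four digit year.
_EXACT = re.compile(r"([0-9]{1,2}) ([A-Za-z]{3}) ([0-9]{1,4})")


def parse_date_exact(value, strict=False):
    """Return (day, month, year) or None.

    strict=True  : my correspondent's reading, only all-uppercase months.
    strict=False : common practice, any casing style is accepted.
    """
    match = _EXACT.fullmatch(value)
    if match is None:
        return None
    day, month, year = match.groups()
    key = month if strict else month.upper()
    if key not in MONTHS:
        return None
    return int(day), MONTHS.index(key) + 1, int(year)


if __name__ == "__main__":
    print(parse_date_exact("4 Jul 2014"))
    print(parse_date_exact("4 Jul 2014", strict=True))
    print(parse_date_exact("4 JUL 2014", strict=True))
```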

all-uppercase versus IniCaps

There is a variety of reasons for developers to choose either an all-uppercase or IniCaps approach:

The GEDCOM specification is strongly associated with Personal Ancestral File (PAF), making PAF the closest thing to a reference implementation. So, many developers use IniCaps because PAF does so, and it just seems right to do the same.
However, as already pointed out, an earlier version of PAF uses all-uppercase, and no doubt several developers chose to use all-uppercase because of that.

By the way, the deliberate change from all-uppercase month abbreviations in PAF 4 to IniCaps month abbreviations in PAF 5 does not really correspond with an update of the GEDCOM specification. Sure, PAF 5.0.1.10 is the first version of PAF to support GEDCOM 5.5.1, but regarding the month abbreviations, the GEDCOM 5.5.1 specification is identical to GEDCOM 5.5, supported by PAF 4.

earlier specifications

Yet another reason for the use of IniCaps is found in earlier GEDCOM specifications. Earlier specifications not only explicitly allow and use IniCaps, but even show, with clarity surprising for such a minor issue, how the thinking on the proper case of month abbreviations evolved.

GEDCOM 5.3

The GEDCOM 5.3 specification (1993) defines MONTH thus:

MONTH:={Size=3:3}
[ JAN | FEB | MAR | APR | MAY | JUN |
JUL | AUG | SEP | OCT | NOV | DEC ]

A month name abbreviation selected from the choices above, used in forming dates.

The GEDCOM 5.3 specification makes no explicit statements about case-sensitivity. It neither mandates the use of all-uppercase, nor states that other cases are allowed. Comparison of the GEDCOM 5.3 MONTH definition above with the GEDCOM 5.4 MONTH definition below strongly suggests that GEDCOM 5.3 intended to mandate all-uppercase abbreviations only, but it uses IniCaps in an example:

There are specific subordinate GEDCOM-lines that may be used as subordinate GEDCOM-lines to other superior GEDCOM-lines. For example:

1 BIRT
2 DATE 02 Oct 1937
3 QUAY 1

That particular example is no longer present in GEDCOM 5.4 or 5.5.x.

GEDCOM 5.4

The GEDCOM 5.4 specification contains neither the GEDCOM 5.3 example, nor the GEDCOM 5.5.x statement about case-insensitivity. However, it does contain the following definition of MONTH:

MONTH:=
[ JAN | Jan | FEB | Feb | MAR | Mar | APR | Apr | MAY | May | JUN | Jun |
JUL | Jul | AUG | Aug | SEP | Sep | OCT | Oct | NOV | Nov | DEC | Dec ]
Where:

JAN | Jan  January
FEB | Feb  February
MAR | Mar  March
APR | Apr  April
MAY | May  May
JUN | Jun  June
JUL | Jul  July
AUG | Aug  August
SEP | Sep  September
OCT | Oct  October
NOV | Nov  November
DEC | Dec  December

Most example dates within the GEDCOM 5.4 specification use all-uppercase month abbreviations, but one example has an IniCaps month abbreviation, and in fact even mixes IniCaps and all-uppercase abbreviations within the same line value:

@6@ SOUR
        1 DATA
            2 EVEN BIRT, DEAT, MARR
                3 DATE from Jan 1820 to DEC 1825
            2 PLAC Madison, Connecticut

evolution

The GEDCOM 5.3 and 5.4 specifications are drafts that vendors are not supposed to use, but that does not matter here. What matters here is that there is a clear evolution from GEDCOM 5.3, through GEDCOM 5.4, to GEDCOM 5.5.x.
The GEDCOM 5.3 specification uses all-uppercase in the MONTH definition, but uses IniCaps in an example.
The intention of the GEDCOM 5.3 specification wasn't crystal-clear, and the GEDCOM 5.4 specification addressed any possible confusion as explicitly as possible, by including both all-uppercase and IniCaps abbreviations in the MONTH definition. GEDCOM 5.4 allows both all-uppercase and IniCaps, but does not allow any other casing style. The example combining both styles on a single line left little doubt that developers could mix and match both styles.
The GEDCOM 5.5 specification simplified the MONTH definition, by including only the all-uppercase values, but added the paragraph about case-insensitivity.

The specification evolved from implicitly allowing all-uppercase and IniCaps in GEDCOM 5.3, through explicitly allowing all-uppercase and IniCaps in GEDCOM 5.4, to allowing any case in GEDCOM 5.5.
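
That evolution can be summarised in a small Python sketch. The table below encodes this article's reading of each specification version, not normative text, and the names casing_style, ALLOWED_STYLES and accepted are hypothetical.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def casing_style(month):
    """Classify a month abbreviation; None if it is not a month at all."""
    if month.upper() not in MONTHS:
        return None
    if month == month.upper():
        return "all-uppercase"
    if month == month.capitalize():
        return "IniCaps"
    return "other"


# This article's reading of what each version allows (an interpretation).
ALLOWED_STYLES = {
    "5.3": {"all-uppercase", "IniCaps"},           # implicitly
    "5.4": {"all-uppercase", "IniCaps"},           # explicitly
    "5.5": {"all-uppercase", "IniCaps", "other"},  # any case
}


def accepted(month, version):
    return casing_style(month) in ALLOWED_STYLES[version]


if __name__ == "__main__":
    for month in ("DEC", "Jan", "jUL", "Xyz"):
        print(month, [accepted(month, v) for v in ("5.3", "5.4", "5.5")])
```

Under this reading, both abbreviations in the GEDCOM 5.4 example from Jan 1820 to DEC 1825 are accepted by every version, while jUL is accepted from GEDCOM 5.5 onwards only.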

controlled choice

Most genealogy software developers agree that the case-insensitivity paragraph implies that month abbreviations need neither be all-uppercase nor IniCaps, but may in fact use any casing style. My correspondent argues that MONTH may be a controlled value, but isn't the line value. The line value could be type DATE_VALUE, and that definitely isn't a controlled line value, and even the much more restricted line value of type DATE_EXACT isn't a controlled line value, and therefore the statement about case-insensitivity does not apply to DATE_EXACT values either.

sloppy specification

Here is that paragraph about case-insensitivity from the FamilySearch GEDCOM 5.5.x specification again:

  • All controlled line_value choices should be considered as case insensitive. This means that the values should be converted to all uppercase or all lowercase prior to comparing. The terms UPPERCASE and UpperCase are considered equal. TAGS are always UPPERCASE.

The paragraph really should be using must instead of should; there is a difference, and that difference does matter. This paragraph was introduced in GEDCOM 5.5, and this mistake should have been corrected in GEDCOM 5.5.1.

The paragraph does not start out by stating that Some controlled line value choices should be treated as case-insensitive, it starts out by stating that All controlled line value choices should be treated as case-insensitive.
The usage of the word All lends some credence to the idea that all controlled values are case-insensitive, but it does not merely say controlled value, it says controlled line value, and that's the reason my correspondent believes that the rule only applies to line values.

However, this paragraph does not state that all controlled line values should be treated as case-insensitive, it states that all controlled line value choices should be treated as case-insensitive.
Surely they used a different expression because they intended a different meaning. Alas, taken literally, controlled line value choices means choices for controlled line values, and that is just a somewhat convoluted way of saying controlled line values.

The odd choice of words makes a lot more sense when you merely change the order of the words from controlled line value choices to line value controlled choices, which means controlled choices within a line value.
FamilySearch deliberately used a different expression than controlled line values to try and express the notion that the case-insensitivity rule does not merely apply to entire line values, but applies to all controlled choices within a line value - such as a month abbreviation within a date.

sensible

There is no doubt that the FamilySearch GEDCOM specification is a sloppy piece of work, which leaves room for multiple interpretations, and makes it hard to figure out what was meant. However, allowing anyone to argue that some minor edit to the FamilySearch GEDCOM specification makes it fit their preferred interpretation is a truly bad idea, that will only result in yet more interpretations.
That the expression controlled line value choices is oddly different from controlled line values, and that this difference is probably significant, is a good observation, but hardly a convincing argument.

An important consideration is that allowing IniCaps is sensible, and disallowing it isn't, because so many applications were already using it, and suddenly making their output illegal would be a truly bad move.
Because of the strong relationship between GEDCOM and PAF, the fact that PAF changed from using all-uppercase to using IniCaps, without changing the GEDCOM specification on that point, is significant.
The evolution of the specification from GEDCOM 5.3, through GEDCOM 5.4, to 5.5 also suggests that FamilySearch intended to allow any case.

Moreover, when in doubt about what a specification says, it is a good idea to look at actual practice. The actual practice is that genealogy applications use both all-uppercase and IniCaps month abbreviations, and that they all process these abbreviations in a case-insensitive manner. Deviating from that practice just because you interpret a part of the specification differently isn't a very hot idea.

FamilySearch validator

The GEDCOM specification does not exist in a vacuum.
There is no official reference implementation, but because of its strong relationship with GEDCOM, PAF is the closest thing to it. There are earlier and later specifications to look at, to provide context that helps in figuring out what the specification means. We already considered PAF's behaviour, and already looked at both earlier and later specifications. There is a third source on the meaning of the specification: GedChk 0.9, FamilySearch's own GEDCOM validator.

The FamilySearch GEDCOM validator is a seriously dated MS-DOS application. The article GEDCOM Validation explains how to run it on today's Windows systems.
A quick test informs us that FamilySearch's GEDCOM validator allows month abbreviations to be all-uppercase, IniCaps and in fact any other case.

conclusion

What the specification says can be a matter of interpretation, and all things considered, in this case, the right interpretation is not what it says, but what it intended to say.
The GEDCOM specification may seem to say that month abbreviations should be all-uppercase, but that interpretation conflicts with everything else; that specification isn't sensible, it disagrees with how the specification evolved on this point, it disagrees with PAF, it disagrees with the FamilySearch GEDCOM validator, and it disagrees with actual third-party practice.
The interpretation that the specification meant to say that any casing style will do seems to require a minor edit, but the specification is known to be sloppy, and that interpretation does agree with everything else: that specification is sensible, it fits with how the specification evolved on this point, it agrees with PAF, it agrees with FamilySearch's own GEDCOM validator, and it agrees with actual third-party practice.

Best Practice

GEDCOM writers

GEDCOM readers

GEDCOM validators