2013-03-11 GEDCOM Identifiers: Length

GEDCOM Identifiers

This article was originally titled GEDCOM Identifiers.
It has been renamed and is part I of the GEDCOM Identifiers quadrology now.

How long is an identifier?

GEDCOM records

A GEDCOM file presents a collection of related records as a flat text file. To express the relationships between the records, most records have an identifier, a unique name that other records can reference. The only two records without an identifier are the header and trailer records.
The GEDCOM specification refers to a record identifier as pointer or cross-reference identifier.

pointer versus cross-reference identifier

FamilySearch does not use pointer and cross-reference identifier as interchangeable synonyms; the identifier that goes with a record is a cross-reference identifier, while a reference to some record is a pointer.
Because pointer is technically incorrect and cross-reference identifier is rather long, I'll simply call them references and record identifiers, or identifiers for short.

GEDCOM grammar

The GEDCOM specification introduces GEDCOM identifiers as part of the GEDCOM grammar:

In addition to hierarchical relationships, GEDCOM defines the inter-record relationships that allow a record to be logically related to other records, without introducing redundancy. These relationships are represented by two additional, but optional, parts of a line: a cross-reference pointer and a cross-reference identifier. The cross-reference pointer "points at" a related record, which is identified by a required, matching unique cross-reference identifier. The cross-reference identifier is analogous to a primary key in relational database terminology.

The last sentence of the above quote is the only place in the entire GEDCOM specification where relational databases are mentioned.
It says the record identifiers are analogous to primary keys in relational databases. GEDCOM was in fact designed for use with relational databases; the GEDCOM specification as a whole reflects relational thinking.
Most genealogy applications are based on relational database systems, with tables corresponding to GEDCOM record types. When exporting to GEDCOM, most of these products follow an identifier naming convention that essentially exports the primary key as the record identifier.

identifier

The rules for identifiers are fairly flexible. The first character must be alphanumeric, and a few characters have special meaning. Otherwise, any arbitrary combination of characters is allowed. It is uncommon for systems to allow identifiers to start with a digit, but GEDCOM does allow this. It is easy for GEDCOM to allow this, as identifiers are always preceded and followed by an at sign (@).

Two characters that have special meaning are the colon (:) and the exclamation mark (!). They have a special meaning when used within pointers and therefore may not appear in identifiers.
The at sign ('@') may not be used within an identifier at all; a GEDCOM reader that encounters the at sign within an identifier will assume the at sign marks the end of that identifier.

forward references

The records in a GEDCOM file are in some sequential order. Although many products use the same sequence, the sequence the GEDCOM records are in is not significant. Products are allowed to export records in any sequence they like. References to record identifiers provide the relationships between the records. References to non-existing identifiers are illegal, but forward references, that is, references to records that occur further on in the file and that the GEDCOM reader may not have seen yet, are perfectly legal.

Pointers may refer to either records which have not yet appeared in the transmission (forward reference) or to records that have already appeared earlier in the transmission (backward reference). This arrangement usually requires a preliminary pass to construct a look up table to support random access by xref_ID during subsequent passes.

Most GEDCOM readers are multi-pass designs; they perform more than one pass of the GEDCOM file they are reading. Here, the GEDCOM specification itself gives one reason why this is so.
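A minimal Python sketch of such a two-pass design: the first pass builds the lookup table of record identifiers, the second checks every pointer against it. The function name, the regular expressions and the sample lines are illustrative assumptions, not taken from the GEDCOM specification, and the line parsing is deliberately simplified.

```python
import re

# Simplified GEDCOM line: level, optional @identifier@, tag, optional value.
LINE = re.compile(r"^(\d+) (?:@([^@]+)@ )?(\w+)(?: (.*))?$")
# A line value that consists solely of a pointer, such as @F1@.
# A leading number sign is excluded, as @# starts an escape sequence.
POINTER = re.compile(r"^@([^@#][^@]*)@$")


def unresolved_references(lines):
    """Return (line number, identifier) pairs whose target record does not exist."""
    parsed = [LINE.match(line.strip()) for line in lines]

    # Pass 1: build the lookup table of record identifiers.
    known = {m.group(2) for m in parsed if m and m.group(2)}

    # Pass 2: check every pointer against the table.
    missing = []
    for number, m in enumerate(parsed, start=1):
        value = m.group(4) if m else None
        pointer = POINTER.match(value) if value else None
        if pointer and pointer.group(1) not in known:
            missing.append((number, pointer.group(1)))
    return missing


sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 FAMS @F1@",  # forward reference: @F1@ is defined further on
    "1 FAMC @F2@",  # reference to a non-existing identifier
    "0 @F1@ FAM",
    "1 HUSB @I1@",  # backward reference
    "0 TRLR",
]
print(unresolved_references(sample))
```

The forward reference to @F1@ resolves without complaint because the table is complete before any pointer is checked; only the reference to @F2@ is reported.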

record!subrecord

Relational databases have records and fields. Records do not have values; records have fields, and the record fields have values. The GEDCOM specification does not have fields but subrecords. In GEDCOM a record can have a value, but often only the subrecords have one. This allows a fairly straightforward mapping between relational databases and GEDCOM.
Still, GEDCOM does not have fields, it has subrecords, and these subrecords may have an identifier of their own, just like the top-level records. In practice, all the top-level GEDCOM data records have identifiers, and the subrecords do not.

The GEDCOM grammar includes a way to address subrecords:

The pointer represents the association between two objects that usually reside in different records. Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record's cross-reference ID from the specific substructure's cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @I132!1@. Including the Record ID number in the pointer that associates objects within a record will allow the GEDCOM processors to build the index only at the record level and then search sequentially for the appropriate substructure cross-reference ID. The parent record ID is assumed when the cross-reference ID begins with a exclamation point (!) signifying an intra-record association.

The FamilySearch GEDCOM specification does not provide an example to help you understand that text.
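In the absence of an official example, here is a small Python sketch of one possible reading of the quoted text. It assumes that the substructure's identifier is the record identifier, an exclamation mark and a substructure number (as in @I132!1@), and that a pointer starting with an exclamation mark implies the record it appears in. The function name and its parameters are hypothetical.

```python
def split_pointer(pointer, current_record=None):
    """Split an enclosed pointer into (record identifier, substructure identifier).

    The substructure identifier is None for a plain record pointer.
    This follows one interpretation of the specification text, not a known practice.
    """
    body = pointer.strip("@")
    if "!" not in body:
        return body, None
    record = body.partition("!")[0]
    if record == "":
        # Leading exclamation mark: the parent record is assumed.
        if current_record is None:
            raise ValueError("intra-record pointer needs a current record")
        record = current_record
        body = current_record + body
    return record, body


print(split_pointer("@I132@"))
print(split_pointer("@I132!1@"))
print(split_pointer("@!1@", current_record="I132"))
```

Under this reading, a reader can keep its index at the record level, look up I132 there, and then search that record sequentially for the substructure identified as I132!1.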

file:record

The FamilySearch GEDCOM grammar not only allows referencing subrecords of top-level records, it also allows referencing records in other files:

The pointer must match a corresponding unique xref_ID within the transmission, unless the colon (:) character is present (which will be used in the future as a network reference to a permanent file record). A pointer is given instead of duplicating an object, though the logical result is equivalent. An expanded traversal of a record tree includes following the pointer to related records to some depth, and splicing those records (logically) into the resultant expanded tree.

Again, the FamilySearch GEDCOM specification does not provide an example.

The FamilySearch GEDCOM specification fails to mention this, but this feature is not meant for multi-volume GEDCOM files. Conceptually, a multi-volume GEDCOM file is a single file, and should contain simple identifier references as if it were a single file.

at signs

According to the GEDCOM grammar, the xref_ID is a pointer, and a pointer starts and ends with an at sign.

pointer:=
    [(0x40) + alphanum + pointer_string + (0x40) ]
where:
    (0x40)=@

pointer_char:=
    [non_at ]

pointer_string:=
    [null | pointer_char | pointer_string + pointer_char ]

...

xref_ID:=
[pointer]

The inclusion of null as a character that's allowed within identifiers seems an obvious error. However, the null included here does not signify the null character, it signifies nothing. The pointer_string value is defined recursively; it consists of zero or more allowed characters. Thus, the minimum length of the identifier enclosed within the at signs is 1 alphanumeric character. The GEDCOM grammar does not specify a maximum length for identifiers.
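The grammar above translates almost directly into a regular expression. The Python sketch below assumes that alphanum means an ASCII letter or digit; the specification's own definition of alphanum is not quoted here, so treat that character class as an assumption. The name is_pointer is illustrative.

```python
import re

# pointer        := "@" + alphanum + pointer_string + "@"
# pointer_string := zero or more pointer_char, each any character except "@"
# Assumption: alphanum is an ASCII letter or digit.
POINTER = re.compile(r"@[A-Za-z0-9][^@]*@")


def is_pointer(text):
    """True if the whole text matches the pointer production."""
    return POINTER.fullmatch(text) is not None


print(is_pointer("@I123@"))             # letter followed by digits
print(is_pointer("@1@"))                # a single digit is the minimum
print(is_pointer("@@"))                 # empty identifier: no match
print(is_pointer("@" + "9" * 50 + "@")) # the grammar itself sets no maximum
```

The `*` quantifier plays the role of the recursive pointer_string production, including its null alternative.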

GEDCOM lineage-linked form

The syntax for the lineage-linked form defines XREF like this:

XREF:={Size=1:22}
Either a pointer or an unique cross-reference identifier. If this element appears before the tag in a GEDCOM line, then it is a cross-reference identifier. If it appears after the tag in a GEDCOM line,then it is a pointer. The method of delimiting a pointer or cross-reference identifier is to enclose the pointer or cross-reference identifier within at signs (@), for example, @I123@. A XREF may not begin with a number sign (#). This is to avoid confusion with an escape sequence prefix (@#). The use of a colon (:) in the XREF is reserved for creating future network cross-references and the use of an exclamation (!) is reserved for intra-record pointers. Uniqueness of the cross-reference identifier is required within the transmission file.

XREF:FAM:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a fam_record.

XREF:INDI:= {Size=1:22}
A pointer to, or a cross-reference identifier of, an individual record.

XREF:NOTE:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a note record.

XREF:OBJE:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a multimedia object.

XREF:REPO:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a repository record.

XREF:SOUR:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a SOURce record.

XREF:SUBM:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a SUBMitter record.

XREF:SUBN:= {Size=1:22}
A pointer to, or a cross-reference identifier of, a SUBmissioN record.

maximum identifier length

The syntax provided in the GEDCOM grammar implies a minimum identifier length of one alphanumeric character. The syntax seems to allow identifiers of any length, but elsewhere in chapter 1, the maximum identifier length is stated:

XREF:={Size=1:22}
...
The cross-reference ID has a maximum of 22 characters, including the enclosing at signs (@), and it must be unique within the GEDCOM transmission.

The maximum length is 22 characters including the enclosing at signs, so the identifier itself may be no more than 20 characters long. The GEDCOM specification does not explain why the maximum length of identifiers is 20 characters, but it probably isn't a coincidence that the largest unsigned 64-bit integer (18.446.744.073.709.551.615 without the dots) is 20 characters long.

Most example GEDCOM files and GEDCOM fragments within the GEDCOM specification use purely numeric identifiers. In practice, most applications create identifiers that consist of a single letter followed by a number, as in the enclosed identifier @I123@ in the quoted text above, and because of that initial letter, the number is limited to the remaining 19 characters.
Programmers often use a signed integer type that allows a range of positive and negative numbers, instead of an unsigned integer that allows only zero and positive numbers.
A 64-bit signed integer type does not support the number range 0 ... 18.446.744.073.709.551.615, but the range -9.223.372.036.854.775.808 ... 9.223.372.036.854.775.807 instead; the largest positive integer it can handle is 19 characters long.
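The digit counts are easy to verify. This short Python check is only an illustration of the arithmetic; the variable and function names are my own.

```python
def decimal_digits(number):
    """Number of characters needed to write a non-negative integer in decimal."""
    return len(str(number))


largest_unsigned_64 = 2**64 - 1  # 18446744073709551615
largest_signed_64 = 2**63 - 1    # 9223372036854775807

print(decimal_digits(largest_unsigned_64))  # 20
print(decimal_digits(largest_signed_64))    # 19

# One letter plus the largest signed 64-bit number fits exactly.
print(len("I" + str(largest_signed_64)))    # 20
```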

The GEDCOM grammar and GEDCOM form contradict each other.

at signs

The GEDCOM grammar limits identifiers to 22 characters, and the XREF definition in the GEDCOM lineage-linked form limits identifiers to 22 characters as well, yet those two limits are not in agreement with each other.

According to the syntax provided by the GEDCOM grammar, the at signs are part of the pointer.
The text of the XREF definition strongly suggests that the enclosing at signs are not part of the identifier itself. The rest of the lineage-linked syntax form confirms this; the enclosing at signs are already there.
So, the GEDCOM grammar says the delimiting at signs are part of the pointer or xref_ID (cross-reference identifier), and the GEDCOM lineage-linked form says those at signs are not part of the XREF (pointer or an unique cross-reference identifier).
The GEDCOM grammar and GEDCOM form contradict each other.

This is remarkably sloppy for a specification that's gone through multiple major and minor versions since its introduction more than a quarter century ago. The FamilySearch GEDCOM specification distinguishes between definition of and references to record identifiers, yet still contradicts itself regarding what constitutes an identifier in the first place.

The maximum length of a record identifier is 20 characters.

identifier length

The GEDCOM syntax takes precedence over any GEDCOM form; if a form were allowed to redefine the GEDCOM syntax there would be no GEDCOM syntax.
The GEDCOM syntax says that the maximum length of identifiers including the enclosing at signs is 22 characters, so that's the maximum. No GEDCOM form can increase that to 22 characters excluding the enclosing at signs. The lineage-linked form is in error. The maximum length of a GEDCOM record identifier is 20 characters.

The maximum length of an enclosed identifier is 22 characters.

inconsistent terminology

Other than the maximum length of the GEDCOM identifiers, a maximum that vendors are unlikely to exceed anyway, deviation from the GEDCOM syntax isn't the issue here. The real issue, the one that enabled this error and allowed it to go unnoticed, is inconsistent terminology.
It makes sense to give precedence to the GEDCOM syntax as a general rule, but the usage within the GEDCOM form corresponds better with the general understanding of what an identifier is. Within the XREF definition, the GEDCOM form expresses the relationship between GEDCOM identifiers and the enclosing at signs thus:

The method of delimiting a pointer or cross-reference identifier is to enclose the pointer or cross-reference identifier within at signs (@), for example, @I123@.

Moreover, the GEDCOM form already uses the word identifier for the HEAD.SOUR line value, another 20-character value.

The maximum length of a GEDCOM record identifier is 20 characters. A record identifier is enclosed by two at signs. The maximum length of an enclosed identifier is 22 characters.
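Keeping the two lengths apart in code avoids the very confusion described above. A minimal Python sketch, with constant and function names of my own choosing:

```python
MAX_IDENTIFIER = 20                # the identifier itself, without at signs
MAX_ENCLOSED = MAX_IDENTIFIER + 2  # the identifier enclosed by two at signs


def enclose(identifier):
    """Enclose an identifier in at signs, refusing lengths outside 1..20."""
    if not 1 <= len(identifier) <= MAX_IDENTIFIER:
        raise ValueError(f"identifier length {len(identifier)} is outside 1..{MAX_IDENTIFIER}")
    return "@" + identifier + "@"


print(enclose("I123"))                 # @I123@
print(len(enclose("I" + "9" * 19)))    # 22, the maximum enclosed length
```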

file:record!subrecord

The colon (:) and exclamation mark (!) have special meaning. The colon is used to create references to records in other files. The exclamation mark is used to reference subrecords. There are no examples for either, and neither mechanism seems to be used in practice.
The question many developers have struggled with after reading about these mechanisms in the GEDCOM grammar is whether they should support them, and whether third parties support them.
There is no need to support these mechanisms. Any vendor that does support them is probably in error.

The GEDCOM grammar defines these mechanisms, but also states that:

For the time being, however, the use of pointers is explicitly defined within the GEDCOM form, such as the Lineage-Linked GEDCOM Form defined in Chapter 2 (see page 19).

and the lineage-linked form states:

XREF:={Size=1:22}
...
The use of a colon (:) in the XREF is reserved for creating future network cross-references and the use of an exclamation (!) is reserved for intra-record pointers. Uniqueness of the cross-reference identifier is required within the transmission file.

The GEDCOM grammar defines these mechanisms, but does not require that any GEDCOM form uses them, and the lineage-linked form explicitly states that the colon (:) is not used, but merely reserved for future use, and that the exclamation mark (!) is reserved as well.
It does not say the exclamation mark is reserved for future use, but that is implied, as the GEDCOM 5.5.1 lineage-linked form does not use references to subrecords. GEDCOM 5.5.1 does not even include identifiers for subrecords.

Thus, the colon (:) and exclamation mark (!) are merely reserved for some future use. They should not be used within identifiers.

Best Practice

GEDCOM reader

Identifiers must be unique.
Report a fatal error and abort when two different records have the same identifier.
A smart GEDCOM reader could issue a non-fatal error if the two records are in fact identical, but a GEDCOM file should not contain the same record twice.
GEDCOM identifiers may be 20 characters long.
Report an error for each identifier that is longer than that.
Recognise that the colon (:) and exclamation mark (!) are reserved for future use.
Report a fatal error and abort when a GEDCOM 5.5.x identifier contains either of these.
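The reader rules above could be sketched in Python as follows. This only looks at level-0 lines and ignores everything else a real reader must do; FatalGedcomError and read_identifiers are hypothetical names, and the line pattern is a simplification.

```python
import re

# A level-0 line that defines a record: 0 @identifier@ TAG
RECORD_LINE = re.compile(r"^0 @([^@]+)@ (\w+)")
MAX_IDENTIFIER = 20


class FatalGedcomError(Exception):
    """Hypothetical exception type; the reader aborts when it is raised."""


def read_identifiers(lines):
    """Return (identifier -> line number, list of non-fatal error messages)."""
    seen = {}
    errors = []
    for number, line in enumerate(lines, start=1):
        match = RECORD_LINE.match(line)
        if not match:
            continue
        identifier = match.group(1)
        # Reserved characters: fatal.
        if ":" in identifier or "!" in identifier:
            raise FatalGedcomError(f"line {number}: reserved character in @{identifier}@")
        # Duplicate identifier: fatal.
        if identifier in seen:
            raise FatalGedcomError(
                f"line {number}: @{identifier}@ already defined on line {seen[identifier]}"
            )
        # Over-long identifier: reported, but reading continues.
        if len(identifier) > MAX_IDENTIFIER:
            errors.append(f"line {number}: @{identifier}@ is longer than {MAX_IDENTIFIER} characters")
        seen[identifier] = number
    return seen, errors
```

This sketch always treats a duplicate as fatal; the smarter variant that compares the two records for equality first is left out.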

GEDCOM writer

Identifiers must be unique. Ensure that they are.
- Export each record at most once.
- Follow the naming convention most vendors use.
GEDCOM 5.5.x identifiers may be 20 characters long.
Respect that maximum.
Do not use the colon (:), exclamation mark (!) or at sign (@) within an identifier.
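For a writer that follows the letter-plus-number convention mentioned earlier, these rules could look like the Python sketch below. It assumes the identifier is built from a one-letter record-type prefix and a non-negative integer primary key; make_identifier is a hypothetical helper, not part of any real library.

```python
MAX_IDENTIFIER = 20


def make_identifier(prefix, key):
    """Build an identifier such as I123 from a record-type letter and a primary key."""
    if len(prefix) != 1 or not (prefix.isascii() and prefix.isalpha()):
        raise ValueError("prefix must be a single ASCII letter")
    if not isinstance(key, int) or key < 0:
        raise ValueError("key must be a non-negative integer")
    identifier = f"{prefix}{key}"
    # Letters and digits only, so colon, exclamation mark and at sign cannot occur.
    if len(identifier) > MAX_IDENTIFIER:
        raise ValueError(f"{identifier} is longer than {MAX_IDENTIFIER} characters")
    return identifier


print("0 @" + make_identifier("I", 123) + "@ INDI")  # 0 @I123@ INDI
```

Uniqueness then follows from the uniqueness of the primary key within each record type, provided each record is exported only once.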

GEDCOM validator

Identifiers should be unique.
Report an error for each reuse of an identifier.
GEDCOM 5.5.1 identifiers may be 20 characters long.
Report an error for each identifier that is longer than that.
Identifiers should not contain a colon (:) or exclamation mark (!).
Report an error when an identifier contains either of these.
Feel free to report a fatal error and abort.

updates

2013-03-14: Common GEDCOM Identifier Naming Convention

Common GEDCOM Identifier Naming Convention documents the common naming convention, complete with Best Practices.
