Modern Software Experience

2012-01-09

Common GEDCOM Extension

_UID tag

The _UID tag is what I call a Common GEDCOM Extension; a GEDCOM extension that's common to many different applications. The _UID tag is a legal GEDCOM extension; it starts with an underscore, like any vendor-defined extension should. However, it would be wrong to think of the _UID tag as a vendor- or product-specific tag. The _UID tag is not specific to any particular product or vendor. It is a common GEDCOM extension, supported by quite a few different products from different vendors.

vendor support

Applications that support the _UID tag include

The above list was created by googling for GEDCOM files that contain the _UID tag. It probably isn't complete, but it does show not merely that support for the _UID tag is common, but also that the tag is supported by some of the best-known and most popular applications, such as PAF, RootsMagic and Legacy.

table index

A typical genealogy application is built on top of a relational database; it is a collection of records kept in tables, and each record has an index; a number that indicates the position of that record within the table. Quite a few applications actually show the index value, and allow you to navigate to records by number.
Applications may include these numbers in reports, to facilitate searching by number, and may even include an option not to reuse a record number after deletion of a record, to prevent the number associated with a particular individual or family record from changing.

record identifier

These index numbers are unique within each database, but no more unique than that. Typically, records are numbered sequentially starting at 1, for each new table, in each new database.
When a database is exported to GEDCOM, each record must be given an identifier. Typically, the identifier consists of a single letter followed by a number; the letter indicates the table and the number is the index value of that record within that table. For example, I2 is record number 2 in the Individual table, and F3 is record number 3 in the Family table. Prefixing the index values with a letter for the table prevents records from different tables having the same identifier.
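
The numbering scheme just described can be sketched in a few lines of Python. This is only an illustration; the function name xref_id and the table-to-letter mapping are hypothetical, not taken from any particular application.

```python
# Hypothetical mapping from table name to the letter used in identifiers.
TABLE_PREFIX = {"individual": "I", "family": "F"}


def xref_id(table: str, index: int) -> str:
    """Build an identifier such as I2 or F3 from a table name and a record index."""
    return f"{TABLE_PREFIX[table]}{index}"


print(xref_id("individual", 2))  # I2
print(xref_id("family", 3))      # F3

# Record 3 exists in both tables; only the letter keeps the identifiers apart.
print(xref_id("individual", 3) != xref_id("family", 3))  # True
```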

cross-reference

Within a GEDCOM file, the identifiers are used for cross-references. This is how a GEDCOM file captures the relationships between records. For example, the HUSB and WIFE tags within a FAM record refer to INDI records for the husband and wife. Within the GEDCOM specification the identifiers themselves are known as cross-reference identifiers and the cross-references are known as pointers.
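
To make the pointer mechanism concrete, here is a minimal Python sketch that reads a tiny GEDCOM fragment and follows the HUSB and WIFE pointers of a FAM record to the INDI records they refer to. The fragment, the names in it and the helper function parse are made up for illustration; a real GEDCOM reader has to handle far more than this.

```python
import re

# A tiny, made-up GEDCOM fragment; names and identifiers are illustrative only.
GEDCOM = """\
0 @I1@ INDI
1 NAME John /One/
0 @I2@ INDI
1 NAME Jane /Two/
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
"""

# level, optional cross-reference identifier, tag, optional value
LINE = re.compile(r"^(\d+) (?:(@[^@]+@) )?(\S+)(?: (.*))?$")


def parse(text):
    """Return {identifier: [(tag, value), ...]} holding the level-1 lines of each record."""
    records, current = {}, None
    for raw in text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue
        level, xref, tag, value = int(m.group(1)), m.group(2), m.group(3), m.group(4)
        if level == 0:
            # A level-0 line starts a new record; only records with an identifier are kept.
            current = records.setdefault(xref, []) if xref else None
        elif level == 1 and current is not None:
            current.append((tag, value))
    return records


records = parse(GEDCOM)
for tag, pointer in records["@F1@"]:
    # The pointer value is the identifier of the INDI record it refers to.
    name = dict(records[pointer])["NAME"]
    print(tag, pointer, "->", name)
```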

UUID

The cross-reference identifiers are unique within a single GEDCOM file, but different GEDCOM files are very likely to use the same identifiers for different records. For some purposes it would be nice to have truly unique identifiers. That requires two things; some way to create globally unique identifiers, and a GEDCOM tag to carry that identifier. The Universally Unique ID (UUID) is that globally unique identifier, and _UID is the tag that carries it.

Microsoft developers know the UUID as the Globally Unique IDentifier (GUID). What makes UUIDs so suitable is that they were developed precisely so that everyone can generate UUIDs, without coordinating with anyone else, and still be practically sure that the generated identifier is unique.
A UUID is a 128-bit (16-byte) number, generally represented by 32 hexadecimal digits, divided into five groups, separated by hyphens, like this: 12345678-1234-1234-1234-123456789ABC.
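
As an illustration, Python's standard uuid module can generate such a value without consulting anyone. The sketch uses a random (version 4) UUID, which is an assumption on my part; nothing here says which kind of UUID genealogy applications generate.

```python
import uuid

u = uuid.uuid4()              # random UUID, generated locally without any coordination
print(len(u.bytes) * 8)       # 128 bits
print(len(u.hex))             # 32 hexadecimal digits
print(str(u).upper())         # hyphenated form; the value differs on every run
print([len(group) for group in str(u).split("-")])  # [8, 4, 4, 4, 12]

# The example value from the text has the same layout and parses as a UUID.
example = uuid.UUID("12345678-1234-1234-1234-123456789ABC")
print(example.hex.upper())    # 12345678123412341234123456789ABC
```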


...
0 @I1@ INDI
1 NAME /One/
2 SURN One
1 SEX M
1 _UID 92FF8B766F327F48A256C3AE6DAE50D3A114
1 FAMC @F1@
1 CHAN
2 DATE 5 May 2011
3 TIME 14:42:00
...

PAF GEDCOM _UID: checksum

This is what a UUID looks like in a PAF GEDCOM. Notice that the _UID value is not divided into groups separated by hyphens, but represented as one long hexadecimal number. What is not immediately obvious because of its length is that the number shown is not a UUID. It cannot be a UUID, because a UUID is 32 hexadecimal digits long, and the _UID value is 36 hexadecimal digits long. When we hyphenate the number for readability, we find that there are four extra digits: 92FF8B76-6F32-7F48-A256-C3AE6DAE50D3-A114. The _UID value is a UUID followed by a checksum.

That little fact, essential to making sense of the _UID value, used to be undocumented. I once figured that out myself, but you don't have to do so, nor take my word for it. Nowadays, FamilySearch documents it, if you know where to look. The FamilySearch document GEDCOM Unique Identifiers states that it provides guidelines for the use of UUIDs, but it actually documents the format of the PAF _UID value; a 32-hexadigit UUID value followed by a 4-hexadigit checksum. That brief document includes some Windows C code, possibly the actual PAF source code. That code shows how to calculate the checksum.
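
The checksum routine itself is not reproduced in this article. The Python sketch below assumes a pair of running one-byte sums: the first adds up the 16 UUID bytes, the second adds up the intermediate values of the first, and each is kept modulo 256. That algorithm is an assumption, not a quote from the FamilySearch document, but it does reproduce the A114 checksum of the PAF sample shown earlier. The function name uid_checksum is hypothetical.

```python
def uid_checksum(uuid_hex: str) -> str:
    """Return the 4-hexadigit checksum for a 32-hexadigit UUID string.

    Assumed algorithm: two running sums over the 16 bytes, each kept modulo 256.
    """
    data = bytes.fromhex(uuid_hex)
    if len(data) != 16:
        raise ValueError("expected 32 hexadecimal digits")
    a = b = 0
    for byte in data:
        a = (a + byte) & 0xFF   # sum of the bytes so far
        b = (b + a) & 0xFF      # sum of the intermediate values of a
    return f"{a:02X}{b:02X}"


value = "92FF8B766F327F48A256C3AE6DAE50D3A114"   # _UID value from the PAF sample
print(uid_checksum(value[:32]))                  # A114
print(uid_checksum(value[:32]) == value[32:])    # True
```

The first two checksum digits come from the first sum and the last two from the second, so a reader can verify a 36-hexadigit value by recomputing the checksum over its first 32 digits and comparing.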

PAF is a fork of Ancestral Quest, so it is no wonder that Ancestral Quest uses the same format. But it is not just Ancestral Quest that uses this format. This UUID format is the most popular one. Other applications that use the same UUID format for their _UID tag are Family Origins for Windows, Family Tree Heritage, Family Tree Legends, Genbox Family History, Legacy Family Tree, Reunion and RootsMagic.

no checksum

Daub Ages! seems to use the same format, but a second look shows otherwise. Daub Ages! uses a 36-hexadigit value like PAF, but the last four hexadigits are always four zeroes. The checksum value used by FamilySearch PAF used to be undocumented, and not every developer bothered to figure those last four hexadigits out. It would have been better to use a 32-hexadigit value, as the extra four hexadigits are of no benefit in any way. On the contrary, they are likely to get the value rejected by applications that expect a checksum.


...
1 SOUR MYHERITAGE
2 NAME MyHeritage Family Tree Builder
2 VERS 5.5
2 _RTLSAVE RTL
2 CORP MyHeritage.com
1 DEST MYHERITAGE
...
1 _UID GWJ645C9-19X4-DF14-GQ3B-GQ3B594316C5
...

formatted UUID

Some applications use a UUID without a checksum, and actually format it like a UUID. The benefit of formatting a UUID as a UUID is that the value is likely to be recognised as a UUID. One application that uses this format is GenoPro. Another application that appears to do so is MyHeritage Family Tree Builder.

However, some Family Tree Builder GEDCOM files contain _UID values that are hyphenated like a UUID, but definitely aren't UUIDs, as they contain non-hexadecimal characters.
This appears to be a defect specific to some Family Tree Builder version that identifies itself as version 5.5. You can find several such GEDCOM files by googling for MyHeritage Family Tree Builder 2 VERS 5.5.
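
Such values are easy to detect. The Python sketch below checks a value against the hyphenated 8-4-4-4-12 layout and lists the offending characters; the names UUID_TEXT and bad_characters are hypothetical, and the sample is the _UID value from the Family Tree Builder file shown above.

```python
import re

# Hyphenated UUID layout: groups of 8, 4, 4, 4 and 12 hexadecimal digits.
UUID_TEXT = re.compile(r"[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")


def bad_characters(value: str) -> list:
    """List the characters that are neither hexadecimal digits nor hyphens."""
    return [c for c in value if c not in "0123456789ABCDEFabcdef-"]


sample = "GWJ645C9-19X4-DF14-GQ3B-GQ3B594316C5"
print(UUID_TEXT.fullmatch(sample) is None)   # True: hyphenated like a UUID, but not one
print(bad_characters(sample))                # ['G', 'W', 'J', 'X', 'G', 'Q', 'G', 'Q']
```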


0 HEAD
1 SOUR FAMILY_HISTORIAN
2 VERS 4.0
...
1 _UID {A6247491-A693-4ca4-A764-DD1A752D8C36}
...

GUID format

Microsoft prefers GUID values to be formatted as a UUID, groups of hexadigits separated by hyphens, with the entire number placed between curly brackets, like this: {12345678-1234-1234-1234-123456789ABC}. That is the format that Microsoft's GuidToString function returns. At least one application, Family Historian, uses that format.

RIN

The GEDCOM 5.5.1 specification introduced the RIN tag. RIN is an abbreviation of Record Identification Number. Appendix A defines the RIN tag as "A number assigned to a record by an originating automated system that can be used by a receiving system to report results pertaining to that record.", and its tag value AUTOMATED_RECORD_ID is described thus:

AUTOMATED_RECORD_ID:= {Size=1:12}
A unique record identification number assigned to the record by the source system. This number is intended to serve as a more sure means of identification of a record for reconciling differences in data between two interfacing systems.

That sure sounds like the RIN tag is intended to serve part of the purpose the _UID tag serves. However, it is not possible to use a UUID with the RIN tag, as its value should not exceed 12 characters. Support for the RIN tag does not seem widespread.

summary

The _UID tag is a common GEDCOM extension, supported by many well-known and popular genealogy applications. The value of the _UID tag is a UUID, but the format used for that value differs between applications.
Most applications use a single 36-hexadigit value; a 32-hexadigit value followed by a 4-hexadigit checksum. Some applications format the UUID value as a UUID is supposed to be formatted; several groups of hexadigits, separated by hyphens. At least one application uses curly brackets around the UUID.

There are some erroneous implementations. Daub Ages! has checksums that are always zero and Family Tree Builder 5.5 creates values that contain non-hexadecimal characters.

_UID advice

Applications that support the _UID tag should use the same format as FamilySearch PAF; a 32-hexadigit value followed by a 4-hexadigit checksum. This provides the highest compatibility with existing implementations, as it is already the most widely used format. The hexadecimal values should be written using uppercase letters.
Applications should not write hexadecimal values using lowercase letters, but should still read them. They should also read the hyphenated values used by GenoPro and Family Tree Builder, as well as the hyphenated values within curly brackets used by Family Historian. Applications should reject all _UID values with invalid checksums, and all _UID values containing non-hexadecimal characters.
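
A reader and writer following this advice might look like the Python sketch below. It is a sketch under assumptions, not a reference implementation: the names read_uid and write_uid are hypothetical, and the checksum is assumed to be a pair of running one-byte sums, which matches the PAF sample in this article but is not quoted from the FamilySearch document.

```python
import re


def _checksum(data: bytes) -> str:
    # Assumed PAF-style checksum: two running sums, each kept modulo 256.
    a = b = 0
    for byte in data:
        a = (a + byte) & 0xFF
        b = (b + a) & 0xFF
    return f"{a:02X}{b:02X}"


def read_uid(value: str) -> str:
    """Normalise a _UID value to 32 uppercase hexadigits, or raise ValueError."""
    text = value.strip()
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]                      # curly-bracket style
    if re.fullmatch(r"[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}", text):
        return text.replace("-", "").upper()   # hyphenated, no checksum
    if re.fullmatch(r"[0-9A-Fa-f]{36}", text):
        uuid_hex, check = text[:32].upper(), text[32:].upper()
        if _checksum(bytes.fromhex(uuid_hex)) != check:
            raise ValueError("invalid checksum")
        return uuid_hex
    raise ValueError("not a recognised _UID value")


def write_uid(uuid_hex: str) -> str:
    """Format 32 hexadigits as an uppercase UUID followed by its checksum."""
    uuid_hex = uuid_hex.upper()
    return uuid_hex + _checksum(bytes.fromhex(uuid_hex))


# Lowercase input is accepted on reading and normalised to uppercase.
print(read_uid("92ff8b766f327f48a256c3ae6dae50d3a114"))    # 92FF8B766F327F48A256C3AE6DAE50D3
# The curly-bracket and hyphenated styles reduce to the same 32 hexadigits.
print(read_uid("{A6247491-A693-4ca4-A764-DD1A752D8C36}"))  # A6247491A6934CA4A764DD1A752D8C36
```

With this reader, a value whose last four hexadigits do not match the computed checksum, or a value containing non-hexadecimal characters, raises ValueError instead of being silently accepted.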

updates

2012-11-06: Ahnenblatt

Ahnenblatt 2.72 adds support for the _UID tag.

links