Modern Software Experience

2013-02-05

GEDCOM files are text files

text file

Here's something the GEDCOM 5.5.1 specification should tell you but doesn't: a GEDCOM file is a structured text file.
Seriously, that statement is not in there. The phrase "text file" does occur in the GEDCOM 5.5.1 specification, but just once, in chapter 4, where FamilySearch asks vendors to send them a copy of the GEDCOM product with installation procedures and a text file containing relevant technical documentation about the product's GEDCOM implementation.
The phrase "text file" does not occur anywhere else.

a database in the form of a sequential stream of related records

That GEDCOM files are text files may seem obvious when you've seen a few GEDCOM files, but the GEDCOM specification never says so.
Chapter 1 of the specification practically starts by defining a GEDCOM file (oddly called a transmission) thus: "A GEDCOM transmission represents a database in the form of a sequential stream of related records."

The GEDCOM specification does not start by telling you a GEDCOM file is a text file. It starts by telling you a GEDCOM file is a database, one that consists of "a sequential stream of related records". That hardly sounds the same.

It is only when you continue reading that paragraph that you'll begin to suspect that a GEDCOM file is indeed a text file, because it goes on to talk about lines and line terminators:

A GEDCOM transmission represents a database in the form of a sequential stream of related records. A record is represented as a sequence of tagged, variable-length lines, arranged in a hierarchy. A line always contains a hierarchical level number, a tag, and an optional value. A line may also contain a cross-reference identifier or a pointer. The GEDCOM line is terminated by a carriage return, a line feed character, or any combination of these.

That paragraph says that a GEDCOM file is a sequence of records, and that each of these records consists of several lines. These lines are variable-length, and terminated by a choice of line terminators. It almost sounds like they are describing a text file.
Almost, but not quite…

GEDCOM grammar

The GEDCOM grammar in chapter 1 is incomplete. It lacks the top level definitions.
It lacks the definition of a GEDCOM file as a sequence of GEDCOM records, and the definition of a GEDCOM record as a sequence of GEDCOM lines.
That information is in the text, but not in the grammar.
Together, these absent definitions imply that a GEDCOM file is a sequence of GEDCOM lines.

GEDCOM line

Chapter 1 of the GEDCOM 5.5.1 specification describes the syntax of a GEDCOM line thus:

A gedcom_line has the following syntax:
gedcom_line:=
level + delim + [optional_xref_ID] + tag + [optional_line_value] + terminator

GEDCOM Syntax

The section titled Grammar Rules, just before the Grammar Syntax section, provides several rules for GEDCOM records, GEDCOM lines and their level numbers. It notes that long values are broken into multiple lines using the CONC and CONT tags. It remarks that GEDCOM writers should not produce leading white space, but that GEDCOM readers should ignore it when it occurs.
It also says something about the maximum size of a logical GEDCOM record.
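The gedcom_line syntax and the leading-white-space rule can be sketched in a few lines of Python. This is a minimal illustration, not a complete GEDCOM parser: the names `GedcomLine` and `parse_gedcom_line` are mine, the pattern for the cross-reference identifier is simplified, and the function assumes the terminator has already been stripped from the line.

```python
import re
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]


# level + delim + [optional_xref_ID] + tag + [optional_line_value]
# Leading white space is tolerated on input, as the Grammar Rules advise.
# The xref pattern is a simplification of the grammar in the specification.
_LINE_RE = re.compile(
    r"^\s*(?P<level>\d{1,2})"
    r" (?:(?P<xref>@[^@\s][^@]*@) )?"
    r"(?P<tag>[A-Za-z0-9_]+)"
    r"(?: (?P<value>.*))?$"
)


def parse_gedcom_line(line: str) -> GedcomLine:
    """Split one GEDCOM line (terminator already removed) into its parts."""
    m = _LINE_RE.match(line)
    if m is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    return GedcomLine(int(m["level"]), m["xref"], m["tag"], m["value"])


print(parse_gedcom_line("0 @I1@ INDI"))
print(parse_gedcom_line("   1 NAME John /Smith/"))
```

Note that a value such as the pointer in `1 FAMS @F1@` ends up in `value`, not in `xref_id`; the cross-reference identifier only ever appears between the level and the tag.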

logical record size

The size of a logical record in a GEDCOM file is easy to determine: a record starts with a 0-level line, and ends where the next record is started by the next 0-level line. The GEDCOM specification states that logical records should fit in a 32K memory buffer, and immediately follows with the statement that all length constraints are provided in characters, not bytes. Whether that second statement is meant to apply to the first one is not clear, and implementers should err on the safe side, but the issue is moot. The limitation itself is obsolete. Nowadays, it is not unlikely that a single note field is larger.
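How a record is delimited by 0-level lines can be sketched as follows. The function names are hypothetical, and measuring the size in bytes without line terminators is my assumption; the specification does not make clear which unit it has in mind.

```python
from typing import Iterable, Iterator, List


def split_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group GEDCOM lines into logical records; a 0-level line starts a record."""
    record: List[str] = []
    for line in lines:
        level = line.lstrip().split(" ", 1)[0]
        if level == "0" and record:
            yield record
            record = []
        record.append(line)
    if record:
        yield record


def record_size(record: List[str], encoding: str = "utf-8") -> int:
    """Size of a record in bytes, not counting line terminators (an assumption)."""
    return sum(len(line.encode(encoding)) for line in record)


lines = ["0 HEAD", "1 CHAR UTF-8", "0 @I1@ INDI", "1 NAME John /Smith/", "0 TRLR"]
for record in split_records(lines):
    print(record[0], record_size(record))
```

A reader that wants to flag records above the 32K figure only has to compare `record_size` against it; nothing in the logic requires the limit to exist.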

The stated maximum logical record size of 32 KB is seriously dated. It made a lot of sense when a home computer with 64 KB of RAM was uncommon, and continued to make some sense during the period of 16-bit software (MS-DOS, Classic Mac OS, Windows 3.x). It does not make sense now that 32-bit software is legacy software running on 64-bit systems.
The notion of a maximum size for logical GEDCOM records is obsolete.

line length

The statement "All length constraints are provided in characters, not bytes." is seriously confused.
It is an erroneous statement by an editor unfamiliar with the character sets and encodings used by GEDCOM.

It is simply not possible to specify a memory buffer in characters, unless you limit yourself to a very simple character set such as ASCII. In both ANSEL and Unicode, coding a single character may require multiple code points. On top of that, the number of code units needed to encode a Unicode code point varies with both the encoding and the code point.
Character sets are complex. FamilySearch would have been wise to hire an expert to edit the GEDCOM specification.

Programmers are taught to distinguish between the length of a string and the size of a buffer. Ideally, the length of a string is measured in characters or code points. In practice, the length of a string is measured in code units.
The size of a memory buffer is specified in bytes.
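The difference between characters, code points, code units and bytes is easy to demonstrate. The sketch below is only an illustration; the helper name `measure` is mine, and it covers only encodings that Python supports out of the box, so ANSEL is left out.

```python
def measure(text: str, encoding: str) -> dict:
    """Count code points, code units and bytes for text in the given encoding."""
    # Size of one code unit in bytes; only these encodings are handled here.
    unit_size = {"ascii": 1, "utf-8": 1, "utf-16-le": 2}[encoding]
    data = text.encode(encoding)
    return {
        "code_points": len(text),  # Python strings are sequences of code points
        "code_units": len(data) // unit_size,
        "bytes": len(data),
    }


# One character as the user sees it, written as "e" plus a combining acute accent.
e_acute = "e\u0301"
print(measure(e_acute, "utf-8"))      # 2 code points, 3 code units, 3 bytes
print(measure(e_acute, "utf-16-le"))  # 2 code points, 2 code units, 4 bytes

# One code point outside the Basic Multilingual Plane (a Gothic letter).
gothic = "\U00010330"
print(measure(gothic, "utf-8"))       # 1 code point, 4 code units, 4 bytes
print(measure(gothic, "utf-16-le"))   # 1 code point, 2 code units, 4 bytes
```

None of the four counts for a given string can be derived from the number of characters alone, which is why a buffer size expressed in characters is meaningless.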

The line that follows, "When wide characters (characters wider than 8 bits) are used, byte buffer lengths [sic] should be adjusted accordingly.", makes clear what the FamilySearch editor meant to convey: all lengths are in code units.

The maximum length of a GEDCOM line is 255 code units.
That is 255 bytes for both 7-bit (ASCII) and 8-bit (ANSEL and UTF-8) character sets and encodings, but is double that, 510 bytes, for UTF-16, a 16-bit encoding; the same byte doubling applies to other stated lengths. This ensures that any value that fit onto a single line before the introduction of UTF-16 as a legal GEDCOM encoding (GEDCOM 5.5, in 1995), still fits on a single line when encoded as UTF-16. By the way, the reverse is not true; what fits on a single line in UTF-16 may require multiple lines in UTF-8.
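A length check in code units, and the asymmetry between UTF-16 and UTF-8, can be sketched like this. The names are hypothetical, ANSEL is omitted because Python has no standard codec for it, and whether the terminator counts towards the limit is left to the caller, who decides what string to pass in.

```python
MAX_LINE_CODE_UNITS = 255

# Bytes per code unit for the encodings handled by this sketch.
_UNIT_SIZE = {"ascii": 1, "utf-8": 1, "utf-16-le": 2, "utf-16-be": 2}


def code_units(text: str, encoding: str) -> int:
    """Number of code units needed to encode text in the given encoding."""
    return len(text.encode(encoding)) // _UNIT_SIZE[encoding]


def fits_on_line(line: str, encoding: str) -> bool:
    """True if the line stays within the 255 code unit limit."""
    return code_units(line, encoding) <= MAX_LINE_CODE_UNITS


# 200 precomposed e-acute characters (U+00E9) after a 7-character prefix.
line = "1 NOTE " + "\u00e9" * 200
print(code_units(line, "utf-16-le"), fits_on_line(line, "utf-16-le"))  # 207 True
print(code_units(line, "utf-8"), fits_on_line(line, "utf-8"))          # 407 False
```

The explicit `utf-16-le` and `utf-16-be` names are used on purpose: Python's plain `utf-16` codec prepends a byte order mark, which would add a code unit to every string measured.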

line terminator

Each GEDCOM line is terminated by a terminator. That is short for line terminator, also known as a newline.
Each GEDCOM line is terminated by a single newline. The GEDCOM specification allows a choice of line terminators; it may be either a Carriage Return (CR), a Line Feed (LF), or a Carriage Return / Line Feed combination (CR/LF). The GEDCOM specification even allows the Line Feed / Carriage Return combination (LF/CR).
It is common to talk of a newline as if it were a single character, even though it may consist of two characters.

Each GEDCOM line is terminated by a newline, and that is the only usage of newline within a GEDCOM file.
To be more precise: each GEDCOM line is terminated by a newline, and, regardless of what that newline consists of, neither the Carriage Return (CR) nor the Line Feed (LF) character may be used anywhere else.

CR, LF and CR/LF

That the GEDCOM specification allows a choice of line terminators isn't odd. The CR, LF and CR/LF line terminators are all quite common.
That the GEDCOM specification allows LF/CR in addition to CR, LF and CR/LF is odd. The LF/CR terminator does not occur in practice, and it is easy to mess up line-counting logic when you try to support LF/CR in addition to CR, LF and CR/LF.
Not all GEDCOM readers support LF/CR as a line terminator, and few editors display such files well.
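To see why LF/CR complicates line counting, consider a sketch of a splitter that accepts all four terminators the grammar allows. The function name is hypothetical; the approach, a regular expression that tries the two-character terminators first, is just one way to do it.

```python
import re

# Two-character terminators come first, so CR/LF and LF/CR each count as one.
_TERMINATOR_RE = re.compile(r"\r\n|\n\r|\r|\n")


def split_gedcom_lines(text: str) -> list:
    """Split text on CR, LF, CR/LF or LF/CR; the last terminator adds no line."""
    lines = _TERMINATOR_RE.split(text)
    if lines and lines[-1] == "":
        lines.pop()
    return lines


print(split_gedcom_lines("0 HEAD\r\n0 TRLR\r\n"))  # ['0 HEAD', '0 TRLR']
print(split_gedcom_lines("0 HEAD\n\r0 TRLR\n\r"))  # ['0 HEAD', '0 TRLR']

# Python's own splitlines() knows CR, LF and CR/LF, but not LF/CR,
# so it sees an empty line after every real one.
print("0 HEAD\n\r0 TRLR\n\r".splitlines())        # ['0 HEAD', '', '0 TRLR', '']

# Ambiguous once LF/CR is legal: CR/LF followed by CR, or CR followed by LF/CR?
print(split_gedcom_lines("A\r\n\rB"))              # ['A', '', 'B']
```

The last example is the real problem. Both readings yield two terminators here, but a reader that reports line numbers or terminator types has to pick one of them, and the specification gives no rule for doing so.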

unrestricted freedom of choice

The GEDCOM specification does not state a reason for having a choice of line terminators, does not provide a reason for supporting these particular line terminators, and does not provide any guidance on which one to use.

Another, more basic thing the GEDCOM specification fails to state is that you cannot mix and match line terminators within a single file. In fact, it states quite the opposite!
According to the GEDCOM syntax, each line ends with a terminator, and each terminator can be either CR, LF, CR/LF or LF/CR:

A gedcom_line has the following syntax:
gedcom_line:=
level + delim + [optional_xref_ID] + tag + [optional_line_value] + terminator

...

terminator:=
[carriage_return | line_feed | carriage_return + line_feed | line_feed + carriage_return ]

The FamilySearch GEDCOM grammar specifies that each line ends with a line terminator of your choice, and there is no accompanying text that restricts your choice of terminator in any way. There is no demand for consistency. Previous or subsequent choices do not factor into it.

According to FamilySearch's specification, it would be perfectly fine to end one line with a CR, another line with a LF, yet another line with CR/LF, even end some lines with LF/CR, and to keep mixing and matching at your whim.
Do not believe FamilySearch's specification. The unrestricted freedom of newline choices the GEDCOM 5.5.1 specification allows is unintended.
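Checking whether a file mixes line terminators takes only a few lines. The sketch below is an illustration with hypothetical names; it counts terminators by kind, treating CR/LF and LF/CR as single terminators.

```python
import re
from collections import Counter

_TERMINATOR_RE = re.compile(r"\r\n|\n\r|\r|\n")
_NAMES = {"\r": "CR", "\n": "LF", "\r\n": "CR/LF", "\n\r": "LF/CR"}


def terminator_counts(text: str) -> Counter:
    """Count how often each kind of line terminator occurs in text."""
    return Counter(_NAMES[m.group()] for m in _TERMINATOR_RE.finditer(text))


def uses_one_terminator(text: str) -> bool:
    """True if every line in text ends with the same terminator."""
    return len(terminator_counts(text)) <= 1


mixed = "0 HEAD\r\n1 CHAR UTF-8\n0 TRLR\r"
print(terminator_counts(mixed))
print(uses_one_terminator(mixed))                  # False
print(uses_one_terminator("0 HEAD\n0 TRLR\n"))     # True
```

When reading from disk in Python, open the file in binary mode or with `newline=""`; the default text mode translates every terminator to LF before your code gets to see it.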

specification in error

This is probably one of the most fundamental mistakes in FamilySearch's GEDCOM specification. While GEDCOM is commonly thought of as a text-based format, that is not how FamilySearch chose to define it. Chapter 1 defines GEDCOM as a database that consists of a series of records, and then goes on to define records as if they are some binary format. What's more, allowing the line terminator to vary from one line to the next arguably makes it a binary format…

GEDCOM isn't a binary format, but a text-based format.
The GEDCOM creators did not intend GEDCOM files to be processed by humans, but did intend GEDCOM files to be human-readable. The GEDCOM specification even reserves space in the GEDCOM header for several items, such as the vendor's address, that are of no use to a GEDCOM reader, but can be handy for a human reader inspecting a GEDCOM file.

FamilySearch's GEDCOM specification is in error. It is hopeless to try and argue that the specification is right because it is the specification. The unrestricted freedom in newline choices does not make any practical sense. No current GEDCOM writer intentionally creates files with varying line terminators. Few GEDCOM readers would handle it well.
All current GEDCOM readers expect the text file format that FamilySearch intended to specify. The implementations do not need fixing, the specification needs fixing.

de facto standard

I've written more than once that GEDCOM is the de facto industry standard, but that is not entirely accurate.
All current GEDCOM implementations were created by programmers who think of GEDCOM files as text files, and that shared thinking has become an industry fact, despite a specification that fails to define it that way. The shared understanding of the intended specification is the real industry standard; GEDCOM files are text files.
GEDCOM files consistently use the same line terminator for each line.

Any vendor that makes the mistake of creating a GEDCOM writer that deviates from this de facto standard is sure to have their product cursed by its users, simply because most text editors and GEDCOM readers cannot handle it, and few vendors are likely to update their product to support such an abomination.

conclusion

newline conventions

Classic Mac OS uses CR, UNIX and its derivatives use LF, MS-DOS and Windows use CR/LF. Mac OS X is a UNIX derivative, so it uses LF, not CR.
Cross-platform applications should use CR/LF.

GEDCOM files are text files. Historically, different platforms use different line terminators for text files, and that is why GEDCOM supports a choice of line terminators. The CR, LF and CR/LF line terminators are all used by different platforms; inclusion of LF/CR as a legal value is a mistake. GEDCOM writers should default to the newline convention for their system.
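A writer that defaults to the system convention and uses one terminator throughout could look like the sketch below. The function name and its parameters are hypothetical; `os.linesep` is Python's name for the newline convention of the platform it runs on.

```python
import os
from typing import Iterable


def write_gedcom(path: str, lines: Iterable[str],
                 terminator: str = os.linesep, encoding: str = "utf-8") -> None:
    """Write GEDCOM lines to path, ending every line with the same terminator."""
    if terminator not in ("\r", "\n", "\r\n"):
        raise ValueError("terminator must be CR, LF or CR/LF")
    lines = list(lines)
    for line in lines:
        # CR and LF may only occur as part of a line terminator.
        if "\r" in line or "\n" in line:
            raise ValueError(f"line contains CR or LF: {line!r}")
    # newline="" switches off Python's own newline translation on output.
    with open(path, "w", encoding=encoding, newline="") as f:
        for line in lines:
            f.write(line + terminator)
```

Calling `write_gedcom("out.ged", ["0 HEAD", "0 TRLR"])` uses the platform default; passing `terminator="\r\n"` forces CR/LF regardless of platform. The sketch deliberately refuses LF/CR.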

The GEDCOM 5.5.1 specification limits the size of logical GEDCOM records to 32K. That's a limitation from the era of 16-bit computing. It has been obsolete for years.

The GEDCOM specification contains the confused statement that all length constraints are given in characters instead of bytes. That is a mistake. What it meant to say is that length constraints are specified in code units; GEDCOM lines used to be limited to 255 bytes, but for UTF-16 GEDCOM files, the limitation is double that, 255 words, 510 bytes.

The GEDCOM specification should be updated to clearly state that GEDCOM files are text files. The specification should continue to allow a choice of line terminators, but only allow the three line terminators that are actually being used by various platforms; CR, LF and CR/LF. The LF/CR combination should be outlawed to avoid the complications it needlessly introduces.

The specification should clearly state that although there is a choice of line terminators, the choice can be made only once per GEDCOM file; each file should consistently use the same line terminator for each line.
By the way, a multi-volume GEDCOM file should be thought of as a single GEDCOM file split into parts; files that are part of a multi-volume GEDCOM file should all use the same line terminator.

Best Practice

GEDCOM writer

GEDCOM reader

GEDCOM validator

updates

2014-09-14 GEDCOM Magic

Updated links after split of GEDCOM Magic article into 0 HEAD Value and GEDCOM & FTW TEXT Magic.

2017-02-16: A WikiTree GEDCOM Problem

A WikiTree GEDCOM Problem discusses an odd UTF-8 GEDCOM violation: a character split in two.
