Modern Software Experience

2012-06-06

The GEDCOM X converter is spectacularly memory-inefficient, but it is the first look at the GEDCOM X file format that's the real shocker.

Quick Look

ZIP

The FamilySearch GEDCOM X blog, which has been silent for months, has just made not one, but three posts.
The first post tells us that they've decided not to use MIME as the basis for the GEDCOM X file format, but are now using the widely suggested ZIP file format instead. The second post informs us that they've introduced some specifications, and the third post informs us that they have a GEDCOM to GEDCOM X converter now.

Java project

The GEDCOM to GEDCOM X converter is an open source Java project on GitHub. A "Download the latest version of the utility" link on the GitHub project page leads to a download. The download is a so-called JAR file. You need to have Java installed to run it.

third-party GEDCOM reader

Remarkably, FamilySearch's converter does not use their own GEDCOM reader. They've promoted their GEDCOM specification for years, and probably have more than one GEDCOM reader.
The GEDCOM X project started as part of their internal Data Framework project. Little is known about that project, but it would not surprise anyone if it contained code to read GEDCOM files.
Yet the GEDCOM X converter uses a third-party GEDCOM reader. It relies on the open source GEDCOM parsing library that Dallan Quass introduced last year.

The GEDCOM X converter does use the GEDCOM X writer from the GEDCOM X project, but there is something strange going on there as well. The GEDCOM X writer was part of the overall GEDCOM X project, but is a separate project now. The code to read and write GEDCOM X files has been removed from the main GEDCOM X project.

GEDCOM writer

FamilySearch's GEDCOM X File Format library will both read and write GEDCOM X files. Dallan Quass' GEDCOM parser library will only read GEDCOM files.
Thus, these two building blocks provide the basis for a GEDCOM to GEDCOM X converter, but not for the complementary GEDCOM X to GEDCOM converter; you still need to bring your own GEDCOM writer.

initial release

The initial release has version number 0.1.0. The project page lists several limitations. Quite a few tags are not or at least not fully supported yet.
This initial release is not very robust either; the first few attempts to convert some GEDCOM files resulted in a NullPointerException. I did get it to work for the fan files.

DOS Box: GEDCOM X Converter

running the converter

The GEDCOM X converter isn't a GUI application, but a command-line application; you run it from the command line.
The screenshot shows the successful conversion of FAN4.GED to FAN4.ZIP in Windows. The converter isn't a Windows application, but a Java application, so the command invokes the Java interpreter to run the utility.
The gedcom5-conversion-0.1.0-full.jar file is a so-called executable JAR file; you don't need to unpack or install it, you can just pass it to the Java command. The utility expects you to specify both the input and the output file.

conversion speed

The FAN4.GED file is small, less than 2 kilobytes, and it contains just 15 individuals, yet the converter seemed to need more than a second.
To get a better idea of how much of that is Java start-up time, and how much is actual conversion time, I tried a few larger files. Conversion of FAN10.GED, a file containing 1.023 individuals, took about 3 seconds. Conversion of FAN16.GED, a file containing 65.535 individuals, took 45 seconds.
That is a conversion speed of some 1.456 individuals per second. That isn't outstanding, but it's certainly passable performance for a first release; it will vary a bit with hardware, but a medium-size genealogy is likely to convert within a minute.
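The arithmetic behind these figures is easy to check. The following Python sketch uses only the counts and timings quoted above; the timings are rough, hand-measured values, so the results are indicative at best.

```python
# Individuals and approximate conversion times quoted in the text.
# The times are rough, hand-measured values, not benchmarks.
runs = {
    "FAN10.GED": (1_023, 3.0),
    "FAN16.GED": (65_535, 45.0),
}

def individuals_per_second(individuals, seconds):
    """Return the average conversion rate, including Java start-up time."""
    return individuals / seconds

for name, (individuals, seconds) in runs.items():
    rate = individuals_per_second(individuals, seconds)
    print(f"{name}: about {rate:.0f} individuals per second")
```

The much lower rate for FAN10.GED suggests that a fixed start-up cost dominates for small files, which is why the FAN16.GED figure is the more meaningful one.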

memory-inefficient

What does worry me is how memory-inefficient the FamilySearch converter is. During conversion of FAN16.GED, a file of 12.416.822 bytes, the Windows Task Manager showed that java.exe, the Java interpreter, needed 898.536 KB. That is more than 74 times the file size! That's incredibly memory-inefficient.
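As a quick check of that ratio, here is a small Python sketch. It assumes that Task Manager's KB means 1.024 bytes, and it uses the file size quoted in the paragraph above.

```python
# Figures quoted in the text. Assumption: Task Manager's "KB" is 1,024 bytes.
memory_kb = 898_536
gedcom_bytes = 12_416_822

def memory_ratio(memory_kb, file_bytes):
    """Return process memory divided by input file size."""
    return memory_kb * 1024 / file_bytes

print(f"{memory_ratio(memory_kb, gedcom_bytes):.1f} times the file size")
```

The figure is for the whole java.exe process, so it includes the Java runtime itself and not just the converter's data.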

DOS Box: GEDCOM X FAN4 Folder properties

conversion result

The result of converting from GEDCOM to GEDCOM X is a ZIP file. The screenshot not only shows that the converter successfully created FAN4.ZIP, it also shows that FAN4.ZIP is a lot larger than the FAN4.GED file.
The FAN4.GED file is 1.897 bytes and the FAN4.ZIP file is 16.239 bytes; that is more than 8½ times as large.

That really is bad. You see, GEDCOM is a rather inefficient format already, and it is the ZIP file that is more than 8 times as large. A ZIP file isn't a plain file; it is a compressed archive. A ZIP viewer tells me that the archive compressed its contents to 42 % of their original size, from 26.829 bytes to 11.379 bytes. The rest of the 16.239 bytes is overhead of the ZIP format.
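For FAN4.ZIP, those figures leave 16.239 − 11.379 = 4.860 bytes of container overhead. To see where such overhead comes from, the following Python sketch builds an archive of many small files in memory and compares the archive size with the size of the compressed data alone. The file names and contents are hypothetical stand-ins, not actual GEDCOM X output.

```python
import io
import zipfile

def zip_overhead(files):
    """Build a ZIP archive in memory and return (archive_size, compressed_data_size).

    `files` maps archive paths to bytes. The difference between the two
    numbers is container overhead: per-entry headers, stored file names,
    and the central directory.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, data in files.items():
            archive.writestr(path, data)
        compressed = sum(info.compress_size for info in archive.infolist())
    # Measure after the archive is closed, so the central directory is included.
    return len(buffer.getvalue()), compressed

# Hypothetical stand-in content: 38 small, similar files, as many as FAN4.ZIP holds.
files = {
    f"persons/{i}.xml": f"<person id='{i}'><name>Person {i}</name></person>".encode("utf-8")
    for i in range(38)
}
archive_size, compressed = zip_overhead(files)
overhead = archive_size - compressed
print(f"archive {archive_size} bytes, data {compressed} bytes, "
      f"overhead {overhead} bytes ({overhead / len(files):.0f} per file)")
```

Every entry costs a local header plus a central directory record, and each of those stores the file name again, so an archive of many tiny files pays this price once per file.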

However, don't think that the FAN4.ZIP file, once extracted into a directory, takes up only 26 KB. The FAN4.ZIP file contains 38 files in 4 folders. If you extract that to a folder, then that folder will take up a lot more disk space. The exact amount of disk space used depends on such things as your hard disk cluster size.
On my 1 TB hard disk, the resulting FAN4 folder takes up 155.648 bytes (152 KB) on disk. The FAN4.GED file uses 4.096 bytes (4 KB) of disk space. So, extracted, the GEDCOM X files take up 38 times as much disk space.
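The effect of cluster size can be sketched in a few lines of Python. This is a simplified model that assumes a 4.096-byte cluster, which matches the disk space reported for FAN4.GED, and assumes every file occupies whole clusters. The individual file sizes inside the archive are not listed here, so the sketch splits the total evenly, which is a hypothetical choice.

```python
import math

def size_on_disk(file_sizes, cluster_size=4096):
    """Estimate disk usage when every file occupies whole clusters.

    Simplification: ignores directory entries and file systems that store
    very small files inline, so real numbers can differ.
    """
    return sum(math.ceil(size / cluster_size) * cluster_size for size in file_sizes)

# The FAN4 archive holds 38 files totalling 26,829 bytes. The individual sizes
# are not given in the article, so assume (hypothetically) an even split.
file_count = 38
assumed_sizes = [26_829 // file_count] * file_count

print(size_on_disk(assumed_sizes))   # 155648: 38 clusters of 4,096 bytes
print(size_on_disk([1_897]))         # 4096: the single FAN4.GED file
```

Under these assumptions, the reported 155.648 bytes is exactly 38 clusters, which is consistent with each of the 38 small files occupying one cluster of its own.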

file     GEDCOM (bytes)     files    total (bytes)    ZIPped (bytes)    extracted (bytes on disk)
FAN4              1.897        38           26.829            16.239                      155.648
FAN10           147.870     2.558        1.908.061         1.144.191                   10.649.656
FAN16        12.416.622   163.838      126.400.645        76.378.317                  685.735.936
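To compare the rows, it helps to express each GEDCOM X size as a multiple of the GEDCOM file size. The Python sketch below does that with the byte counts from the table above. It divides by the GEDCOM file size in bytes, not by the GEDCOM file's size on disk, so the extracted factor is higher than a disk-space to disk-space comparison would give.

```python
# Byte counts from the table above: (GEDCOM, total unpacked, ZIPped, extracted on disk).
sizes = {
    "FAN4":  (1_897, 26_829, 16_239, 155_648),
    "FAN10": (147_870, 1_908_061, 1_144_191, 10_649_656),
    "FAN16": (12_416_622, 126_400_645, 76_378_317, 685_735_936),
}

def growth_factors(gedcom, total, zipped, extracted):
    """Return each GEDCOM X size as a multiple of the GEDCOM file size."""
    return total / gedcom, zipped / gedcom, extracted / gedcom

for name, row in sizes.items():
    total_x, zipped_x, extracted_x = growth_factors(*row)
    print(f"{name}: total {total_x:.1f}x, ZIPped {zipped_x:.1f}x, "
          f"extracted {extracted_x:.1f}x")
```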

ZIPped

Never mind that the numbers are even worse for larger GEDCOM files. The only conclusion that can be drawn here is that the GEDCOM X file format is not meant to be extracted, and that you really shouldn't do so. Still, even the ZIP archive itself is roughly 6 to 8½ times the size of the original GEDCOM file, and the files inside it add up to 10 to 14 times that size.

large collection of small files

A GEDCOM X file is a ZIP file containing a large collection of small files in several folders.
The META-INF folder contains the MANIFEST.MF file. The contributors folder contains the GEDCOM contributor (SUBM tag). The persons folder contains one file for each individual, and the relationships folder contains one file for each relationship.
If you have a medium-size GEDCOM file containing 25.000 individuals connected through 25.000 relationships, the corresponding GEDCOM X file is a ZIP archive containing 50.000 separate XML files.

DOS Box: GEDCOM X relationships

The GEDCOM X file format explodes a small GEDCOM file into a ZIPped collection of files that takes up a lot more space.

The GEDCOM X file format is not merely spectacularly inefficient, it is stunningly ridiculous. The GEDCOM X file format makes GEDCOM look good.

standards-based

The statement that GEDCOM X uses industry standards such as XML and ZIP sounds nice, yet the actual GEDCOM X file format is not likely to make you happy. When you see the actual GEDCOM X file format, you'll have to make a conscious effort to stop yourself from laughing and crying at the same time.

Each of the individual and relationship files is an XML file, complete with XML declaration. Each of these files repeats the information that it is XML version 1.0. Each of these files repeats the information that the character encoding is UTF-8. Each of these files repeats the namespaces used…
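A rough sense of what that repetition costs, before compression, can be had from the Python sketch below. It assumes a minimal XML declaration; the exact declaration the converter writes is not quoted here and may be longer. The repeated namespace declarations are not counted at all.

```python
# Assumption: a minimal XML declaration; the converter's actual output may
# differ (for example by adding a standalone attribute).
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'

def repeated_declaration_bytes(file_count, declaration=XML_DECLARATION):
    """Return the uncompressed bytes spent repeating the declaration in every file."""
    return file_count * len(declaration.encode("utf-8"))

# 50,000 XML files, as in the medium-size example above.
print(repeated_declaration_bytes(50_000))   # 1900000
```

A single XML document would carry this declaration once; 50.000 separate files carry it 50.000 times.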


updates

2013-05-31: GEDCOM X Converter 0.2

The observations in this article have prompted changes in the GEDCOM X File format. See GEDCOM X Converter 0.2: Changes.

links