It is a little-known and underpublished fact that FamilySearch's GEDCOM X project is arguably five years old already. How old GEDCOM X is exactly depends on what you consider the start of the project. What is clear is that the GEDCOM X Converter version 0.1 was released on 2012 June 5, about a year ago. The release of that first GEDCOM to GEDCOM X converter enabled a first look at the GEDCOM X file format. The article GEDCOM X Converter provided that first look at both the converter and the file format.
That first converter was not very robust; it crashed easily and was spectacularly memory-hungry. The GEDCOM X Converter 0.2 has not crashed on me yet, is memory-hungry but not as bad as version 0.1, and its conversion speed is no reason for complaint either.
Since becoming public in 2011, the GEDCOM X file format had already changed once, to use ZIP instead of MIME.
The first look at this already improved GEDCOM X file format prompted some incredulous observations.
The GEDCOM X file created for FAN4.GED turned out to be more than 8½ times as large as the GEDCOM file.
Now, the GEDCOM X file format is a zipped format, and unzipped, the data within the ZIP already totalled more than 14 times the size of the GEDCOM file,
but because it is in multiple files, it actually takes up 38 times as much disk space.
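The gap between 14 times and 38 times is what you would expect from the way file systems allocate space: every file occupies a whole number of allocation units, so many tiny files waste far more space than one file of the same total size. A minimal sketch of that effect, assuming a hypothetical allocation unit of 4096 bytes (the real value depends on the file system, and the sizes used here are illustrative, not measurements of FAN4.GED):

```python
import math

# Assumption: 4096-byte allocation units; real file systems vary.
CLUSTER_SIZE = 4096

def size_on_disk(file_sizes, cluster_size=CLUSTER_SIZE):
    """Total disk space used when each file is rounded up to whole clusters."""
    return sum(math.ceil(size / cluster_size) * cluster_size for size in file_sizes)

# Illustrative sizes only, not taken from the actual conversion.
one_file = [1850]
many_files = [300] * 37

print(size_on_disk(one_file))    # a single cluster
print(size_on_disk(many_files))  # one cluster for each of the 37 tiny files
```

Under these assumptions, 37 files of a few hundred bytes each occupy 37 clusters, no matter how little data they actually hold.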
A very surprising, almost unbelievable observation, was that the GEDCOM X file format not only
explodes a single GEDCOM file into a large collection of files,
but that each of these files is an XML file, complete with XML declaration.
Each of these files repeats the information that it is XML version 1.0.
Each of these files repeats the information that the character encoding is UTF-8.
Each of these files repeats all the namespaces used…
Most of the GEDCOM X file format wasn't data, but mindlessly repeated superfluous boilerplate, repeated ad nauseam.
That makes a spectacularly inefficient and stunningly ridiculous file format.
One year on, the GEDCOM X Converter version 0.2 is available for download. The GEDCOM X file format has been changed in response to the observations made.
Version 0.2 of the GEDCOM X Converter is still a Java app, a so-called executable JAR file to be precise. There is still no GUI; it still needs to be run from the command line. That is a bit cumbersome, but it is merely version 0.2, and it isn't some end-user tool, but merely a utility for developers.
The screenshot shows the result of converting FAN4.GED.
Whereas the GEDCOM X Converter version 0.1 converted FAN4.GED to FAN4.ZIP, version 0.2 converts FAN4.GED to fan4.gedx.
Version 0.2 not only uses another file extension than version 0.1, it also downcases the filename, presumably to match the lowercase file extension.
It is perfectly fine to use a lowercase file extension, but the converter should respect the casing of the base name.
The file extension used is *.gedx now, but it is still a ZIP file.
You can easily look inside the *.gedx by renaming it to *.zip.
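Renaming is only needed for tools that go by the file extension. As a small sketch, Python's standard zipfile module looks at the file contents and will list the entries of a *.gedx directly. The demo archive below is a stand-in built on the spot; only the entry names persons/I2 and contributors/SUB1 are taken from the converter output, and their contents are placeholders:

```python
import os
import tempfile
import zipfile

def list_gedx_entries(path):
    """Return (name, uncompressed size) for every entry in a ZIP-based file."""
    # zipfile inspects the file contents, not the file extension.
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

# Build a stand-in archive with a .gedx extension and list what is inside.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.gedx")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("persons/I2", "<person/>")
        archive.writestr("contributors/SUB1", "placeholder")
    entries = list_gedx_entries(demo_path)

for name, size in entries:
    print(size, name)
```

Pointing list_gedx_entries at a real fan4.gedx should list the files the converter created, together with their uncompressed sizes.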
The GEDCOM X Converter 0.1 produced a ZIP file of 16.239 bytes; version 0.2 produces a ZIP file of 11.554 bytes.
That's still considerably bigger than the GEDCOM file, but it is an improvement.
The fan4.gedx produced by version 0.2 is almost 29 % smaller than the FAN4.ZIP produced by version 0.1.
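The percentage follows directly from the two file sizes given above. A quick check, using nothing but those two numbers:

```python
# Sizes in bytes as reported in the article.
size_v01 = 16239  # FAN4.ZIP produced by converter 0.1
size_v02 = 11554  # fan4.gedx produced by converter 0.2

saved = size_v01 - size_v02
reduction = 100 * saved / size_v01

print(f"{saved} bytes saved, {reduction:.2f} % smaller")
# prints: 4685 bytes saved, 28.85 % smaller
```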
So, what changed?
The GEDCOM X file format has changed.
It is normal for a file format to change over time, to accommodate new or changed application requirements.
Moreover, if a file format has to change, it's best that it change before anyone is using it.
GEDCOM X still isn't at version 1.0, FamilySearch does not seem to be using it themselves, and no third party has implemented it.
Changing the GEDCOM X file format is a perfectly reasonable thing to do that does not affect any real world user or developer.
The version 0.2 file is smaller than the version 0.1 file. The ZIP files created by 0.1 and 0.2 contain the same subdirectories and the same files. The difference is in the size of the files those directories contain.
A GEDCOM file has a GEDCOM header.
A GEDCOM X file is a ZIP file, so it has a ZIP header, but it does not have a GEDCOM X header.
A GEDCOM X file contains a MANIFEST.MF file instead.
The MANIFEST.MF file created by converter 0.1 is 3.058 bytes, about 3 KB.
The MANIFEST.MF file created by version 0.2 is only 70 bytes.
The MANIFEST.MF file used to mention every other file within the ZIP file, together with the so-called Content-Type for that file.
Now it merely refers to the contributor (submitter).
All mention of the other 36 files within the ZIP file is gone.
FAN4.GED header:
0 HEAD
1 SOUR GEDFAN
2 NAME The GEDCOM Fan Creator
2 VERS 0.3.2.0
2 CORP Tamura Jones
3 ADDR
4 ADR1 Modern Software Experience
4 WWW https://www.tamurajones.net
1 DEST GEDFAN
1 DATE 26 May 2013
2 TIME 15:27:01
1 SUBM @U@
1 FILE FAN4.GED
1 COPR Copyright (c) 2011 Tamura Jones
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
1 CHAR ASCII
1 LANG English

MANIFEST.MF, version 0.1:
Manifest-Version: 1.0

Name: relationships/F1-I2-I3
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F3-I7-I3
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F5-I11-I5
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F2-I4-I5
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F6-I12-I6
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F3-I6-I3
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: contributors/SUB1
Content-Type: application/x-gedcom-conclusion-v1+xml
DC-creator: true

Name: relationships/F4-I8-I4
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I5
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I4
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I7
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I6
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I1
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I3
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I2
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F7-I14-I15
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F1-I2-I1
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F7-I14-I7
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F2-I5-I2
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F7-I15-I7
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F2-I4-I2
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I10
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F3-I6-I7
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F4-I8-I9
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F6-I12-I13
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F5-I10-I11
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I9
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F5-I10-I5
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I8
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F1-I3-I1
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I15
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I14
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F6-I13-I6
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I13
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I12
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: relationships/F4-I9-I4
Content-Type: application/x-gedcom-conclusion-v1+xml

Name: persons/I11
Content-Type: application/x-gedcom-conclusion-v1+xml

MANIFEST.MF, version 0.2:
Manifest-Version: 1.0

Name: contributors/SUB1
DC-creator: true
The difference between the version 0.1 and 0.2 MANIFEST.MF file is considerable.
Now that the MANIFEST.MF file does not mention all the other files contained within the ZIP file anymore, there's hardly anything left.
It makes you wonder why all the files were mentioned in the first place.
It may make you wonder whether the one remaining mention, of the file contributors/SUB1, is necessary,
and - if it isn't - why the MANIFEST.MF file exists at all.
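To see how little the version 0.2 manifest still says, here is a minimal sketch of a reader for this style of manifest: blocks of "key: value" lines separated by blank lines. It is a simplification that ignores continuation lines, and the CRLF line endings in the sample are an assumption on my part:

```python
def parse_manifest(text):
    """Split manifest text into a list of sections, each a dict of attributes."""
    sections = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current section.
            if current:
                sections.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        sections.append(current)
    return sections

# The version 0.2 manifest shown above; CRLF line endings are an assumption.
MANIFEST_V02 = (
    "Manifest-Version: 1.0\r\n"
    "\r\n"
    "Name: contributors/SUB1\r\n"
    "DC-creator: true\r\n"
    "\r\n"
)

main, *entries = parse_manifest(MANIFEST_V02)
print(main)
print(entries)
print(len(MANIFEST_V02.encode("ascii")))
```

With those line endings the sample comes to exactly 70 bytes, which agrees with the reported size of the version 0.2 file. The one remaining entry carries a single attribute, DC-creator, flagging contributors/SUB1 as the creator.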
A GEDCOM file has at least one submitter.
The FAN4.GED file has one submitter record, and the GEDCOM X converter creates the contributors/SUB1 file for it.
The GEDCOM X Converter 0.1 creates a file of 580 bytes, while version 0.2 creates a file of 282 bytes; that's less than half the size.
Now, less than half the size is great, but there is just one submitter record, and it was small to begin with, so it still doesn't matter much.
The difference between version 0.1 and 0.2 is the same as for the person and relationship records, as discussed below.
It is the many person and relationship records that make the real size difference between version 0.1 and 0.2.
The ZIP file contains a directory persons.
That directory contains a single file for each individual.
These files are named using the same naming convention as commonly used for individual records in a GEDCOM file.
Record I2 in the GEDCOM file corresponds to file persons/I2.
GEDCOM record I2:
0 @I2@ INDI
1 NAME /Two/
2 SURN Two
1 SEX M

persons/I2, version 0.1:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:person xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="I2">
  <gxc:gender>
    <rdf:type rdf:resource="http://gedcomx.org/Male"/>
  </gxc:gender>
  <gxc:name>
    <gxc:preferred>true</gxc:preferred>
    <gxc:primaryForm>
      <gxc:fullText>Two</gxc:fullText>
      <gxc:part>
        <rdf:type rdf:resource="http://gedcomx.org/Surname"/>
        <gxc:text>Two</gxc:text>
      </gxc:part>
    </gxc:primaryForm>
  </gxc:name>
</gxc:person>

persons/I2, version 0.2:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person xmlns="http://gedcomx.org/v1/" id="I2">
  <gender type="http://gedcomx.org/Male"/>
  <name>
    <preferred>true</preferred>
    <nameForm>
      <fullText>Two</fullText>
      <part type="http://gedcomx.org/Surname" value="Two"/>
    </nameForm>
  </name>
</person>
The version 0.2 file isn't the same as the version 0.1 file, but not wildly different either.
It is similar, but smaller.
Technically, it is not just similar, but the same, just without the superfluous inclusion of half a dozen namespaces.
It includes just one namespace now, http://gedcomx.org/v1/, and all the rest is implied by that inclusion.
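Reading such a file only requires knowing that one namespace. A minimal sketch using Python's standard xml.etree.ElementTree, with the version 0.2 persons/I2 content copied from the sample above; the indentation inside the string and the prefix gx are my own choices, not something the format prescribes:

```python
import xml.etree.ElementTree as ET

# The prefix "gx" is a local choice; only the namespace URI matters.
GEDCOMX_NS = {"gx": "http://gedcomx.org/v1/"}

PERSON_I2 = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<person xmlns="http://gedcomx.org/v1/" id="I2">
  <gender type="http://gedcomx.org/Male"/>
  <name>
    <preferred>true</preferred>
    <nameForm>
      <fullText>Two</fullText>
      <part type="http://gedcomx.org/Surname" value="Two"/>
    </nameForm>
  </name>
</person>"""

person = ET.fromstring(PERSON_I2)
gender = person.find("gx:gender", GEDCOMX_NS).get("type")
full_text = person.find("gx:name/gx:nameForm/gx:fullText", GEDCOMX_NS).text

print(person.tag)
print(person.get("id"), gender, full_text)
```

Every element lands in the single default namespace, while the attributes id, type and value carry no namespace at all.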
The version 0.2 persons/I2
file is smaller and easier to understand than the version 0.1 file,
but still seems rather large and complex when compared with GEDCOM.
This is partly because it uses XML syntax, which is somewhat verbose,
but also because it still contains a rather large amount of superfluous noise.
All 15 files in the persons directory repeat the same header,
stating it is XML version 1.0, using UTF-8 encoding, and the http://gedcomx.org/v1/ namespace.
That is a waste of bytes, as all that is already implied by it being a GEDCOM X version 0.2 file,
and even if it were necessary to state it, it certainly isn't necessary to state it in every separate file.
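A rough way to put a number on that repetition is to count the bytes of the XML declaration and the namespace attribute, and multiply by the number of files. The sketch below does just that; it counts uncompressed bytes, ignores line breaks, and says nothing about how well ZIP compression hides the cost:

```python
# Text repeated at the top of every file, copied from the version 0.2 sample.
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
NAMESPACE_ATTRIBUTE = ' xmlns="http://gedcomx.org/v1/"'

def repeated_bytes(file_count):
    """Uncompressed bytes spent on the repeated header across file_count files."""
    per_file = len((XML_DECLARATION + NAMESPACE_ATTRIBUTE).encode("utf-8"))
    return per_file * file_count

print(repeated_bytes(1))   # header bytes in a single file
print(repeated_bytes(15))  # across the 15 files in the persons directory
```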
A genealogy consists of individuals and their relationships. The basic genealogical relationship is that of a child to its parents, but it is common to include partnerships as well. In the fan file, Two has three relationships. Two is the child of Four and Five, Two is a partner of Three, and together they have child One. As GEDCOM requires the partnership between Four and Five to be expressed to declare Two as their child, the relationship between Four and Five is included below.
0 @I1@ INDI
1 FAMC @F1@
..
0 @I2@ INDI
1 FAMS @F1@
1 FAMC @F2@
...
0 @I3@ INDI
1 FAMS @F1@
...
0 @I4@ INDI
1 FAMS @F2@
...
0 @I5@ INDI
1 FAMS @F2@
..
0 @F1@ FAM
1 HUSB @I2@
1 WIFE @I3@
1 CHIL @I1@
...
0 @F2@ FAM
1 HUSB @I4@
1 WIFE @I5@
1 CHIL @I2@
Relationship files, version 0.1:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="F1-I2-I1">
  <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>
  <gxc:person1 rdf:resource="persons/I2"/>
  <gxc:person2 rdf:resource="persons/I1"/>
</gxc:relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="F1-I2-I3">
  <rdf:type rdf:resource="http://gedcomx.org/Couple"/>
  <gxc:person1 rdf:resource="persons/I2"/>
  <gxc:person2 rdf:resource="persons/I3"/>
</gxc:relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="F2-I4-I2">
  <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>
  <gxc:person1 rdf:resource="persons/I4"/>
  <gxc:person2 rdf:resource="persons/I2"/>
</gxc:relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="F2-I4-I5">
  <rdf:type rdf:resource="http://gedcomx.org/Couple"/>
  <gxc:person1 rdf:resource="persons/I4"/>
  <gxc:person2 rdf:resource="persons/I5"/>
</gxc:relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
    xmlns:ns4="http://purl.org/dc/terms/"
    xmlns:gx="http://gedcomx.org/"
    xmlns:gxc="http://gedcomx.org/conclusion/v1/"
    rdf:ID="F2-I5-I2">
  <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>
  <gxc:person1 rdf:resource="persons/I5"/>
  <gxc:person2 rdf:resource="persons/I2"/>
</gxc:relationship>
Relationship files, version 0.2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<relationship xmlns="http://gedcomx.org/v1/" type="http://gedcomx.org/ParentChild" id="F1-I2-I1">
  <person1 resource="persons/I2"/>
  <person2 resource="persons/I1"/>
</relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<relationship xmlns="http://gedcomx.org/v1/" type="http://gedcomx.org/Couple" id="F1-I2-I3">
  <person1 resource="persons/I2"/>
  <person2 resource="persons/I3"/>
</relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<relationship xmlns="http://gedcomx.org/v1/" type="http://gedcomx.org/ParentChild" id="F2-I4-I2">
  <person1 resource="persons/I4"/>
  <person2 resource="persons/I2"/>
</relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<relationship xmlns="http://gedcomx.org/v1/" type="http://gedcomx.org/Couple" id="F2-I4-I5">
  <person1 resource="persons/I4"/>
  <person2 resource="persons/I5"/>
</relationship>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<relationship xmlns="http://gedcomx.org/v1/" type="http://gedcomx.org/ParentChild" id="F2-I5-I2">
  <person1 resource="persons/I5"/>
  <person2 resource="persons/I2"/>
</relationship>
Just as with the individuals, expressing relationships in GEDCOM X is more verbose than in GEDCOM, but the version 0.2 expression is less verbose than the version 0.1 expression, because it no longer mentions half a dozen namespaces before describing the relationship itself.
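Judging by the samples, each GEDCOM FAM record turns into one Couple relationship plus one ParentChild relationship for every parent and child pair, with files named after the family, the first person and the second person. A sketch of that mapping, inferred from the file names in the samples and not from the converter's source code; the function name is made up for illustration:

```python
def relationships_for_family(family_id, husband, wife, children):
    """Return (file name, relationship type) pairs for one GEDCOM FAM record."""
    result = []
    if husband and wife:
        result.append((f"{family_id}-{husband}-{wife}", "Couple"))
    for child in children:
        for parent in (husband, wife):
            if parent:
                result.append((f"{family_id}-{parent}-{child}", "ParentChild"))
    return result

# Families F1 and F2 from the GEDCOM excerpt.
for record in relationships_for_family("F1", "I2", "I3", ["I1"]):
    print(record)
for record in relationships_for_family("F2", "I4", "I5", ["I2"]):
    print(record)
```

For a fan file, where every family has two parents and one child, this yields three relationship files per FAM record.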
What's particularly remarkable about both the version 0.1 and 0.2 file format is that the filename provides key relationship information already. However, it appears to be the intention that GEDCOM X readers essentially ignore the file name, and rely exclusively on the information contained within the file.
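To illustrate how much the name alone reveals, here is a small sketch that splits a relationship file name into its parts. The helper and its field names are hypothetical, and it assumes that identifiers never contain a hyphen themselves:

```python
from typing import NamedTuple

class RelationshipName(NamedTuple):
    family: str
    person1: str
    person2: str

def parse_relationship_name(path):
    """Split a name like 'relationships/F2-I4-I5' into its three identifiers."""
    base = path.rsplit("/", 1)[-1]
    family, person1, person2 = base.split("-")
    return RelationshipName(family, person1, person2)

parsed = parse_relationship_name("relationships/F2-I4-I5")
print(parsed)
```

What the name does not reveal is the relationship type: F2-I4-I5 is a Couple and F2-I4-I2 a ParentChild, and only the file content says which is which.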
FamilySearch has changed the file format in response to the brief high-level analysis of the original file format presented in GEDCOM X Converter.
The changes are an improvement, but at the same time, those improvements are so simple, so straightforward, so trivial really,
that you have to wonder why it took FamilySearch almost an entire year to make them.
They merely removed a lot of superfluous fluff that shouldn't have been there in the first place.
Removing the fluff noticeably decreased the file size, but there are no other improvements, and there is still plenty of superfluous fluff left.
The GEDCOM X programmers replaced half a dozen namespaces with a single namespace, and that resulted in a considerable size improvement.
However, the GEDCOM X file format still explodes a single GEDCOM file into a large collection of files,
and each of these files is still a full XML file, complete with XML declaration.
Each of these files repeats the information that it is XML version 1.0,
repeats the information that the character encoding is UTF-8,
repeats the namespace used…
Do not underestimate how much such superfluous repetition bloats the file.
A medium size GEDCOM file like FAN15.GED, of 32.767 individuals and 16.383 couples, becomes a ZIP containing 81.918 files,
and of all those files, the MANIFEST.MF file is the only file that does not repeat it.
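The file count can be reproduced with a little arithmetic. The breakdown below is my reading of the FAN4 output (one file per person, three relationship files per couple, one contributor file and the manifest), and it assumes a fan file is a complete ancestor chart in which every couple has exactly one child:

```python
def gedx_file_count(generations):
    """Estimate the files in the ZIP for a fan file of the given depth."""
    persons = 2 ** generations - 1          # complete ancestor fan
    couples = 2 ** (generations - 1) - 1    # every couple has exactly one child
    relationships = couples * 3             # one Couple plus two ParentChild
    contributors = 1
    manifest = 1
    return persons + relationships + contributors + manifest

print(gedx_file_count(4))   # FAN4.GED
print(gedx_file_count(15))  # FAN15.GED
```

For 15 generations this gives 32767 persons and 16383 couples, and a total of 81918 files, matching the numbers above.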
I'm of more than two minds regarding the changes made to the file format.
On the one hand, it's good to see FamilySearch respond to public analysis by making changes, albeit slowly.
On the other hand, they merely removed superfluous fluff that should not have been in there in the first place.
There's still plenty of fluff left to remove, and there are no conceptual changes;
a GEDCOM X reader that ignored the superfluous stuff would not even notice that the format had changed.
On the gripping hand, less trivial changes are needed to address less obvious, deeper shortcomings.
The GEDCOM X and GEDCOM X Converter version numbers have been changed to 1.0.0 M1.
Copyright © Tamura Jones. All Rights reserved.