Modern Software Experience

2008-08-09

huge GEDCOM files

I recently came across two huge GEDCOM files, which I will call the ITIS GEDCOM and the LIFE GEDCOM. I do not use the word huge lightly. The LIFE GEDCOM is the largest GEDCOM I’ve seen so far. Have a look at the numbers below.

the files

file                  ITIS GEDCOM              LIFE GEDCOM
exact size (bytes)    95.711.087               595.039.769
individual records    472.676                  2.080.787
family records        65.799                   225.673
GEDCOM version        5.5                      5.5
character encoding    ASCII                    ASCII
program               PAF 5.2.18.0 (2002)      PAF 5.2.18.0 (2002)
bytes per INDI        202,49                   285,97

Those numbers are not a mistake. The ITIS GEDCOM has fewer than half a million INDI records, and is just under 100 MB in size. The LIFE GEDCOM has more than 2 million INDI records, and the file is more than half a gigabyte in size.
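The bytes per INDI figures are simply the exact file size divided by the number of individual records. A minimal sketch of that arithmetic, using the numbers from the table; the function name bytes_per_indi is just an illustrative choice:

```python
def bytes_per_indi(file_size_bytes: int, indi_count: int) -> float:
    """Average number of file bytes per individual (INDI) record."""
    return file_size_bytes / indi_count


# Sizes and record counts as listed in the table.
itis = bytes_per_indi(95_711_087, 472_676)
life = bytes_per_indi(595_039_769, 2_080_787)

print(f"ITIS: {itis:.2f} bytes per INDI")
print(f"LIFE: {life:.2f} bytes per INDI")
```

Note that this is an average over the whole file, so it includes the bytes taken up by the header and the family records.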

uncommon GEDCOM

These are not common GEDCOM files. All the records are male. There are no females; all marriage records list a father and an unknown mother. All children are male. There are no sources. There are no repositories. There are no birth or death events. The marriage events do not have a date or place. It is just a lot of records in a hierarchical relation. The records have only been marked male because PAF does not let you record children until you assign a gender.

ITIS GEDCOM

These are not family trees. The ITIS GEDCOM is the ITIS species database in GEDCOM format. ITIS used to be an acronym for Interagency Taxonomic Information System, and is now an acronym for Integrated Taxonomic Information System. ITIS is a partnership of U.S. federal, Canadian and Mexican agencies. The ITIS database is a database of species names and their hierarchical classification.

LIFE GEDCOM

The LIFE GEDCOM is the Catalogue of Life in GEDCOM format. The Catalogue of Life is a database produced by the Species2000 project. This is an ongoing project to document all 1¾ million known species by 2011. The database has about 1,1 million species right now, which worked out to about 2 million records in the GEDCOM database.

GEDCOM files

The two projects do not offer their databases in GEDCOM format. Paul Pruitt, a biologist, had these two databases converted to GEDCOM format, and now offers them for download in the files section of the Google Group "Famous Family Trees". There are almost a hundred GEDCOM files there, for real families, literary families, mythological families, and corporations.

test files

I am not sure why Mr Pruitt wanted these files converted to GEDCOM, but I do know what I am going to use them for. The ITIS and LIFE GEDCOMs make great test files for the import capabilities of the best genealogy software. To process this many records in a reasonable time, the import code has to be efficient, and to import files this big, it has to be frugal with memory. In short, the GEDCOM import must make efficient use of both CPU cycles and memory.
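To illustrate what frugal with memory can mean in practice, here is a minimal sketch that counts the records in a GEDCOM file by reading it one line at a time, so memory use does not grow with file size. It is not how any particular genealogy program imports GEDCOM; the function names and the sample lines are made up for this example. It assumes the usual GEDCOM layout in which every record starts on a level 0 line, such as "0 @I1@ INDI" or "0 HEAD".

```python
from collections import Counter
from typing import Iterable


def count_records(lines: Iterable[str]) -> Counter:
    """Count level 0 records per tag, holding only one line in memory at a time."""
    counts: Counter = Counter()
    for line in lines:
        # Strip whitespace and a possible byte order mark on the first line.
        parts = line.strip().lstrip("\ufeff").split(maxsplit=2)
        if len(parts) < 2 or parts[0] != "0":
            continue  # not the start of a record
        if parts[1].startswith("@") and len(parts) > 2:
            # "0 @I1@ INDI": the tag follows the cross-reference id.
            tag = parts[2].split()[0]
        else:
            # "0 HEAD" or "0 TRLR": the tag follows the level directly.
            tag = parts[1]
        counts[tag] += 1
    return counts


def count_file(path: str) -> Counter:
    """Stream a GEDCOM file from disk; a file object yields one line at a time."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return count_records(handle)


# Tiny made-up sample in the style of these files: male-only records, no events.
sample = [
    "0 HEAD",
    "1 GEDC",
    "2 VERS 5.5",
    "0 @I1@ INDI",
    "1 NAME Animalia",
    "1 SEX M",
    "0 @I2@ INDI",
    "1 NAME Chordata",
    "1 SEX M",
    "0 @F1@ FAM",
    "1 HUSB @I1@",
    "1 CHIL @I2@",
    "0 TRLR",
]
counts = count_records(sample)
print(counts["INDI"], counts["FAM"])
```

A real import has to do far more than count, but the same principle applies: whatever the program keeps per record adds up quickly when there are 2 million of them.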

I’ve been thinking about The Confucius Challenge for some time now and these two files are just what I needed to create a Confucius Cascade and torture test the best genealogy software in the Confucius Cup 2008.

updates

2008-09-09: GenealogyOfLife

Most of Pruitt’s GEDCOMs can be viewed as online trees at GodsKingsAndHeroes, and GenealogyOfLife is a site where you can browse the Catalogue of Life as a TNG database.

2011-04-23: Google Group Famous Family Trees

The Google Group Famous Family Trees has not seen any activity in years. It is still around, but you may experience several time-outs before Google shows it to you.
