Modern Software Experience

2008-08-11

GEDCOM import tests

As part of reviewing genealogy programs, I do quite a few import tests. After starting out with random selections from my GEDCOM collection, I settled on a small GEDCOM file, less than 5.000 individuals and close to 1 MB in size, for initial tests, and on a large file, more than 100.000 individuals, as the real test of a program’s abilities.

speed

These tests have revealed a plethora of problems, including failure to import, and I often find myself complaining about how slow programs are, but a few programs stood out for importing quickly.
In fact, some programs get so close to raw hard disk reading speed, that the measurements are noticeably impacted by system load and the limited measurement accuracy.

accuracy

Generally, accuracy of 1 second is more than satisfactory, but in some cases, I found myself estimating whether the import time was perhaps half a second more or less. For these exceptional programs, a GEDCOM file with 100.000 individuals is a walk in the park. To truly test these programs, a more substantial file is needed.
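When an import finishes in about a second, system load can easily account for half a second of difference. One way to reduce that noise, sketched below, is to repeat the measurement and keep the fastest run. This is only an illustration, not how the tests in this article were performed; `fastest_run` is a hypothetical helper, and it can only time code you can call from a script. Repeated runs will also benefit from the operating system's file cache, so the result is closer to a best case than a cold-start time.

```python
import time

def fastest_run(action, repeats=5):
    """Run action() several times and return the shortest wall-clock time in seconds.

    The fastest run is the one least disturbed by other activity on the system.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()   # high-resolution monotonic clock
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; a real test would call the import routine here.
elapsed = fastest_run(lambda: sum(range(1_000_000)))
print(f"fastest of 5 runs: {elapsed:.3f} s")
```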

The Challenge

largest family tree

The largest known family tree is that of Confucius (孔夫子; Kǒng Fūzǐ). The Confucius Genealogy Compilation Committee has been working toward publication of the fifth edition of Confucius’ lineage in 2009. The number of living descendants is estimated at 3 million. The publication will include more than 2 million people.

Confucius Challenge

The Confucius Challenge is a practical challenge for genealogical software: the ability to handle the database of Confucius’ descendants. The software should be able to import the database, let you work with it, and export it again. All interactive operations such as start-up, navigation, editing, saving, and searching should perform at acceptable speeds. It should not just work, but be workable too. The challenge is to demonstrate that the software is good enough to be used by the Confucius Genealogy Compilation Committee.

moving target

This practical challenge is a moving target. The Confucius database keeps growing. Then again, computers keep getting faster and memory keeps getting cheaper. This challenge may spur vendors to ensure their products keep getting better.

software challenge

This is not a hardware challenge. Common PCs are already good enough. A 2 GHz 32-bit Intel PC with 2 GB of RAM is up to this, and these were for sale in the previous millennium already.

2 GB of RAM is plenty of space for a database with 2 million individual records; that’s a kilobyte per record. Assuming a swap file of 2 GB, there will actually be 4 GB to work with, but once the program starts swapping, it becomes slow. It is best to think of the swap file as a last resort that allows us to briefly exceed the available RAM instead of failing with an out-of-memory error. The swap file allows brief peaks of 2,1 or even 3 GB.

In practice, the program will not have 2 GB of RAM available, as the operating system, firewall, anti-virus software, drivers, etcetera already take up about half a gigabyte. Thus, the program will have about 1½ GB of RAM to work in.
Practically, that probably translates to about 1 GB of RAM for data, and half a GB for the program itself and overhead such as the indexes on that data. That still leaves about half a kilobyte per individual record. That could be 32 bytes towards the necessary family records, and 480 bytes for the individual record itself (including references to its family records). That’s plenty of room for the vital data.
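The memory budget above can be checked with a few lines of arithmetic. The sketch below assumes binary gigabytes (1 GB = 1024³ bytes) and exactly 2 million individuals; the variable names are my own and the figures are the round numbers used in the text.

```python
GB = 1024 ** 3              # assumption: binary gigabytes
INDIVIDUALS = 2_000_000     # size of the Confucius database, rounded

def bytes_per_record(memory_bytes, records=INDIVIDUALS):
    """Average number of bytes available for each individual record."""
    return memory_bytes / records

total_ram = 2 * GB
system_overhead = GB // 2        # operating system, firewall, anti-virus, drivers
program_and_indexes = GB // 2    # the program itself plus indexes on the data
data_ram = total_ram - system_overhead - program_and_indexes

print(round(bytes_per_record(total_ram)))   # 1074, about a kilobyte
print(round(bytes_per_record(data_ram)))    # 537, about half a kilobyte

# The split suggested in the text: 32 bytes of family data, 480 for the individual.
family_share, individual_share = 32, 480
print(family_share + individual_share <= bytes_per_record(data_ram))  # True
```

With decimal gigabytes the per-record figures come out at exactly 1000 and 500 bytes, so the suggested 512-byte split would be slightly over budget; the conclusion that roughly half a kilobyte per individual is available holds either way.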

This calculation breaks down if you were to load many pictures or complete biographies, so the program should keep those on disk until requested by some user action. That said, this calculation is pessimistic, as it is not necessary to keep all vital data in memory.
Most database systems keep data and indexes on disk until they are needed. Such a system might load all the indexes in about half a gigabyte of RAM, and then still have one GB free to operate in. That’s a very practical approach, as the program will need some memory for the user interface, to draw diagrams and create reports and such. Besides, you want a bit of room to run a few other programs. Still, the hardware is good enough.
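To make the idea of keeping records on disk and only an index in memory concrete, here is a minimal sketch. It is not how any particular genealogy program works. It assumes a GEDCOM file in which each individual starts with a level-0 line such as `0 @I1@ INDI`, and the helper names `build_offset_index` and `read_record` are hypothetical. The index holds one byte offset per individual; the record itself is read from disk only when asked for.

```python
import io

def build_offset_index(stream):
    """Scan a binary GEDCOM stream once; map each INDI id to its byte offset."""
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        parts = line.split()
        # A new individual record looks like: 0 @I1@ INDI
        if len(parts) >= 3 and parts[0] == b"0" and parts[2] == b"INDI":
            index[parts[1]] = offset
    return index

def read_record(stream, offset):
    """Read one record from disk: its level-0 line plus all lines below it."""
    stream.seek(offset)
    lines = [stream.readline().rstrip(b"\r\n")]
    while True:
        line = stream.readline()
        if not line or line.startswith(b"0 "):
            break   # end of file or start of the next record
        lines.append(line.rstrip(b"\r\n"))
    return lines

# Made-up sample data; a real program would open the file with open(path, "rb").
sample = io.BytesIO(
    b"0 HEAD\n1 CHAR UTF-8\n"
    b"0 @I1@ INDI\n1 NAME First /Person/\n"
    b"0 @I2@ INDI\n1 NAME Second /Person/\n1 SEX M\n"
    b"0 TRLR\n"
)
index = build_offset_index(sample)
print(read_record(sample, index[b"@I2@"]))
```

An index like this costs a few dozen bytes per individual instead of the full record, which is why the indexes for 2 million individuals can fit comfortably in the memory budget discussed above.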

This is a software challenge, a challenge to produce mean and lean software with modest requirements, that makes efficient use of CPU cycles and frugal use of RAM.
This challenge is of interest to all of us. A program that’s efficient enough to be useable by the Confucius Genealogy Compilation Committee should be an interactive joy for everybody else.

links