Modern Software Experience

2008-05-22

GEDCOM Import Speed

I’ve made an overview of the GEDCOM Import speed numbers I reported in reviews, but some major programs are still missing from the list.

PAF 5.2.18.0

old

PAF 5.0 was released in 1999, PAF 5.2 in 2002. It has not been updated for years. This is the program I created the 100k INDI GEDCOM with, but just how easily will PAF import it again? Well, I already know it takes a few minutes, but I timed the import for inclusion in the table.

dialog box

PAF puts up an options dialog immediately after you select the file, but does not bother the user with any pop-ups during the import. Import time is measured from the moment I click Import on the file selection box, so the extra dialog box adds about a second to the total time.

progress reporting

The GEDCOM Import dialog box shows the progress, visually with a progress bar and numerically as a percentage. It also shows the number of Individuals, Marriages, Sources and Repositories in the file. It imports these in that order, and shows how many it has done. Those numbers increase at a satisfying pace. The first pass shows PAF importing more than a thousand individuals per second. The final step is Linking Records, and it is the only step for which PAF does not display numbers, but just a percentage.
The dialog box also shows the number of errors encountered. The number of errors is zero, as it should be for a file the program created itself.

memory usage

During import of the 100k INDI GEDCOM, the Task Manager shows an increase in memory from 15 MB to 16 MB. I guess actual usage is more than a single megabyte, but I do know that PAF has little trouble importing larger files. The import log lists some GEDCOM statistics, but does not list the import time.

fine speed

The import speed for the 100k INDI GEDCOM is close to 450 individuals per second, and more than 175 KB per second. Those are fine numbers that developers of many other programs would do well to aspire to.

1 MB GEDCOM

The numbers for the 1 MB GEDCOM are dramatically worse. Sure, with an import time of just a few seconds, that extra second caused by the dialog box really impacts the result. However, the real problem is that PAF reported 6268 errors.

BASGEN used TYPE ORDP and TYPE ORDM for REFN tags, and PAF complains about each instance. The resulting import listing file is 1.386.622 bytes, and that is 31,32 % larger than the GEDCOM file it reports on.

Reporting all those errors to the log file slows the import down. The resulting import time still puts many newer applications to shame, but it could have been better.

complaining TYPE

PAF’s complaint is unwarranted, as the GEDCOM specification clearly states that REFN.TYPE is user-defined, which amounts to saying that anything goes. The only restriction posed by the specification is a maximum of 40 characters. More specifically, the specification states that it is a user-defined definition of the USER_REFERENCE_NUMBER, and the specification makes it very clear that USER_REFERENCE_NUMBER need not be a number, but may be text. Besides, PAF did not complain about all the REFN.TYPE tags. BASGEN used TYPE ORCM tags for REFN tags in FAM records, and these do not trigger an error. Somehow, PAF is okay with REFN.TYPE tags in FAM records, but does not like REFN.TYPE tags in INDI records. That is wrong, as the GEDCOM specification allows REFN tags within INDI records. The error PAF generates is wrong too: the TYPE tag is within the REFN tag and the REFN tag is within the INDI record, but PAF complains that it encountered an "Unexpected tag 'TYPE' in Individual Record".
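To make the nesting concrete, here is a minimal Python sketch that tracks GEDCOM level numbers and prints the full tag path of every TYPE line. The sample fragment is hypothetical: the record ids, the name and the REFN values are made up for illustration, and only the TYPE values come from the file discussed above. The function name `tag_paths` is my own, and the sketch assumes well-formed input in which levels never skip.

```python
def tag_paths(gedcom_text):
    """Yield (line_number, path), where path holds the tags from the record down to this line."""
    stack = []  # stack[i] is the tag currently open at level i
    for number, line in enumerate(gedcom_text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue
        level = int(parts[0])
        # A record line may carry a cross-reference id such as @I1@ before the tag.
        tag = parts[2] if len(parts) > 2 and parts[1].startswith("@") else parts[1]
        del stack[level:]  # close everything at this level or deeper
        stack.append(tag)
        yield number, tuple(stack)


# Hypothetical sample; ids, name and REFN values are invented for illustration.
SAMPLE = """0 @I1@ INDI
1 NAME John /Doe/
1 REFN 12345
2 TYPE ORDP
0 @F1@ FAM
1 REFN 678
2 TYPE ORCM
"""

for number, path in tag_paths(SAMPLE):
    if path[-1] == "TYPE":
        print(number, ".".join(path))
# prints:
# 4 INDI.REFN.TYPE
# 7 FAM.REFN.TYPE
```

In both records the TYPE line is subordinate to REFN, not directly to the record, so a message that places TYPE "in Individual Record" skips a level.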

PAF 5.2.18.0

file               1 MB GEDCOM    100k INDI GEDCOM
time               11s            3m44s
time in seconds    11             224
INDI per second    442,00         446,72
bytes per second   95.990,45      177.675,86
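The two rate rows are simply the record count and the file size divided by the import time in seconds. A minimal Python sketch of that arithmetic follows. The helper names are my own, and the individual count and byte size of the 1 MB GEDCOM are not stated in the text: they are assumptions, back-calculated from the rates in the table (442,00 × 11 and 95.990,45 × 11).

```python
import re


def parse_duration(text):
    """Convert a duration such as '3m44s' or '11s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not text or not match:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


def per_second(amount, duration):
    """Average rate over the whole import."""
    return amount / parse_duration(duration)


# Assumed values, back-calculated from the table rather than quoted from the review.
ONE_MB_INDIVIDUALS = 4862
ONE_MB_BYTES = 1_055_895

print(parse_duration("3m44s"))                           # 224
print(f"{per_second(ONE_MB_INDIVIDUALS, '11s'):.2f}")    # 442.00
print(f"{per_second(ONE_MB_BYTES, '11s'):.2f}")          # 95990.45
```

Because the dialog box adds roughly a second to every run, the averages for a short import like this one are pulled down far more than those for the 100k INDI GEDCOM.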

Legacy Family Tree 6

latest version

I tested with the latest version, which is Legacy Family Tree 6.0.0.190. Legacy puts up a dialog immediately after choosing the GEDCOM file. You have to wait for Legacy to "analyse" the file before you can click the "Start Import" button on that dialog, but I am not fooled. That so-called analysis is part of the import process and probably already loads the entire file into memory, either by design or as a side-effect of the design, which allows pretty speedy processing from there on. Import time was measured from clicking the Okay button on the File Open dialog.

You do not need to lose time because of the dialog. Until it is done analysing and the "Start Import" button appears (which should really be labelled "Continue Import"), the dialog displays an "AutoStart Import" checkbox. Just check that box to avoid any delays.

The import time for the 1 MB GEDCOM is 1m30s. Once the import is done, Legacy offers to show the import log. This shows Legacy complaining about the REFN tags in the file. Funnily, while PAF complained about the REFN.TYPE in INDI records, and was okay with REFN.TYPE in FAM records, Legacy is okay with the REFN tags in the INDI records, but complains about the REFN.TYPE tags in the FAM records.

Legacy does not respect your database, but bluntly adds notes for the tags it did not recognise. Legacy adds "Reference:0" as a note to all family records, without asking permission and without any way to remove them.

An issue with Legacy’s import is that it may pop up several message boxes during import. All you can do when they appear is click OK. The one option that would perhaps make some sense, to cancel the import entirely because Legacy cannot handle the file, is not there. Legacy pauses to put up that message box and waits for the user to click OK. Clicking OK is all you can do, and once you do so the import continues. Thus, the effect of the message box is nothing but a delay. This is wrong. Messages can be listed in the import log and truly important ones can be shown in the import progress dialog box or an import complete dialog box.

Legacy currently shows an import completed messagebox that displays the GEDCOM statistics, but not the import time. The import time for the 100k INDI GEDCOM is 49m35s. That is just a tad more than 33 individuals per second. A mediocre result.

Legacy produces an import log file. It contains a few warnings about invalid dates, but does not include the messages it popped up during import, does not include the GEDCOM statistics and does not list the import time either. Legacy’s import log does not list line numbers, and it does not list the erroneous lines themselves either. Instead of giving a line number, Legacy tells you which main record the issue occurred in (e.g. "Main Record:0 @F7522@ FAM") and then lists the error message followed by the problematic part of the line.
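As an illustration of the alternative, here is a minimal Python sketch of an import log that records each message with its line number and the offending line, never blocks on the user, and ends with the GEDCOM statistics and the import time. It is not how Legacy or any other program works; the class name, method names and the sample warning are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ImportLog:
    """Hypothetical log that collects messages instead of showing message boxes."""
    messages: list = field(default_factory=list)
    started: float = field(default_factory=time.perf_counter)

    def warn(self, line_number, line, message):
        # Record the issue and return at once; nothing waits for the user.
        self.messages.append((line_number, line, message))

    def report(self, record_counts):
        # Messages first, then the statistics, then the elapsed import time.
        elapsed = time.perf_counter() - self.started
        lines = [f"line {number}: {message}: {text}"
                 for number, text, message in self.messages]
        lines += [f"{tag}: {count}" for tag, count in sorted(record_counts.items())]
        lines.append(f"import time: {elapsed:.1f}s")
        return "\n".join(lines)


log = ImportLog()
log.warn(7, "2 DATE 31 FEB 1900", "invalid date")  # made-up example line
print(log.report({"INDI": 1, "FAM": 0}))
```

The same collected messages could also feed a single import complete dialog box, so that truly important ones are shown once, after the work is done.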

One problem worth noting is that the resulting Legacy database is ridiculously large, 515.493.888 bytes, that is close to half a gigabyte.

Legacy Family Tree 6.0.0.190

file               1 MB GEDCOM    100k INDI GEDCOM
time               1m30s          49m35s
time in seconds    90             2975
INDI per second    54,02          33,64
bytes per second   11.732,17      13.041,81

RootsMagic 3.2.5

RootsMagic puts up a dialog immediately after choosing the file to import, to ask whether a source should be added for all records from this file. When you quickly click OK, the import time for the 1 MB GEDCOM is 16 seconds.

Import of the 100k INDI GEDCOM takes 5m21s. RootsMagic’s import time seems fairly constant at just a bit more than 300 INDI per second.
RootsMagic does not offer to show it upon completion of the import process, but it does create an import log. The quality of the import log is low. Unlike many other programs, which merely report line numbers, RootsMagic reports the content of the line as well. That sounds good, but although there was a small variety of errors, the only error message that RootsMagic reports is "UNKNOWN INFO".
That is not a big issue when RootsMagic reports "UNKNOWN INFO (LINE 770235)" for a line with the _MARNM tag; we immediately understand that RootsMagic does not support that PAF extension. The problem is that RootsMagic actually reports "UNKNOWN INFO (LINE 152)" for a line such as "4 DATE 22 Sep 1998". That is a perfectly fine date, and you are left to figure out why RootsMagic thinks there is somehow something wrong with including that date there.
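A log becomes far more useful when the message says why a line was rejected. The Python sketch below distinguishes an unsupported extension tag (GEDCOM user-defined tags start with an underscore) from a standard tag that turns up under a parent where the importer does not expect it. It is an illustration only: the table of allowed children is a tiny hypothetical example, not RootsMagic's rules, and the parent tags in the example calls are assumptions.

```python
# Hypothetical, deliberately tiny table of the tags an importer accepts under each parent.
ALLOWED_CHILDREN = {
    "INDI": {"NAME", "BIRT", "REFN"},
    "BIRT": {"DATE", "PLAC"},
    "REFN": {"TYPE"},
}


def describe_problem(parent, tag):
    """Return a specific message for a rejected line, or None if the line is acceptable."""
    if tag.startswith("_"):
        return f"unsupported extension tag {tag}"
    if parent not in ALLOWED_CHILDREN:
        return f"no subordinate tags supported under {parent}, found {tag}"
    if tag not in ALLOWED_CHILDREN[parent]:
        return f"tag {tag} not expected under {parent}"
    return None


print(describe_problem("NAME", "_MARNM"))  # unsupported extension tag _MARNM
print(describe_problem("SOUR", "DATE"))    # no subordinate tags supported under SOUR, found DATE
print(describe_problem("BIRT", "DATE"))    # None
```

With messages like these, a reader of the log can tell at a glance whether the date itself or its position in the record is the problem.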

2008-05-22 RootsMagic 3.2.5

file               1 MB GEDCOM    100k INDI GEDCOM
time               16s            5m21s
time in seconds    16             321
INDI per second    303,88         311,74
bytes per second   65.993,44      120.870,38

links