Modern Software Experience

2009-08-07

name parts

A name consists of several parts, such as the given name and the surname. The GEDCOM specification recognises several parts and specifies how to encode these for correct transfer between different genealogy applications.

GEDCOM

The GEDCOM specification states that the full name should be followed by several tags that list its constituent parts. In other words, a GEDCOM file should tell you exactly how the full name is split into parts, which GEDCOM calls name pieces. However, not all genealogy applications bother to specify those name pieces in their GEDCOM export.
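To make the idea of name pieces concrete, here is a minimal sketch in Python of how an importing application might read them. The GEDCOM fragment, the person in it, and the function name `read_name_pieces` are invented for illustration; the tag names are those of the GEDCOM specification.

```python
# Hypothetical GEDCOM fragment; the person and values are invented.
SAMPLE = """\
0 @I1@ INDI
1 NAME Jan /van der Berg/
2 GIVN Jan
2 SPFX van der
2 SURN Berg
1 SEX M
"""

# The name-piece tags defined by the GEDCOM specification.
PIECE_TAGS = {"NPFX", "GIVN", "NICK", "SPFX", "SURN", "NSFX"}


def read_name_pieces(gedcom_text):
    """Return a list of (full_name, pieces) pairs, one per NAME line.

    pieces maps each name-piece tag found below the NAME line to its value;
    it is empty when the exporting application omitted the name pieces.
    """
    names = []
    current = None  # pieces dict of the NAME record being read, if any
    for line in gedcom_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if level == "1" and tag == "NAME":
            current = {}
            names.append((value, current))
        elif level == "2" and current is not None and tag in PIECE_TAGS:
            current[tag] = value
        elif level in ("0", "1"):
            current = None  # any other level-0 or level-1 line ends the NAME record
    return names


for full_name, pieces in read_name_pieces(SAMPLE):
    print(full_name, pieces)
```

When the name pieces are present, the importing application does not have to guess anything; when the dictionary comes back empty, all it has is the full name.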

optional tags

The specification makes it clear that applications do not have to provide additional tags. It says that “The NPFX, GIVN, NICK, SPFX, SURN, and NSFX tags are provided optionally for systems that cannot operate effectively with less structured information”; not even the tags for the given name (GIVN) and surname (SURN) are mandatory.

The specification not only states that these tags are optional instead of mandatory, it also suggests that these tags are not important at all, by stating that they are only needed for systems that cannot operate effectively with less structured information. Apparently, FamilySearch thinks that these tags are superfluous, that most systems, or at least their own systems, do not need these tags at all. It even seems to imply that systems that do use these tags are inferior, and that is so wrong.

distinctive

The clearly prejudiced way of describing systems that do fully support name structure is one of the mistakes in the GEDCOM specification that seems to result from FamilySearch’s preconception that if they do not care about something, nobody else needs it.

That they go so far as to describe systems that do the right thing and do support these tags as systems that cannot operate effectively with less structured information is reprehensible.

Systems that do support name structure and do use these tags are superior to systems that do not support name structure and do not use these tags.
The right thing to do is distinguish between systems that do fully support the GEDCOM name structure and systems that provide limited support for that name structure.

Aldfaer

Aldfaer is a popular Dutch genealogy application for Windows. Its dialog box for entering an individual has separate fields for the given name, surname and surname prefix, but its GEDCOM export does not include the GIVN, SURN or SPFX tags.

prefix recognition

Upon import, Aldfaer recognises the most common surname prefixes, but that does not guarantee that it will import its own GEDCOM files without making a mistake.

First of all, it only recognises the most common surname prefixes, not all of them. A slight spelling variation in an otherwise common surname prefix is enough for it to not be recognised as the prefix it is.

Secondly, even recognising all prefixes is simply not good enough. A surname may contain (instead of start with) what such a recogniser believes to be a surname prefix.
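Both failure modes are easy to reproduce. The following Python sketch is not Aldfaer's actual code; the prefix table, the function `naive_split` and all names are invented to show how a simplistic table-driven recogniser behaves.

```python
# A deliberately small prefix table, longest entries first (an assumption).
KNOWN_PREFIXES = ["van der", "van", "de"]


def naive_split(full_name):
    """Guess (given, prefix, surname) from a full name using only the table."""
    words = full_name.split()
    if not words:
        return "", "", ""
    # Treat the first known prefix found after the first word as the
    # start of the surname.
    for i in range(1, len(words)):
        rest = " ".join(words[i:]).lower()
        for prefix in KNOWN_PREFIXES:
            if rest.startswith(prefix + " "):
                n = len(prefix.split())
                return (" ".join(words[:i]),
                        " ".join(words[i:i + n]),
                        " ".join(words[i + n:]))
    # No known prefix found: assume the last word is the surname.
    return " ".join(words[:-1]), "", words[-1]


print(naive_split("Jan van der Berg"))       # ('Jan', 'van der', 'Berg')
print(naive_split("Jan vander Berg"))        # ('Jan vander', '', 'Berg')
print(naive_split("Maria Bruins de Vries"))  # ('Maria Bruins', 'de', 'Vries')
```

The first guess is right. In the second, a spelling variation turns the prefix into part of the given name. In the third, if the surname is the double name Bruins de Vries, the recogniser has silently cut it in half.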

not reliable

Thus, because Aldfaer GEDCOM provides only limited support for name structures, export from and import into Aldfaer cannot be relied upon to recreate the same database! The database may have been changed, and it will have been changed without any warning or error.

splitting names

Correctly splitting a name is considerably more complex than recognising a few surname prefixes. The example in RootsMagic Alternate Names contains a space. There are much more complex cases, such as double names. On top of that, users may erroneously add titles in a name field, or use it to record an alias when the application has no field for it. Surname prefixes, names with spaces, double names, and user errors make splitting a name complex already, and I have not even mentioned patronyms, nicknames, call names or alternative spellings yet.

Splitting names into their constituent parts is not a trivial issue that can be solved by just a few lines of code and a small table of surname prefixes. It actually needs an extensive expert system backed by an extensive database of double names and with knowledge of common abuses of the name field.

surname prefix recognition algorithm

The simplistic surname prefix recognition algorithm that several genealogy applications use is not completely useless. It can be used to guess the split and present that guess to the user for approval.
However, that algorithm should not be used to split the name and silently assume that the split was done right. The application should involve the user.

introducing mistakes

A basic import rule is that import should not introduce errors.
An application that relies on nothing more than its surname prefix recognition algorithm to split names is sure to introduce errors, and that is unacceptable.

The Aldfaer GEDCOM export that fails to include full name structure information does not introduce any error. It provides incomplete information. On export, name information is lost. The GEDCOM that Aldfaer produces is of inferior quality, but it is not wrong.

dealing with limited structure

There are several things an application that does support name structure can do when presented with a GEDCOM that does not provide that structure.

Several applications silently auto-split names into parts using nothing more than some simplistic algorithm. However, because that is practically sure to introduce errors, that approach is wrong.

complain about low quality files

The best thing an application can do is to complain about the inferior quality of the file it is asked to import. The application can suggest to the user that they may want to choose different export options, upgrade to a better version of the application that created it, or simply complain in their turn to the vendor of that application until the inferior export has been fixed.
This avoids importing inferior quality files by encouraging users to provide better files and creating awareness about the issue.
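Detecting such a file does not take much. The Python sketch below counts how many NAME lines are followed by GIVN or SURN lines and builds a warning from that; the function names and the wording of the warning are assumptions, not taken from any existing application.

```python
def name_structure_report(gedcom_text):
    """Return (total NAME lines, NAME lines that carry GIVN or SURN pieces)."""
    total = 0
    structured = 0
    in_name = False      # currently inside a level-1 NAME record
    has_pieces = False   # that record has at least one GIVN or SURN line
    for line in gedcom_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level, tag = int(parts[0]), parts[1]
        if level <= 1:
            # A new level-0 or level-1 line closes the previous NAME record.
            if in_name and has_pieces:
                structured += 1
            in_name = level == 1 and tag == "NAME"
            has_pieces = False
            if in_name:
                total += 1
        elif level == 2 and in_name and tag in ("GIVN", "SURN"):
            has_pieces = True
    if in_name and has_pieces:
        structured += 1  # the file ended inside a structured NAME record
    return total, structured


def import_warning(gedcom_text):
    """Return a warning for the user, or None when every name is structured."""
    total, structured = name_structure_report(gedcom_text)
    missing = total - structured
    if missing == 0:
        return None
    return (f"Names without GIVN/SURN name pieces: {missing} of {total}. "
            "Consider re-exporting with full name structure.")
```

An importing application could show this warning before doing anything else, and offer the choice between fixing the export and continuing with guessed splits.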

import log

The import log should tell how each name has been split, and provide enough information, typically a record ID, to make it easy for the user to find each record in the original database and solve the problem at the source, if possible.
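What such a log entry could look like is sketched below in Python. The tab-separated layout, with the record ID first and the chosen split last in GEDCOM slash notation, is just one possible format, and the function names are invented.

```python
def format_log_entry(record_id, full_name, given, surname):
    """Describe one split decision: record ID, original text, chosen split."""
    return f"{record_id}\t{full_name}\t{given} /{surname}/"


def write_import_log(decisions):
    """decisions: iterable of (record_id, full_name, given, surname) tuples."""
    return "\n".join(format_log_entry(*decision) for decision in decisions)


# Hypothetical decisions made during an import.
print(write_import_log([
    ("@I1@", "Jan van der Berg", "Jan", "van der Berg"),
    ("@I2@", "Maria Bruins de Vries", "Maria", "Bruins de Vries"),
]))
```

The record ID lets the user look up the individual in the originating application, and the last column shows exactly what the importer decided.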

configuration file

The user may not be able to provide a better file, in which case it should be imported as well as possible. Ideally, the application should provide an interactive interface in which the user can approve or correct suggested splits, which are then written to an editable configuration file that will be used upon the next import. That way, the application will not bother the user with the same question again and again, and a few tries should be enough to accomplish an import that does not need to ask about names again.
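The ask-once behaviour can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `approved` stands in for the contents of the configuration file, `guess` for whatever algorithm proposes a split, and `ask` for the interactive interface.

```python
def resolve_split(full_name, approved, guess, ask):
    """Return the (given, surname) split for full_name.

    approved: dict of splits the user approved earlier (the configuration).
    guess:    function proposing a split for an unknown name.
    ask:      function that shows the proposal to the user and returns the
              approved or corrected split.
    """
    if full_name in approved:
        return approved[full_name]      # already decided: do not ask again
    decision = ask(full_name, guess(full_name))
    approved[full_name] = decision      # remember for the next import
    return decision


def last_word_guess(full_name):
    """A simplistic proposal: the last word is the surname."""
    given, _, surname = full_name.rpartition(" ")
    return given, surname


def ask_user(full_name, proposal):
    """Stand-in for a dialog box; here the user corrects the proposal."""
    print(f"{full_name}: proposed {proposal}")
    return "Jan", "van der Berg"


approved = {}
resolve_split("Jan van der Berg", approved, last_word_guess, ask_user)  # asks
resolve_split("Jan van der Berg", approved, last_word_guess, ask_user)  # silent
```

The guess is never applied on its own authority; it only saves the user some typing, and the second call does not bother the user at all.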

simple file format

It is important to keep such a configuration file simple and editable. That may even make it usable with more than one application.

The simplest approach to such a file would be just one name per line, with surname slashes marking the surname (as is done in GEDCOM).
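Reading such a file is straightforward. The Python sketch below assumes exactly the format just described, one name per line with slashes around the surname; the function names and the decision to ignore lines without slashes are illustrative choices.

```python
def parse_name_line(line):
    """Split 'Given /Surname/ Suffix' into (given, surname, suffix).

    Returns None when the line does not contain a pair of slashes.
    """
    before, first_slash, rest = line.strip().partition("/")
    surname, second_slash, after = rest.partition("/")
    if not first_slash or not second_slash:
        return None
    return before.strip(), surname.strip(), after.strip()


def load_approved_names(text):
    """Map each full name (slashes removed) to its approved split."""
    approved = {}
    for line in text.splitlines():
        split = parse_name_line(line)
        if split is None:
            continue  # blank or malformed line: ignored in this sketch
        full_name = " ".join(part for part in split if part)
        approved[full_name] = split
    return approved


# Hypothetical configuration file contents.
CONFIG = """\
Jan /van der Berg/
Maria /Bruins de Vries/
"""
print(load_approved_names(CONFIG))
```

On import, a full name that appears in this mapping can be split as the user decided earlier, without any guessing.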

A simple list of names should allow the importing application to generally make the right decision upon import, so that the user need not fix all the errors a simplistic guessing algorithm would otherwise make after each import. Any name that still does not import correctly only highlights that the originating application should provide not just the name, but the name structure as well.

The real beauty of this idea is that you should be able to create such a configuration file by editing the list of decisions the importing application writes to its import log, especially if that part of the import log was written with that possibility in mind.
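As an illustration, suppose the import log records each decision as a tab-separated line of record ID, original name, and the split in slash notation. That layout is an assumption, as is the function name, but with it a first version of the configuration file is only a few lines of Python away.

```python
def config_lines_from_log(log_text):
    """Collect the unique split decisions from an import log, in order."""
    seen = set()
    lines = []
    for entry in log_text.splitlines():
        fields = entry.split("\t")
        if len(fields) != 3:
            continue  # not a name-split entry
        decision = fields[2].strip()
        if decision and decision not in seen:
            seen.add(decision)
            lines.append(decision)
    return lines


# Hypothetical import log.
LOG = (
    "@I1@\tJan van der Berg\tJan /van der Berg/\n"
    "@I2@\tJan van der Berg\tJan /van der Berg/\n"
    "Import finished\n"
)
print("\n".join(config_lines_from_log(LOG)))  # Jan /van der Berg/
```

The user then only has to correct the lines where the importer guessed wrong.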

name structure support

Meanwhile, informing the user about low quality GEDCOM files does not just create awareness about the issue, but also creates awareness about which programs fully support GEDCOM name structure and which ones do not.

That awareness will see users avoid applications with limited name structure support and favour applications with full name structure support, and there is nothing like user demand to get vendors to upgrade their limited name structure support to full name structure support.
