Modern Software Experience

2011-07-31

A Brand New GEDCOM Validator

GEDCOM Validators

A few months ago, the article GEDCOM Validation discussed validation of GEDCOM files; I used the available validators and some other tools to validate the GEDCOM files generated by GedFan.

Validating the GEDCOM files was a less than ideal experience.
FamilySearch's GedChk is a 16-bit MS-DOS application. It runs in a DOS Box on 32-bit Windows, but you need an emulator to run it on 64-bit Windows.
VGED is a 32-bit Windows application. It runs just fine on 64-bit Windows, but version 3.02 is the final version.

I remarked that the situation is certainly less than perfect, as neither GedCheck nor VGED is up to date with the de facto GEDCOM standard, the already more than a decade old GEDCOM 5.5.1, and neither one is being maintained anymore.

That remark inspired Nigel Munro Parker to create a new GEDCOM validator. Version 1.00 of his GEDCOM validator did not have a name; version 1.01 is called GED-inline. The name is a play on words; genealogy application vendors should get in line.
Unlike previous GEDCOM validators, GED-inline supports both GEDCOM 5.5 and GEDCOM 5.5.1. There finally is a GEDCOM 5.5.1 validator.

web service

Nigel Munro Parker's new GEDCOM Validator is a web service on the Amazon EC2 cloud service. It is easy to use; you browse for a GEDCOM file, upload it to the site, and then receive a validation report on the same page.

GED-inline version 1.01.

limitation

I enthusiastically started by trying to validate FAN16.GED. It took almost a minute to upload the file, after which I immediately received the validation report: Upload must be less than 10M.
The FAN16.GED file is 12.416.622 bytes. A message that submissions are limited to files up to 10 MB is just above the Choose File and Submit buttons. If I had not hurried to upload a file, I would have noticed that.
This GEDCOM validator is limited to small files.
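Checking the file size before uploading saves the wasted minute. The following is a minimal sketch of such a check; the function names are hypothetical, and it assumes that "10M" means 10 × 1024 × 1024 bytes, which the site does not spell out.

```python
import os

# Assumption: "10M" is read as 10 * 1024 * 1024 bytes; the site does not define it.
UPLOAD_LIMIT_BYTES = 10 * 1024 * 1024


def fits_upload_limit(size_in_bytes, limit=UPLOAD_LIMIT_BYTES):
    """Return True if a file of this size is below the upload limit."""
    return size_in_bytes < limit


def file_fits_upload_limit(path, limit=UPLOAD_LIMIT_BYTES):
    """Same check, using the size of a file on disk."""
    return fits_upload_limit(os.path.getsize(path), limit)
```

With the 12.416.622 bytes of FAN16.GED, fits_upload_limit returns False under either reading of "10M".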

GED-inline: FAN15.GED

Version 1.0 validation report for FAN15.GED; the URL shown in the address bar is no longer valid.

Validation of FAN15.GED went smoothly. Upload of FAN15.GED (6.073.398 bytes) took 62 seconds, and about 15 seconds later the site showed its validation report. The validator claimed to have found two errors:

*** Line 8:       Invalid content for DEST tag: 'GEDFAN' is not a valid <RECEIVING_SYSTEM_NAME>
*** Line 26:      Invalid content for SEX tag: 'U' is not a valid <SEX_VALUE>

GEDCOM destination

I disagreed with both error messages.

It takes a careful reading of the GEDCOM specification to understand why the first error message may seem right, but is wrong. The article GEDCOM SOUR and DEST explains it in detail. Nigel released an updated version of GED-inline only a few hours after I published that article.
If you've read GEDCOM SOUR and DEST, you know that a GEDCOM validator could use the HEAD.DEST or HEAD.SOUR value to check for compliance with a specific GEDCOM dialect. GED-inline 1.0 does not include any special support for any GEDCOM dialect; it simply supports the GEDCOM specification.

The second error message is wrong too; U is a perfectly legal <SEX_VALUE> that signifies that the gender is unknown. This issue has already been fixed; GED-inline version 1.02 no longer flags gender U as an error.
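To illustrate the check in question, here is a minimal sketch of a SEX line check. It is not GED-inline's code; the function name is hypothetical, the set of values is my reading of the GEDCOM 5.5.1 <SEX_VALUE> definition, and treating an empty value as an error is an assumption.

```python
# Assumption: value set as read from the GEDCOM 5.5.1 <SEX_VALUE> definition.
SEX_VALUES = {"M", "F", "U"}


def check_sex_line(line_number, line):
    """Return an error message for an invalid SEX line, or None otherwise."""
    parts = line.strip().split(" ", 2)  # level, tag, optional value
    if len(parts) < 2 or parts[1] != "SEX":
        return None  # not a SEX line, nothing to check here
    value = parts[2] if len(parts) > 2 else ""
    if value in SEX_VALUES:
        return None
    # Message format modelled on the GED-inline messages quoted in this article.
    return (f"*** Line {line_number}: Invalid content for SEX tag: "
            f"'{value}' is not a valid <SEX_VALUE>")
```

With this value set, the line 1 SEX U passes without a message.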

beta test

Nigel developed the validator in just a few weeks, and performed limited testing. I discovered a few issues as I tried the service; you may discover a few others. The current version number is 1.0, but you should think of it as a beta release; upload your GEDCOM to try it out, and report anything that seems wrong.

The link in the bottom right corner of the GED-inline page says Feedback welcome. The speed with which GED-inline is updated in response to my feedback is impressive; practically every issue I reported was fixed within a day.

messages

I very much agree with the format of the error messages. The error messages provide a line number, are clear and use the same terminology as the specification itself. These are error messages a developer can use to quickly look up the relevant sections of the specification.
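Because the messages follow a fixed pattern, a developer can also process a report mechanically. The sketch below parses messages of the form quoted in this article; the regular expression is inferred from those two examples only, so other GED-inline message types are likely to need additional patterns, and the function name is hypothetical.

```python
import re

# Pattern inferred from the two messages quoted in this article (an assumption).
MESSAGE_PATTERN = re.compile(
    r"^\*\*\* Line (?P<line>\d+):\s+Invalid content for (?P<tag>\w+) tag: "
    r"'(?P<value>.*)' is not a valid (?P<rule>.+)$"
)


def parse_message(message):
    """Split an 'Invalid content' message into its parts, or return None."""
    match = MESSAGE_PATTERN.match(message.strip())
    if match is None:
        return None
    return {
        "line": int(match.group("line")),
        "tag": match.group("tag"),
        "value": match.group("value"),
        "rule": match.group("rule"),  # the specification term to look up
    }
```

The rule field holds the specification term, such as <RECEIVING_SYSTEM_NAME>, which is exactly what you search for in the GEDCOM specification.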

fan value

The validator imports FAN15.GED just fine. GED-inline supports GEDCOM 5.5.1, so it does not complain about the WWW tag. It does not export to GEDCOM, but generates its report just fine. However, the validator accepts small files only, and refuses to import FAN16.GED. Therefore, its fan value is 15. That is low, but that's only because of the upload limit.

GEDCOM versions

GED-inline 1.0 supports GEDCOM 5.5 and GEDCOM 5.5.1. Once it recognises either, it automatically adjusts what error messages it will produce. GED-inline does not support any earlier version of GEDCOM.
When I tried to validate a GEDCOM 5.3 file in GED-inline 1.0, the entire validation report consisted of just the sentence File not recognised as valid GEDCOM file. That message is correct; it does not state the file isn't a GEDCOM file, it only states that GED-inline did not recognise it as such.

GED-inline 1.02 issues the message No validation support for GEDCOM version 5.3, assuming rules for version 5.5.1 and then continues as if you submitted a GEDCOM 5.5.1 file. That is nice, but I don't think it should do that, as it is likely to lead to misunderstandings. For example, when a GEDCOM 5.3 file is validated using the GEDCOM 5.5 grammar, any use of the tag SCHEMA will provoke an error message, while it is a legal GEDCOM 5.3 tag. I think it would be best for GED-inline to either add real support for GEDCOM 5.3, or bluntly reject GEDCOM 5.3 as unsupported instead, with some message like File appears to be GEDCOM 5.3. GED-inline 1.01 supports GEDCOM 5.5 and 5.5.1 only. GED-inline is a validator; validators are allowed to be blunt.
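The blunt approach is easy to sketch. The code below reads the HEAD.GEDC.VERS value from already decoded lines and rejects anything other than 5.5 and 5.5.1. It is an illustration, not GED-inline's implementation; the function names and the wording of the rejection message are hypothetical, and it assumes well-formed level numbers.

```python
SUPPORTED_VERSIONS = {"5.5", "5.5.1"}


def gedcom_version(lines):
    """Return the HEAD.GEDC.VERS value, or None if it cannot be found."""
    path = []  # tags from level 0 down to the current line
    for raw in lines:
        parts = raw.lstrip("\ufeff").strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # not a GEDCOM line; skip it
        level, tag = int(parts[0]), parts[1]
        if level == 0 and tag != "HEAD" and path:
            break  # the header record has ended
        del path[level:]  # assumption: levels never skip a step
        path.append(tag)
        if path == ["HEAD", "GEDC", "VERS"]:
            return parts[2] if len(parts) > 2 else None
    return None


def check_version(lines):
    """Return None for a supported version, else a blunt rejection message."""
    version = gedcom_version(lines)
    if version in SUPPORTED_VERSIONS:
        return None
    if version is None:
        return "File not recognised as valid GEDCOM file."
    # Hypothetical wording, not an actual GED-inline message.
    return (f"File appears to be GEDCOM {version}. "
            "Only GEDCOM 5.5 and 5.5.1 are supported.")
```

Tracking the tag path matters because the header contains more than one VERS line; HEAD.SOUR.VERS is the version of the exporting application, not of GEDCOM.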

character encodings

AnselInputStreamReader

The GED-inline page acknowledges the AnselInputStreamReader by Michael Kay.
It is a Java class for reading ANSEL streams which Michael Kay originally created for his GEDCOM to GedML converter.
The source for that converter is available from his site.

GED-inline supports all the legal character encodings; ASCII, ANSEL, UNICODE (UTF-16) and UTF-8. It supports UTF-8 files with and without a Byte Order Mark (BOM), and supports both little-endian and big-endian UTF-16.
GED-inline supports these legal character encodings in the GEDCOM versions in which they are legal. If you try to validate a UTF-8 GEDCOM 5.5 file, GED-inline correctly reports that 'UTF-8' is not a valid <CHARACTER_SET>. It just tells you that it is wrong; it does not tell you that CHAR UTF-8 is illegal only in GEDCOM 5.5, and legal in GEDCOM 5.5.1. It does not have to tell you that; it is a GEDCOM validator, not a GEDCOM tutor.
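The version dependence comes down to a lookup table. Here is a minimal sketch; the table reflects my reading of the <CHARACTER_SET> definitions in GEDCOM 5.5 and 5.5.1, the function name is hypothetical, and this is not GED-inline's code.

```python
# Assumption: legal CHAR values per GEDCOM version, as read from the specifications.
LEGAL_CHARACTER_SETS = {
    "5.5": {"ANSEL", "UNICODE", "ASCII"},
    "5.5.1": {"ANSEL", "UTF-8", "UNICODE", "ASCII"},
}


def check_character_set(version, char_value):
    """Return None if HEAD.CHAR is legal for this version, else a message."""
    legal = LEGAL_CHARACTER_SETS.get(version)
    if legal is None:
        raise ValueError(f"unsupported GEDCOM version: {version}")
    if char_value in legal:
        return None
    return f"'{char_value}' is not a valid <CHARACTER_SET>"
```

The same CHAR UTF-8 line thus produces a message for version 5.5 and none for version 5.5.1.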

If you try to validate a so-called ANSI GEDCOM, GED-inline reports that 'ANSI' is not a valid <CHARACTER_SET>. It would be perfectly fine for GED-inline to report just that error and then abort the validation, but it continues validating the rest of the file.
GED-inline does not merely check whether the specified character encoding is legal, but actually takes the specified character encoding seriously; when I tried to validate an ASCII GEDCOM file containing non-ASCII characters, GED-inline complained that the Line contains illegal character(s).
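Taking CHAR ASCII seriously amounts to checking every byte. The sketch below works on the raw bytes of a file and reports each line that contains a byte above 0x7F; the function name is hypothetical, and how GED-inline actually performs this check is not documented here.

```python
def find_non_ascii_lines(data):
    """Return (line_number, message) pairs for lines with non-ASCII bytes.

    `data` is the raw content of a file that claims CHAR ASCII.
    """
    problems = []
    for number, line in enumerate(data.splitlines(), start=1):
        if any(byte > 0x7F for byte in line):  # ASCII is 7-bit
            problems.append((number, "Line contains illegal character(s)"))
    return problems
```

Working on bytes rather than on decoded text avoids having to guess which encoding the offending characters were written in.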

GEDCOM dialects and alternatives

FTW GEDCOM

GED-inline 1.0 does not include explicit support for any particular GEDCOM dialect, but that does not stop you from validating GEDCOM dialects. There are legal and illegal GEDCOM extensions, and GED-inline only complains about illegal extensions. If you try to validate an FTW GEDCOM file, GED-inline will process the file just fine, which includes issuing errors for any illegal extensions it encounters. For both Family Tree Maker Classic and New Family Tree Maker GEDCOM files, GED-inline will rightly complain that the version number is in error because it is more than 15 characters long. That is a Family Tree Maker defect I've mentioned in reviews for years, but it still hasn't been fixed. For the FTW GEDCOM file I tried to validate, GED-inline also complained that Invalid content for SUBM tag: 'Unknown' is not a valid <@<XREF:SUBM>@>.
Ancestry.com could certainly benefit from using GED-inline.
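Both complaints concern simple rules that an exporting application could check itself. A minimal sketch of the two checks follows; the function names are hypothetical, and the pointer pattern is a simplified reading of the GEDCOM pointer grammar that, for instance, ignores the maximum pointer length.

```python
import re

MAX_VERSION_LENGTH = 15  # the length limit for a version number

# Simplified: "@", an alphanumeric character, any characters except "@", then "@".
XREF_POINTER = re.compile(r"^@[A-Za-z0-9][^@]*@$")


def version_number_ok(value):
    """Return True if a HEAD.SOUR.VERS value has a legal length."""
    return 1 <= len(value) <= MAX_VERSION_LENGTH


def is_xref_pointer(value):
    """Return True if a value looks like a pointer such as @SUBM1@."""
    return XREF_POINTER.match(value) is not None
```

The value 'Unknown' fails the second check because a HEAD.SUBM line must point to a submitter record rather than contain text.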

FTW TEXT

GED-inline does not support any other language than GEDCOM. It does not support any GEDCOM alternative. Some genealogy applications support FTW TEXT in addition to GEDCOM; GED-inline does not.
If you try to validate an FTW TEXT file, GED-inline 1.0 will simply report File not recognised as valid GEDCOM file. That is technically correct, and all GED-inline has to do, but it would be nice if GED-inline were to report a bit more. It would be nice if GED-inline used the detection technique documented in GEDCOM Magic to recognise FTW TEXT files, and inform the user that it is an FTW TEXT file. The biggest issue with FTW TEXT is that, because the user has been lied to by Family Tree Maker, the user is likely to believe they have a GEDCOM file, and unlikely to understand all the problems they experience with that ostensible GEDCOM file until they are informed that it isn't a GEDCOM file but an FTW TEXT file.

stand-alone

The one thing that disappoints me about GED-inline is that it is a web service; you have to upload your data to use it. GED-inline is written in Java, so a stand-alone edition would work on many operating systems, but there are no plans to create a stand-alone edition yet. That is a pity, as a desktop edition would have several significant advantages over the web service. When you do not have to upload your data to a third party, there is no worrying about who sees your data or what happens to it. A stand-alone version does not introduce an upload delay, and does not need to artificially limit the file size it handles either. A stand-alone version would certainly have a larger fan value, and probably be able to handle most real-life GEDCOM files. More importantly, vendors could integrate a stand-alone version into their test scripts.

updates

2011-08-05 VGed 3.03

Tim Forsythe has updated VGed with support for GEDCOM 5.5.1. There are two GEDCOM 5.5.1 validators now.

2011-08-06 GED-inline 1.03

Nigel Parker has introduced GED-inline version 1.03. The upload limit has been removed, it identifies FTW TEXT files as such, and it has a version history now.

2012-01-15 VGedX 1.00

Tim Forsythe has introduced VGedX, a command-line GEDCOM validator. He also introduced the VGedX demo site, which is functionally identical to the GED-inline site.
