I recently came across a surprising question on the WikiTree Genealogist-to-Genealogist (G2G) forum.
The question, asked by Dirk Laurie, is titled "Is it OK that GEDCOM export splits UTF-8 characters?".
The body of the question explains the problem accurately and succinctly:
"I was shocked to be told that the GEDCOM file is not valid UTF-8. Reason: a multibyte UTF-8 character appearing in a biography was split with the first half at the end of a line and the second half at the start of the following CONC line."
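It is easiest to discuss the issue with a tailored example. I entered the following text as a note on WikiTree:
This text demonstrates a WikiTree GEDCOM export problem. Names like Hélène contain accented characters.
In UTF-8, the é and è are encoded in two bytes. The UTF-8 encoding of é is C3 A9. The UTF-8 encoding of è is C3 A8.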
I then asked WikiTree to create a GEDCOM file. WikiTree creates a file and provides a download link. If you choose to right-click that link and view the file in a new browser tab, those two lines of text now look like this:
1 NOTE This text demonstrates a WikiTree GEDCOM export problem. Names like H�
2 CONC �lène contain accented characters.
2 CONC In UTF-8, the é and è are encoded in two bytes. The UTF-8 encoding o
2 CONC f é is C3 A9. The UTF-8 encoding of è is C3 A8.
You immediately notice that something is wrong with the name Hélène;
instead of the single character é (Latin Small Letter E with Acute),
the browser displays the Unicode Replacement Character, �, twice.
Most browsers will display the Unicode Replacement Character when they do not recognise byte codes as any actual character.
There is something wrong with WikiTree's GEDCOM file.
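The effect is easy to reproduce outside a browser. The following is a minimal Python sketch, not a description of what any particular browser does internally: it decodes the bytes on either side of the line break separately and substitutes the Unicode Replacement Character for anything that is not valid UTF-8. The variable names are invented for the illustration.

```python
# Bytes on either side of the break, as they appear in the exported file.
first_value_tail = b"Names like H\xc3"           # ends with the first byte of é
second_value_head = b"\xa9l\xc3\xa8ne contain"   # starts with the second byte of é

# errors="replace" substitutes U+FFFD for every undecodable piece.
shown_first = first_value_tail.decode("utf-8", errors="replace")
shown_second = second_value_head.decode("utf-8", errors="replace")

print(shown_first)    # Names like H�
print(shown_second)   # �lène contain
```

Each half of the split é produces one replacement character, while the intact è decodes normally.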
This is what the WikiTree GEDCOM file looks like.
0 HEAD
1 SOUR WikiTree.com
2 NAME WikiTree: The Free Family Tree
2 CORP Interesting.com, Inc.
1 DATE 13 Feb 2017
2 TIME 10:41:18 EST
1 CHAR UTF-8
1 FILE 22141-b31fd5.ged
1 COPR Interesting.com, Inc. and Tamura Jones
1 SUBM @SUBM@
1 GEDC
2 VERS 5.5.1
2 FORM LINEAGE-LINKED
....
1 NOTE This text demonstrates a WikiTree GEDCOM export problem. Names like HC3
2 CONC A9lène contain accented characters.
2 CONC In UTF-8, the é and è are encoded in two bytes. The UTF-8 encoding o
2 CONC f é is C3 A9. The UTF-8 encoding of è is C3 A8.
In this example, the Latin Small Letter e with Acute inside Hélène has been split over two lines.
WikiTree GEDCOM files always use Unicode, specifically the UTF-8 encoding.
In Unicode, Latin Small Letter E with Acute is code point U+00E9, and in UTF-8 that code point is encoded as the two-byte sequence C3 A9.
In the WikiTree GEDCOM export, the first byte (C3) is at the end of one line value, and the second byte (A9) is at the start of the next line value.
This is very wrong, but let's start with what WikiTree does right.
The WikiTree header is mostly okay.
WikiTree uses GEDCOM 5.5.1 and UTF-8, as a web tree should.
The only outright error in the WikiTree GEDCOM header is the use of a time zone; GEDCOM does not support time zones.
A shortcoming - but merely a deviation from best practice, not an error - is the absence of the UTF-8 Byte Order Mark (BOM).
WikiTree uses the Line Feed (LF) character as the End of Line (EOL) marker, which is a legal choice, but not the right one;
a web tree should use the Carriage Return / Line Feed (CR/LF) combination, to make sure the GEDCOM text files are compatible with any desktop platform.
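Both recommendations are cheap to follow. As a minimal sketch (the function name `encode_gedcom` is hypothetical and not taken from any GEDCOM library), a writer could terminate every line with CR/LF and prefix the UTF-8 Byte Order Mark like this:

```python
import codecs

def encode_gedcom(lines):
    """Terminate each GEDCOM line with CR/LF and prefix the UTF-8 BOM."""
    text = "".join(line + "\r\n" for line in lines)
    return codecs.BOM_UTF8 + text.encode("utf-8")

data = encode_gedcom(["0 HEAD", "1 CHAR UTF-8", "0 TRLR"])
print(data[:3].hex(" "))   # ef bb bf
```

The three bytes EF BB BF at the start are the Byte Order Mark; everything after them is the ordinary UTF-8 text.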
The Byte Order Mark isn't the only thing missing; there is another noteworthy omission.
The WikiTree GEDCOM lists WikiTree.com as the source (HEAD.SOUR), but does not list a WikiTree version number (HEAD.SOUR.VERS).
That version number isn't mandatory, but it should be.
Third parties need that version number to recognise version-specific issues.
The current WikiTree GEDCOM header is an improvement over the one they used back in 2012 (see WikiTree GEDCOM).
Back then, the WikiTree GEDCOM header lied that it used GEDCOM 5.5; now it correctly states that it uses GEDCOM 5.5.1.
Back then, the submitter was listed incorrectly; now it uses the SUBM record as it should.
Several more issues noted back then have been solved.
The same-day instant update issue, illegal trailing spaces on GEDCOM lines, has not been fixed in the intervening years.
The WikiTree GEDCOM contains some serious errors.
WikiTree uses the EMAIL and WWW tags introduced in GEDCOM 5.5.1, but does not use them right.
GEDCOM 5.5.1 introduced the EMAIL and WWW tags as part of the ADDRESS_STRUCTURE, but WikiTree uses the nonexistent INDI.EMAIL and INDI.WWW tags.
The way WikiTree uses these tags is illegal, and very unlikely to transfer to any other application.
WikiTree breaks long lines up into multiple lines using CONC tags.
WikiTree sometimes uses the CONC tag when it should be using the CONT tag.
Within the sample text, the sentence "In UTF-8..." starts on a new line, so WikiTree should have used a CONT tag, but it still uses a CONC tag.
WikiTree does not always use CONC when it should be using a CONT tag;
if you have a few empty lines in your note, WikiTree will use the CONT tag, as it should.
WikiTree adds default notes to each individual, so there are plenty of NOTE records in WikiTree GEDCOM files,
many of which use both CONC and CONT.
Several of these lines have trailing spaces, which is illegal.
It is illegal for a reason: because of this error, the note is not likely to be read as WikiTree intended when imported into another application.
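The difference between the two tags shows up as soon as a reader reassembles the note. The sketch below is a simplified illustration rather than a full GEDCOM parser, and `join_note` is a hypothetical helper: CONC glues its value directly onto the preceding text, while CONT starts a new line first.

```python
def join_note(lines):
    """Rebuild note text: CONC appends directly, CONT inserts a line break first."""
    text = ""
    for line in lines:
        # Split into level, tag and (optional) value; the value keeps its spaces.
        level, tag, *rest = line.split(" ", 2)
        value = rest[0] if rest else ""
        if tag == "CONT":
            text += "\n" + value
        else:  # NOTE or CONC
            text += value
    return text

with_conc = join_note(["1 NOTE First line.", "2 CONC Second line."])
with_cont = join_note(["1 NOTE First line.", "2 CONT Second line."])
print(repr(with_conc))   # 'First line.Second line.'
print(repr(with_cont))   # 'First line.\nSecond line.'
```

With CONC where CONT was needed, the line break the user typed is lost and the two sentences run together.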
A GEDCOM line may be at most 255 code units long. The UTF-16 code unit is a word (two bytes); the UTF-8 code unit is a byte. A UTF-8 GEDCOM line may be at most 255 bytes long, and that is 255 bytes including the End of Line (EOL).
WikiTree breaks a GEDCOM line after 78 bytes, but that choice is not illegal; GEDCOM allows breaking most GEDCOM records into multiple lines practically anywhere.
An experienced programmer recognises 78 as 80 minus 2; 80 columns minus two for the mandatory CR/LF;
if WikiTree used CR/LF as the End of Line, the maximum length of a WikiTree GEDCOM line would be 80 bytes.
WikiTree currently does not use CR/LF but LF as the End of Line, so the maximum length of a WikiTree GEDCOM line is actually 79 bytes instead of 80 bytes.
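The arithmetic can be checked mechanically. This small sketch measures a line the way the limit is stated above, in bytes and including the End of Line; the names `MAX_LINE_BYTES` and `line_length` are made up for the illustration.

```python
MAX_LINE_BYTES = 255  # limit for a UTF-8 GEDCOM line, EOL included

def line_length(line, eol="\r\n"):
    """Length of a GEDCOM line in UTF-8 bytes, including its End of Line."""
    return len((line + eol).encode("utf-8"))

line = "x" * 78                      # a line broken after 78 bytes
print(line_length(line))             # 80 with CR/LF
print(line_length(line, eol="\n"))   # 79 with LF only
print(line_length(line) <= MAX_LINE_BYTES)
```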
There is a reason why some programmers will break GEDCOM lines at 80 code units instead of 255; it is the 80-column mindset.
Back in the previous millennium, when graphical displays were still rare, a regular text monitor would display 25 lines of 80 characters.
There was a time when the DEC VT-100 monitor, with its 132-column mode and graphical characters, was sheer luxury.
Longer lines would either wrap around, be truncated, or require horizontal scrolling, all of which is inconvenient,
so many programmers have been taught text should always fit the 80-column monitor.
The 80-column mindset is especially strong in programmers whose teachers have an IBM mainframe background.
Those two problems with WikiTree GEDCOM lines - some lines have a trailing space and it sometimes splits a character in two - have the same root cause.
WikiTree's line breaking algorithm is so simple it hardly deserves to be called an algorithm.
This is how WikiTree breaks a line of text: it takes your text, and converts it to UTF-8,
and then starts writing GEDCOM lines, using CONC and always breaking long lines at the 78th byte.
That is wrong.
It is because of this overly simplistic approach that some lines have trailing spaces, and some characters get split in two.
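To make the failure mode concrete, here is a sketch of that kind of fixed-width splitting. It is a reconstruction from the observed output, not WikiTree's actual code, and `naive_chunks` is a hypothetical name; a small chunk size is used so the effect is visible on a short string.

```python
def naive_chunks(text, size):
    """Encode first, then cut every `size` bytes, ignoring character boundaries."""
    data = text.encode("utf-8")
    return [data[i:i + size] for i in range(0, len(data), size)]

sample = "Names like H\u00e9l\u00e8ne"

# A cut after 13 bytes lands between the two bytes of é.
print(naive_chunks(sample, 13))   # [b'Names like H\xc3', b'\xa9l\xc3\xa8ne']

# A cut after 11 bytes lands right after a space, leaving a trailing space.
print(naive_chunks(sample, 11))   # [b'Names like ', b'H\xc3\xa9l\xc3\xa8ne']
```

Because the cut position depends only on the byte count, whether a line ends in half a character or in a space is pure chance.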
The WikiTree GEDCOM header promises a UTF-8 encoded file, and has to deliver on that promise.
The GEDCOM specification defines a GEDCOM line as consisting of a number of characters;
so that is what GEDCOM readers expect and the WikiTree GEDCOM writer has to provide.
Every line must be valid UTF-8.
Each line must contain a whole number of characters.
When the UTF-8 encoding of a character consists of a sequence of bytes, those bytes must remain together; it is the particular sequence that encodes the desired character.
WikiTree splits the Latin Small Letter E with Acute into two separate bytes, with values C3 and A9, and a bunch of characters in between.
It is an interesting property of UTF-8 that the byte values used in multibyte sequences may not occur on their own; they must be combined with other values to encode a character.
This property makes it easy to detect invalid codes and then respond to them in some sensible manner, for example by displaying the Unicode Replacement Character.
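The property can be tested with a couple of bit operations. In UTF-8, continuation bytes have the bit pattern 10xxxxxx, and no other byte does. The sketch below uses hypothetical helper names and relies on Python's built-in decoder for the validity check.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx, 0x80-0xBF)."""
    return (byte & 0xC0) == 0x80

def is_valid_utf8(data):
    """True if the bytes decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_continuation(0xC3), is_continuation(0xA9))   # False True
print(is_valid_utf8(b"H\xc3"))       # False: first byte without its partner
print(is_valid_utf8(b"\xa9l"))       # False: continuation byte on its own
print(is_valid_utf8(b"H\xc3\xa9l"))  # True: the two bytes together encode é
```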
Text may contain spaces, but GEDCOM lines may not have trailing spaces.
Lines must always be broken before a space, never after a space.
If a line is broken after a space, you get an invalid GEDCOM, and the space is likely to be lost when the GEDCOM is read into another program.
If a line is broken just before a space, the CONC line value starts with a space. This is legal, and should import fine.
Either situation can be avoided by always breaking lines inside a word.
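A splitter that honours both rules does not need to be complicated. The following is one possible sketch, not the only correct approach and certainly not WikiTree's code: it counts bytes per whole character and, when a chunk would end in a space, moves that space to the next chunk. The name `split_value` and the idea that `max_bytes` is the budget for the line value alone (level, tag and EOL excluded) are assumptions of this example.

```python
def split_value(text, max_bytes):
    """Split text into chunks of at most max_bytes UTF-8 bytes each,
    never inside a character and, where possible, never ending in a space."""
    chunks = []
    start = 0
    while start < len(text):
        end, size = start, 0
        # Take whole characters while they fit in the byte budget.
        while end < len(text):
            width = len(text[end].encode("utf-8"))
            if size + width > max_bytes:
                break
            size += width
            end += 1
        if end == start:
            raise ValueError("max_bytes is too small for a single character")
        # If more text follows, back up so the chunk does not end in a space.
        if end < len(text):
            cut = end
            while cut > start and text[cut - 1] == " ":
                cut -= 1
            if cut > start:  # a chunk made only of spaces is left as it is
                end = cut
        chunks.append(text[start:end])
        start = end
    return chunks

sample = "Names like H\u00e9l\u00e8ne"
print(split_value(sample, 13))   # ['Names like H', 'élène']
print(split_value(sample, 11))   # ['Names like', ' Hélène']
```

The first call breaks before the é instead of through it; the second pushes the space to the start of the next chunk. Two edge cases are deliberately left alone: a run of spaces longer than `max_bytes`, and a note whose very last character is a space.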
The only real solution for these WikiTree GEDCOM problems is for WikiTree to fix their GEDCOM line breaking algorithm. Meanwhile, as a workaround, you can fix WikiTree GEDCOM files by manually correcting the WikiTree errors. You can move a trailing space to the beginning of the next line, and unsplit any split character.
You can unsplit the character by removing everything in between the two bytes:
the End of Line, the CONC tag and the space between that tag and its line value.
The result will be a line more than 78 bytes long, but that is okay, as it is still way less than 255 bytes.
In the unlikely case that you have to unsplit several characters, and the resulting line is longer than 255 bytes, you can split that line yourself.
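For files with many split characters, the same repair can be scripted. The sketch below is an untested-in-the-wild illustration with hypothetical function names: it works on raw bytes, looks for a line that ends in an incomplete UTF-8 sequence, and glues the value of the following CONC line onto it. It does not move trailing spaces and does not re-check the 255-byte limit.

```python
import re

CONC_LINE = re.compile(rb"\d+ CONC (.*)", re.DOTALL)

def ends_mid_character(line):
    """True if the line ends with a UTF-8 lead byte that still expects followers."""
    for back in range(1, min(3, len(line)) + 1):
        byte = line[-back]
        if (byte & 0xC0) == 0x80:   # continuation byte: keep looking back
            continue
        if byte >= 0xF0:
            need = 4
        elif byte >= 0xE0:
            need = 3
        elif byte >= 0xC0:
            need = 2
        else:
            need = 1
        return need > back
    return False

def unsplit_characters(data):
    """Rejoin CONC lines whose predecessor ends in the middle of a character."""
    eol = b"\r\n" if b"\r\n" in data else b"\n"
    fixed = []
    for line in data.split(eol):
        match = CONC_LINE.fullmatch(line)
        if fixed and match and ends_mid_character(fixed[-1]):
            fixed[-1] += match.group(1)   # drop the EOL, the tag and one space
        else:
            fixed.append(line)
    return eol.join(fixed)

broken = b"1 NOTE Names like H\xc3\n2 CONC \xa9l\xc3\xa8ne contain accented characters.\n"
repaired = unsplit_characters(broken)
print(repaired.decode("utf-8"))
```

Reading and writing the file in binary mode matters here; opening it as text would fail or mangle the very bytes that need to be rejoined.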
Because it contains invalid codes, the ostensible WikiTree GEDCOM file is not even a valid text file, so a text editor may get confused when you try to read the file. A text editor like Windows NotePad tries to guess the character encoding by analysing the text; and when it discovers that the text isn't valid UTF-8, it really is not going to guess that UTF-8 was intended, but will guess some other encoding instead.
A UTF-8 GEDCOM file should start with the UTF-8 Byte Order Mark (BOM), to make sure editors do not guess wrong, but always pick the right encoding.
That Byte Order Mark tells the editor that the file is a UTF-8 text file.
The WikiTree GEDCOM file lacks that Byte Order Mark....
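Windows NotePad interprets WikiTree's invalid UTF-8 GEDCOM as Windows ANSI (code page 1252). The byte values C3 and A9 are legal in Windows ANSI, but they mean something different! In Windows ANSI, C3 is the code for Ã (Latin Capital Letter A with Tilde), A9 is the code for © (Copyright Sign), and A8 is the code for ¨ (Diaeresis). If the sample text is UTF-8 encoded but interpreted as Windows ANSI, it becomes:
This text demonstrates a WikiTree GEDCOM export problem. Names like HÃ©lÃ¨ne contain accented characters.
In UTF-8, the Ã© and Ã¨ are encoded in two bytes. The UTF-8 encoding of Ã© is C3 A9. The UTF-8 encoding of Ã¨ is C3 A8.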
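The misreading is easy to reproduce. This short Python sketch assumes that Python's `cp1252` codec is a fair stand-in for what the editor calls Windows ANSI; it encodes the name as UTF-8 and then decodes those same bytes as code page 1252.

```python
sample = "Names like H\u00e9l\u00e8ne"

# Encode as UTF-8, then read the same bytes as Windows ANSI (code page 1252).
misread = sample.encode("utf-8").decode("cp1252")

print(misread)   # Names like HÃ©lÃ¨ne
```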
WikiTree uses LF instead of CR/LF as the End of Line marker, which most Windows software does not recognise as EOL. Because of WikiTree's suboptimal EOL choice, NotePad displays the entire file as one single line, and that makes it a lot harder to navigate the file. You can still use Windows NotePad to fix the problem, but I recommend using NotePad++.
NotePad++ is free open source software. NotePad++ not only recognises LF as an End of Line character, there even is a GEDCOM Plugin for NotePad++ that will colour-code any GEDCOM file you load.
These two screenshots show the before and after of the simple edit that fixes the split character problem. Like Windows NotePad, NotePad++ recognises the file as Windows ANSI encoded. I selected everything between the Latin Capital Letter A with Tilde at the end of one line, and the Copyright Sign on the next line, and then hit the Delete key. This removed all the characters in between and combined the two GEDCOM lines into a single GEDCOM line, on which the C3 and A9 bytes are a sequence again.
NotePad++ allows you to change the encoding of a file, and that is something you should definitely not do with a GEDCOM file.
If you ask NotePad++ to save the edited file as UTF-8, you'll get what you asked for (UTF-8 interpreted as Windows ANSI, and then converted to UTF-8) but not what you want.
NotePad++ interpreted the file as Windows ANSI, so save it as Windows ANSI, and let the magic happen;
now that the file no longer has invalid codes, and one more valid UTF-8 sequence than before, it will automagically be recognised as the UTF-8 file it now is.
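Why saving as Windows ANSI works, and saving as UTF-8 does not, can be shown in a few lines. The sketch again assumes Python's `cp1252` codec as a stand-in for the editor's Windows ANSI, and the variable names are invented for the example.

```python
raw = "H\u00e9l\u00e8ne".encode("utf-8")   # bytes on disk after the manual fix
shown = raw.decode("cp1252")               # what the editor displays: HÃ©lÃ¨ne

saved_as_ansi = shown.encode("cp1252")     # the editor writes the original bytes back
saved_as_utf8 = shown.encode("utf-8")      # each displayed character is encoded again

print(saved_as_ansi == raw)                # True
print(saved_as_ansi.decode("utf-8"))       # Hélène
print(saved_as_utf8.hex(" "))              # 48 c3 83 c2 a9 6c c3 83 c2 a8 6e 65
```

Saving as Windows ANSI is a byte-for-byte round trip, so the repaired UTF-8 survives. Saving as UTF-8 turns the two bytes of é into four, which is the double encoding described above.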
Copyright © Tamura Jones. All Rights reserved.