2018-05-04 ANSEL / Unicode Conversion Algorithms

GEDCOM ANSEL

How to do it right

ANSEL to Unicode conversion

There is more to conversion from ANSEL to Unicode than just the ANSEL to Unicode conversion tables. In ANSEL, combining characters (modifiers) come before the base characters, while in Unicode they come after the base character.

After conversion to Unicode, the result should be normalised to the normal form appropriate for the platform. On Mac OS X, the string must be converted to Unicode Normal Form D (decomposed characters), while on Windows, it must be converted to Unicode Normal Form C (composed characters).
This normalisation step may not be skipped for either platform or normal form, as ANSEL text may contain both composed and decomposed characters.
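
To illustrate why, here is a minimal Python sketch using only the standard unicodedata module. The sample string is an assumption chosen for illustration: it mixes one pre-composed character with one decomposed sequence, as a table-only conversion of ANSEL text might produce.

```python
import unicodedata

# A mixed string: pre-composed o-with-horn followed by a decomposed e + acute.
mixed = "\u01a1" + "e\u0301"

nfc = unicodedata.normalize("NFC", mixed)  # everything composed
nfd = unicodedata.normalize("NFD", mixed)  # everything decomposed

# Neither normal form equals the mixed input, so the step changes the string
# whichever platform is targeted.
print(mixed == nfc, mixed == nfd)
```

In Normal Form C the e and its accent merge into one code point; in Normal Form D the o with horn splits into o followed by U+031B.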

ANSEL to Unicode conversion algorithm

Convert each ANSEL code point into the corresponding Unicode code point
Mirror the positions of each base character and its modifiers (modifiers after instead of before base characters)
Convert to Unicode Normal Form C or Unicode Normal Form D, as appropriate for the platform
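
The three steps above might be sketched in Python as follows. This is a rough sketch, not a complete converter: the two tables are small hand-picked excerpts, the function name ansel_to_unicode is hypothetical, and the handling of several modifiers on one base (their relative order is kept) is an assumption. Check any real table against the full ANSEL specification.

```python
import unicodedata

# Illustrative excerpts only; a real converter needs the complete ANSEL table.
ANSEL_BASE = {
    0xA1: "\u0141", 0xA2: "\u00d8", 0xAC: "\u01a0", 0xAD: "\u01af",
    0xB1: "\u0142", 0xB2: "\u00f8", 0xBC: "\u01a1", 0xBD: "\u01b0",
}
ANSEL_COMBINING = {
    0xE1: "\u0300", 0xE2: "\u0301", 0xE3: "\u0302",
    0xE4: "\u0303", 0xE8: "\u0308", 0xF0: "\u0327",
}

def ansel_to_unicode(data: bytes, form: str = "NFC") -> str:
    out = []
    pending = []  # modifiers already seen, waiting for their base character
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])  # step 1 for a modifier
            continue
        if byte < 0x80:
            base = chr(byte)  # the lower half of ANSEL is plain ASCII
        else:
            # Bytes missing from this excerpt become U+FFFD (an assumption).
            base = ANSEL_BASE.get(byte, "\ufffd")
        out.append(base)     # step 2: base first ...
        out.extend(pending)  # ... then its modifiers, order kept (assumption)
        pending.clear()
    out.extend(pending)  # stray trailing modifiers are kept as they are
    return unicodedata.normalize(form, "".join(out))  # step 3

print(ansel_to_unicode(b"Andr\xe2e"))
```

Passing "NFD" as the second argument gives the decomposed form instead.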

Unicode to ANSEL conversion

Conversion from Unicode to ANSEL may seem simple; just perform the ANSEL to Unicode conversion steps in reverse order:

Make sure the Unicode string is in Unicode Normal Form D
Mirror the positions of each base character and its modifiers (modifiers before instead of after base characters)
Replace Unicode code points with their ANSEL equivalents

However, Unicode to ANSEL conversion is more complicated than that. There are two issues that require attention:

unsupported characters
ANSEL's pre-composed characters

unsupported characters

First of all, ANSEL does not support all the characters that Unicode supports, and does not feature a Replacement Character like Unicode does. The widespread convention for such cases is to replace the unsupported character by a question mark. When doing so, care must be taken to replace an unsupported character and all its modifiers by just one single question mark.

ANSEL does not support all the same combining characters as Unicode either. Generally speaking, when a modifier isn't supported, the Unicode character isn't supported, and should be replaced by a question mark.
It is arguably better, albeit inconsistent, to keep the base character if supported, and simply lose any unsupported modifiers.
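
Both policies can be sketched in Python by first grouping the text into clusters of one base character plus its modifiers. The helper names (clusters, replace_unsupported, is_supported_base) and the tiny supported sets are hypothetical, chosen only for illustration; the result is still a Unicode string, so that the replacement policy can be seen in isolation.

```python
import unicodedata

# Hypothetical stand-ins for the real ANSEL repertoire.
SUPPORTED_MODS = {"\u0301", "\u0308"}  # acute, diaeresis

def is_supported_base(ch):
    return ord(ch) < 0x80  # pretend only ASCII base characters are supported

def clusters(text):
    """Yield (base, modifiers) pairs from the NFD form of text."""
    base, mods = None, []
    for ch in unicodedata.normalize("NFD", text):
        # combining() returns a non-zero class for most combining marks.
        if unicodedata.combining(ch) and base is not None:
            mods.append(ch)
        else:
            if base is not None:
                yield base, mods
            base, mods = ch, []
    if base is not None:
        yield base, mods

def replace_unsupported(text, keep_base=True):
    out = []
    for base, mods in clusters(text):
        if not is_supported_base(base):
            out.append("?")  # one question mark for the whole cluster
        elif all(m in SUPPORTED_MODS for m in mods):
            out.append(base + "".join(mods))
        elif keep_base:
            # Keep the base, lose only the unsupported modifiers.
            out.append(base + "".join(m for m in mods if m in SUPPORTED_MODS))
        else:
            out.append("?")  # strict policy: the whole character is lost
    return "".join(out)
```

With keep_base=True, an o followed by an unsupported modifier comes out as a plain o; with keep_base=False, it comes out as a single question mark.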

ANSEL's pre-composed characters

Special attention must be paid to ANSEL's pre-composed characters. For example, ANSEL does not contain an equivalent for U+031B Combining Horn, but it does contain equivalents for the pre-composed characters U+01A0 Latin Capital Letter O with Horn and U+01AF Latin Capital Letter U with Horn.
If your Unicode to ANSEL conversion does not take the existence of ANSEL's pre-composed characters into account, your conversion will reduce those characters to their base character.
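
The effect is easy to reproduce with Python's standard unicodedata module. This small sketch shows what Normal Form D does to O with Horn, and what is left once the unsupported Combining Horn is dropped.

```python
import unicodedata

o_horn = "\u01a0"  # Latin Capital Letter O with Horn
nfd = unicodedata.normalize("NFD", o_horn)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+004F', 'U+031B']

# A naive converter finds no ANSEL equivalent for U+031B and drops it,
# leaving only the base character.
naive = "".join(c for c in nfd if c != "\u031b")
print(naive)  # O
```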

Unicode to ANSEL conversion algorithm

Make sure the Unicode string is in Unicode Normal Form D
De-normalise for pre-composed characters supported by ANSEL
Mirror the positions of each base character and its modifiers (modifiers before instead of after base characters)
Replace Unicode code points with their ANSEL equivalents
- Replace unsupported characters with a question mark
- Take care to replace an unsupported character and all its modifiers by a single question mark
- If a base character is supported, but a modifier is not, keep the base character, but leave off the modifier

The de-normalisation step is essential for taking advantage of ANSEL's pre-composed characters, and not ending up with just their base character instead.
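
The complete algorithm might look like the following Python sketch. As before, the tables are small illustrative excerpts rather than the full ANSEL repertoire, and the names unicode_to_ansel, clusters, encode_base and RECOMPOSE are hypothetical. The de-normalisation table is derived from the base table by asking which ANSEL characters Normal Form D would split apart.

```python
import unicodedata

# Illustrative excerpts only; a real converter needs the complete ANSEL table.
TO_ANSEL_BASE = {
    "\u0141": 0xA1, "\u00d8": 0xA2, "\u01a0": 0xAC, "\u01af": 0xAD,
    "\u0142": 0xB1, "\u00f8": 0xB2, "\u01a1": 0xBC, "\u01b0": 0xBD,
}
TO_ANSEL_COMBINING = {
    "\u0300": 0xE1, "\u0301": 0xE2, "\u0302": 0xE3,
    "\u0303": 0xE4, "\u0308": 0xE8, "\u0327": 0xF0,
}
# Decomposed sequences that ANSEL stores as one pre-composed character,
# for example "O" + U+031B for O with Horn.
RECOMPOSE = {
    unicodedata.normalize("NFD", ch): ch
    for ch in TO_ANSEL_BASE
    if unicodedata.normalize("NFD", ch) != ch
}

def clusters(text):
    """Yield (base, modifiers) pairs from an already decomposed string."""
    base, mods = None, []
    for ch in text:
        if unicodedata.combining(ch) and base is not None:
            mods.append(ch)
        else:
            if base is not None:
                yield base, mods
            base, mods = ch, []
    if base is not None:
        yield base, mods

def encode_base(ch):
    """Return the ANSEL byte for a base character, or None if unsupported."""
    return ord(ch) if ord(ch) < 0x80 else TO_ANSEL_BASE.get(ch)

def unicode_to_ansel(text: str) -> bytes:
    out = bytearray()
    for base, mods in clusters(unicodedata.normalize("NFD", text)):
        # De-normalise: fold a modifier back into the base character when
        # ANSEL has the pre-composed character.
        for m in list(mods):
            if base + m in RECOMPOSE:
                base = RECOMPOSE[base + m]
                mods.remove(m)
        code = encode_base(base)
        if code is None:
            out.append(ord("?"))  # one question mark for the whole cluster
            continue
        # Modifiers go before the base; unsupported modifiers are left off.
        out.extend(TO_ANSEL_COMBINING[m] for m in mods
                   if m in TO_ANSEL_COMBINING)
        out.append(code)
    return bytes(out)
```

With these excerpts, U+01A0 converts to the single ANSEL byte 0xAC, while a horn on a base character that has no pre-composed ANSEL form is simply dropped.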

operating system support

It is generally unwise to try and roll your own character set conversion functions, and best to rely on built-in functions for any character set handling and conversions. All major operating systems, operating environments, and several third-party libraries provide character set conversion functions, but no major systems and few libraries support ANSEL. However, all major operating systems provide functions for conversion of strings to Unicode Normal Forms:

Microsoft Windows provides the NormalizeString() and IsNormalizedString() functions.
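Apple Mac OS X provides the NSString class, with methods such as decomposedStringWithCanonicalMapping and precomposedStringWithCanonicalMapping.
Oracle Java provides the java.text.Normalizer class, with normalize() and isNormalized() methods.
Microsoft .NET provides the String.Normalize() method.
Google Android provides the same java.text.Normalizer class, and the ICU Normalizer2 class as of API level 24.
Apple iOS provides the same NSString class as Apple Mac OS X.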

best practice for genealogy applications

support import of ANSEL GEDCOM files
do not export to ANSEL GEDCOM
- always export to UTF-8 or UTF-16
use built-in functions for Unicode normalisation

acknowledgement

This article was prompted by an observation made by Andrew Hoyle, creator of Chronoplex My Family Tree and the Chronoplex GEDCOM Validator. He observed that, unless done right (with that extra de-normalisation step), Unicode to ANSEL conversion may easily lose information, unintentionally replacing ANSEL's pre-composed characters by their base characters.