There is more to conversion from ANSEL to Unicode than just the ANSEL to Unicode conversion tables. In ANSEL, combining characters (modifiers) come before the base characters, while in Unicode they come after the base character.
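The reordering can be sketched in Python. This is a minimal illustration, not a complete converter: the function name `ansel_to_unicode` is hypothetical, the table `ANSEL_COMBINING` is a small excerpt that should be checked against the full ANSEL code table, and base characters outside ASCII are not handled.

```python
# Illustrative excerpt only (assumption): a real converter needs the
# complete ANSEL table, including the non-ASCII base characters.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # combining grave accent
    0xE2: "\u0301",  # combining acute accent
    0xE8: "\u0308",  # combining diaeresis
}

def ansel_to_unicode(data: bytes) -> str:
    """Convert ANSEL bytes to Unicode, moving modifiers behind their base."""
    out = []
    pending = []  # modifiers seen so far, waiting for their base character
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
            continue
        # Sketch: only ASCII base characters are mapped; anything else
        # becomes U+FFFD because this excerpt has no table entry for it.
        out.append(chr(byte) if byte < 0x80 else "\ufffd")
        out.extend(pending)  # modifiers now follow the base character
        pending.clear()
    out.extend(pending)  # dangling modifiers at end of input are kept as-is
    return "".join(out)
```

For example, the ANSEL bytes `b"Ren\xe2e"` (acute accent before the final e) become `"Rene\u0301"` (acute accent after it).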
After conversion to Unicode, the result should be normalised to the normal form appropriate for the platform.
On Mac OS X, the string must be converted to Unicode Normal Form D (decomposed characters),
while on Windows, it must be converted to Unicode Normal Form C (composed characters).
This normalisation step may not be skipped for either platform or normal form, as ANSEL text may contain both composed and decomposed characters.
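As a sketch of this step, Python's standard `unicodedata.normalize()` can produce either normal form. The helper name `normalise_for_platform` is hypothetical, and falling back to Normal Form C on platforms other than Mac OS X is an assumption made for this example.

```python
import sys
import unicodedata

def normalise_for_platform(text: str, platform: str = sys.platform) -> str:
    """Normalise text to NFD on Mac OS X ("darwin") and to NFC elsewhere."""
    # Assumption: every platform other than Mac OS X is treated like Windows.
    form = "NFD" if platform == "darwin" else "NFC"
    return unicodedata.normalize(form, text)
```

With this helper, `"e\u0301"` becomes the single character `"\u00e9"` for Windows, while `"\u00e9"` becomes `"e\u0301"` for Mac OS X.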
Conversion from Unicode to ANSEL may seem simple: just perform the ANSEL to Unicode conversion steps in reverse order.
However, Unicode to ANSEL conversion is more complicated than that. There are two issues that require attention:
First of all, ANSEL does not support all the same characters that Unicode supports, and does not feature a Replacement Character like Unicode does. The widespread convention for such cases is to replace the unsupported character by a question mark. When doing so, care must be taken to replace an unsupported character and its modifiers by just one single question mark.
ANSEL does not support all the same combining characters as Unicode either.
Generally speaking, when a modifier isn't supported, the Unicode character isn't supported, and should be replaced by a question mark.
It is arguably better, albeit inconsistent, to keep the base character if supported, and simply lose any unsupported modifiers.
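Both policies can be sketched in Python. The sets `SUPPORTED_BASE` and `SUPPORTED_MODIFIERS` are hypothetical stand-ins for the real ANSEL repertoire, the function names are made up for this example, and the input is assumed to be decomposed already (Normal Form D), so that modifiers appear as separate characters.

```python
import unicodedata

# Hypothetical stand-ins for the real ANSEL repertoire (assumption).
SUPPORTED_BASE = {chr(c) for c in range(0x20, 0x7F)}   # ASCII only
SUPPORTED_MODIFIERS = {"\u0300", "\u0301", "\u0308"}   # grave, acute, diaeresis

def clusters(text):
    """Yield (base, modifiers) pairs from decomposed text."""
    base, mods = None, []
    for ch in text:
        if unicodedata.combining(ch) and base is not None:
            mods.append(ch)
        else:
            if base is not None:
                yield base, mods
            base, mods = ch, []
    if base is not None:
        yield base, mods

def filter_for_ansel(text, keep_base=False):
    """Replace each unsupported character, with its modifiers, by one '?'."""
    out = []
    for base, mods in clusters(text):
        if base not in SUPPORTED_BASE:
            out.append("?")  # one question mark for base and modifiers
        elif all(m in SUPPORTED_MODIFIERS for m in mods):
            out.append(base + "".join(mods))
        elif keep_base:
            # Lenient policy: keep the base, drop only unsupported modifiers.
            out.append(base + "".join(m for m in mods if m in SUPPORTED_MODIFIERS))
        else:
            out.append("?")  # strict policy: the whole character is lost
    return "".join(out)
```

Under the strict policy, `"o\u031b\u0301"` (o with horn and acute) becomes `"?"`; with `keep_base=True` it becomes `"o\u0301"`, because only the horn is dropped.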
Special attention must be paid to ANSEL's pre-composed characters.
For example, ANSEL does not contain an equivalent for U+031B Combining Horn,
but it does contain equivalents for the pre-composed characters U+01A0 Latin Capital Letter O with Horn and U+01AF Latin Capital Letter U with Horn.
If your Unicode to ANSEL conversion does not take the existence of ANSEL's pre-composed characters into account, your conversion will reduce those characters to their base character.
The de-normalisation step is essential for taking advantage of ANSEL's pre-composed characters, and not ending up with just their base character instead.
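One way to picture that de-normalisation step is the Python sketch below. It decomposes the text and then puts each base letter and horn back together, so that a later table lookup can find the pre-composed character. The table `HORN_PAIRS` and the function `recompose_horns` are hypothetical. Including the lowercase letters is an assumption, and the sketch only handles a horn that directly follows its base letter.

```python
import unicodedata

# Base letter + U+031B Combining Horn -> pre-composed character.
# The lowercase entries are an assumption for this sketch.
HORN_PAIRS = {
    "O\u031b": "\u01a0",  # Latin Capital Letter O with Horn
    "o\u031b": "\u01a1",  # Latin Small Letter O with Horn
    "U\u031b": "\u01af",  # Latin Capital Letter U with Horn
    "u\u031b": "\u01b0",  # Latin Small Letter U with Horn
}

def recompose_horns(text: str) -> str:
    """Decompose fully, then restore the pre-composed horn letters."""
    result = unicodedata.normalize("NFD", text)
    for pair, precomposed in HORN_PAIRS.items():
        result = result.replace(pair, precomposed)
    return result
```

Without the recomposition, U+01A0 decomposes to `"O\u031b"`, and dropping the unsupported horn leaves just `"O"`. With it, a letter such as O with horn and grave comes out as U+01A0 followed by the combining grave accent, which keeps the horn.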
It is generally unwise to try and roll your own character set conversion functions, and best to rely on built-in functions for any character set handling and conversions. All major operating systems, operating environments, and several third-party libraries provide character set conversion functions, but no major systems and few libraries support ANSEL. However, all major operating systems provide functions for conversion of strings to Unicode Normal Forms:
NormalizeString() and IsNormalizedString() functions.
NSString() function.
Normalizer2 class, with multiple methods.
String.Normalize() method.
Normalizer2 class.
NSString() function as Apple Mac OS X.
This article was prompted by an observation made by Andrew Hoyle, creator of Chronoplex My Family Tree and the Chronoplex GEDCOM Validator. He observed that, unless done right (with that extra de-normalisation step), Unicode to ANSEL conversion may easily lose information, unintentionally replacing ANSEL's pre-composed characters by their base characters.
Copyright © Tamura Jones. All Rights reserved.