Modern Software Experience

2018-04-08

The original title of this article was GEDCOM ANSEL to Unicode. It was changed to ANSEL / Unicode Conversion Tables after the introduction of the ANSEL / Unicode Conversion Algorithms article.

Official Conversion Tables

ANSEL to Unicode

These are the official ANSEL to Unicode conversion tables. Well, these are as official as it gets.

Technically, these aren't ANSEL to Unicode tables, but LDS ANSEL (LANSEL) to Unicode tables. The tables presented here include all the LDS extensions to ANSEL proper, found in different versions of the FamilySearch GEDCOM specifications.

The tables presented here are based on multiple previous articles, now combined into the GEDCOM ANSEL series. The first article presents a correct ANSEL table for GEDCOM 5.5.1, as that specification contains a messed-up table. The second article combines the ANSEL standard and ANSEL tables from different GEDCOM versions into a single table. The third article discusses the mapping of the alif and ayn characters, as well as a few other characters.

This article presents the LDS ANSEL to Unicode conversion tables, based upon the previous articles, the official ISO 5426 (and ANSEL) to Unicode conversion table, the official MARC-21 to Unicode conversion, and existing implementations, particularly FamilySearch's Personal Ancestral File (PAF) 5.2.18. The ANSEL Alif and Ayn article explains why PAF's alif and ayn conversion is wrong.

ANSEL versus Unicode

There is more to conversion from ANSEL to Unicode than just these tables. In ANSEL, modifying characters come before the base character; in Unicode, they come after it.
Conversion from ANSEL to Unicode and back is easiest for Unicode Normalization Form D (decomposed characters), which Mac OS X uses, and takes extra work for Unicode Normalization Form C (composed characters), which Windows uses.
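
The reordering described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete converter: the two dictionaries hold only a handful of rows from the tables in this article, the function name ansel_to_unicode is made up for this example, and keeping multiple modifiers in their original relative order is an assumption.

```python
import unicodedata

# Excerpt of the non-spacing table (assumption: a real converter needs every row).
COMBINING = {0xE1: "\u0300", 0xE2: "\u0301", 0xE8: "\u0308", 0xF0: "\u0327"}
# Excerpt of the spacing table.
SPACING = {0xA1: "\u0141", 0xB2: "\u00F8", 0xCF: "\u00DF"}


def ansel_to_unicode(data: bytes, form: str = "NFC") -> str:
    """Decode ANSEL bytes, moving each modifier behind its base character."""
    out = []
    pending = []  # modifiers seen so far, waiting for their base character
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
            continue
        if byte < 0x80:
            ch = chr(byte)  # the lower half is plain ASCII
        else:
            ch = SPACING.get(byte, "\uFFFD")  # not in this excerpt: replacement character
        out.append(ch)
        out.extend(pending)  # Unicode order: base first, then its modifiers
        pending.clear()
    out.extend(pending)  # modifiers without a base are kept as they are
    # The string built here is already decomposed; NFC needs this extra composition step.
    return unicodedata.normalize(form, "".join(out))


print(ansel_to_unicode(b"r\xe1egle"))
print(ansel_to_unicode(b"Preu\xcfen"))
```

Passing "NFD" as the second argument leaves the characters decomposed, which is the simpler case mentioned above.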

GEDCOM ANSEL to Unicode conversion tables

ANSEL Non-spacing graphic characters
Hex wpcode Dec Graphic ANSEL name Example of use Unicode code point Unicode name
E0 2,4 224 ◌̉ low rising tone mark củi U+0309 Combining Hook Above
E1 1,0 225 ◌̀ grave accent règle U+0300 Combining Grave Accent
E2 1,6 226 ◌́ acute accent está U+0301 Combining Acute Accent
E3 1,3 227 ◌̂ circumflex accent même U+0302 Combining Circumflex Accent
E4 1,2 228 ◌̃ tilde niño U+0303 Combining Tilde
E5 1,8 229 ◌̄ macron gājējs U+0304 Combining Macron
E6 1,22 230 ◌̆ breve altă U+0306 Combining Breve
E7 1,15 231 ◌̇ dot above żaba U+0307 Combining Dot Above
E8 1,7 232 ◌̈ umlaut (diaeresis) öppna U+0308 Combining Diaeresis
E9 1,19 233 ◌̌ hacek vždy U+030C Combining Caron
EA 1,14 234 ◌̊ circle above (angstrom) hår U+030A Combining Ring Above
EB 2,11 235 ◌︠ ligature, left half akademii︠a︡ U+FE20 Combining Ligature Left Half
EC 2,12 236 ◌︡ ligature, right half akademii︠a︡ U+FE21 Combining Ligature Right Half
ED 1,10 237 ◌̕ high comma, off center rozdel̕ovac U+0315 Combining Comma Above Right
EE 1,16 238 ◌̋ double acute accent időszaki U+030B Combining Double Acute Accent
EF 2,25 239 ◌̐ candrabindu Alii̐ev U+0310 Combining Candrabindu
F0 1,17 240 ◌̧ cedilla ça U+0327 Combining Cedilla
F1 1,18 241 ◌̨ right hook, ogonek vietą U+0328 Combining Ogonek
F2 2,0 242 ◌̣ dot below teḍa U+0323 Combining Dot Below
F3 2,1 243 ◌̤ double dot below k̲h̲ut̤bah U+0324 Combining Diaeresis Below
F4 2,3 244 ◌̥ circle below Samskr̥ta U+0325 Combining Ring Below
F5 2,6 245 ◌̳ double underscore G̳hulam U+0333 Combining Double Low Line
F6 2,7 246 ◌̲ underscore s̲amar U+0332 Combining Low Line
F7 2,16 247 ◌̦ left hook dārzin̦a U+0326 Combining Comma Below
F8 2,14 248 ◌̜ right cedilla kho̜ng U+031C Combining Left Half Ring Below
F9 2,9 249 ◌̮ half circle below (upadhmaniya) ḫumantuš U+032E Combining Breve Below
FA 250 ◌︢ double tilde, left half n︢g︣alan U+FE22 Combining Double Tilde Left Half
FB 251 ◌︣ double tilde, right half n︢g︣alan U+FE23 Combining Double Tilde Right Half
FC 1,5 252 ◌̸ diacritic slash through char U+0338 Combining Long Solidus Overlay
FD 253 unused U+FFFD Replacement Character
FE 1,9 254 ◌̓ high comma, centered ge̓otermika U+0313 Combining Comma Above
FF 255 illegal U+FFFD Replacement Character
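
For the return trip, the non-spacing table can be inverted. The sketch below decomposes the text to Normalization Form D first and then writes each modifier in front of its base character. It is only an illustration under stated assumptions: MARK_TO_ANSEL holds just five rows of the table above, the function name unicode_to_ansel is made up for this example, and any base character outside ASCII (the spacing table) is left out and comes back as a question mark.

```python
import unicodedata

# Inverted excerpt of the non-spacing table (assumption: only five rows shown).
MARK_TO_ANSEL = {
    "\u0300": 0xE1,  # grave accent
    "\u0301": 0xE2,  # acute accent
    "\u0303": 0xE4,  # tilde
    "\u0308": 0xE8,  # umlaut (diaeresis)
    "\u0327": 0xF0,  # cedilla
}


def unicode_to_ansel(text: str) -> bytes:
    """Encode text as ANSEL bytes, writing modifiers before their base character."""
    decomposed = unicodedata.normalize("NFD", text)
    out = bytearray()
    i = 0
    while i < len(decomposed):
        base = decomposed[i]
        i += 1
        marks = bytearray()
        # Collect the known modifiers that follow this base character.
        while i < len(decomposed) and decomposed[i] in MARK_TO_ANSEL:
            marks.append(MARK_TO_ANSEL[decomposed[i]])
            i += 1
        out += marks  # ANSEL order: modifiers first
        out += base.encode("ascii", errors="replace")  # non-ASCII bases become "?"
    return bytes(out)


print(unicode_to_ansel("règle"))
```

Because the input is normalized to Form D before anything else happens, it does not matter whether the text arrives composed or decomposed.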
ANSEL Spacing graphic characters
Hex wpcode Dec Graphic ANSEL name Example of use Unicode code point Unicode name
A0 160 unused U+FFFD Replacement Character
A1 1,152 161 Ł slash L — uppercase Łódź U+0141 Latin Capital Letter L with Stroke
A2 1,80 162 Ø slash O — uppercase Øst U+00D8 Latin Capital Letter O with Stroke
A3 1,78 163 Đ slash D — uppercase Đuro U+0110 Latin Capital Letter D with Stroke
A4 1,88 164 Þ thorn — uppercase Þann U+00DE Latin Capital Letter Thorn
A5 1,36 165 Æ ligature AE — uppercase Ægir U+00C6 Latin Capital Letter AE
A6 1,166 166 Œ ligature OE — uppercase Œuvre U+0152 Latin Capital Ligature OE
A7 1,6 167 ◌ʹ mjagkij znak fakulʹtet U+02B9 Modifier Letter Prime
A8 1,1 168 · middle dot novel·la U+00B7 Middle Dot
A9 5,28 169 ♭ musical flat B♭ U+266D Music Flat Sign
AA 4,32 170 ® registered trademark ABC® U+00AE Registered Sign
AB 6,1 171 ± plus or minus A±B U+00B1 Plus-Minus Sign
AC 1,230 172 Ơ hook O - uppercase U+01A0 Latin Capital Letter O with Horn
AD 1,232 173 Ư hook U - uppercase XƯA U+01AF Latin Capital Letter U with Horn
AE 1,11 174 ◌ʼ alif Unʼyusho U+02BC Modifier Letter Apostrophe
AF 175 unused U+FFFD Replacement Character
B0 2,11 176 ◌ʻ ayn faʻil U+02BB Modifier Letter Turned Comma
B1 1,153 177 ł slash l— lowercase rozbił U+0142 Latin Small Letter L with Stroke
B2 1,81 178 ø slash o— lowercase høj U+00F8 Latin Small Letter O with Stroke
B3 1,79 179 đ slash d— lowercase đavola U+0111 Latin Small Letter D with Stroke
B4 1,89 180 þ thorn— lowercase þann U+00FE Latin Small Letter Thorn
B5 1,37 181 æ ligature ae— lowercase skæg U+00E6 Latin Small Letter AE
B6 1,167 182 œ ligature oe— lowercase œuvre U+0153 Latin Small Ligature OE
B7 1,16 183 ◌ʺ hard sign (tvjordyj znak) obʺi︠a︡vlenie U+02BA Modifier Letter Double Prime
B8 1,24 184 ı dotless i— lowercase masalı U+0131 Latin Small Letter Dotless I
B9 4,11 185 £ British pound £5.00 U+00A3 Pound Sign
BA 1,87 186 ð eth verður U+00F0 Latin Small Letter Eth
BB 187 unused U+FFFD Replacement Character
BC 1,231 188 ơ hook o - lowercase U+01A1 Latin Small Letter O with Horn
BD 1,233 189 ư hook u - lowercase Tự Đức U+01B0 Latin Small Letter U with Horn
BE 190 □ empty box U+25A1 White Square
BF 191 ■ black box U+25A0 Black Square
C0 6,33 192 ° degree sign 10°C. U+00B0 Degree Sign
C1 6,49 193 ℓ script l 25 ℓ. U+2113 Script Small L
C2 4,71 194 ℗ phono copyright mark Decca℗ U+2117 Sound Recording Copyright
C3 4,23 195 © copyright mark ©1993 U+00A9 Copyright Sign
C4 5,27 196 ♯ music sharp sign D♯ U+266F Music Sharp Sign
C5 4,8 197 ¿ inverted question mark ¿Qué? U+00BF Inverted Question Mark
C6 4,7 198 ¡ inverted exclamation mark ¡Esta! U+00A1 Inverted Exclamation Mark
C7 199 unused U+FFFD Replacement Character
C8 200 unused U+FFFD Replacement Character
C9 201 unused U+FFFD Replacement Character
CA 202 unused U+FFFD Replacement Character
CB 203 unused U+FFFD Replacement Character
CC 204 unused U+FFFD Replacement Character
CD 205 e e in middle of line U+0065 Latin Small Letter E
CE 206 o o in middle of line U+006F Latin Small Letter O
CF 1,23 207 ß Ess Zed Preußen U+00DF Latin Small Letter Sharp S
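
A table like this is easy to get subtly wrong, so it can help to check the listed Unicode names against the Unicode character database that ships with Python. The sketch below is an illustration only: ROWS is a hypothetical three-row sample, not the full table, and its last row carries a deliberately wrong name to show what a mismatch report looks like.

```python
import unicodedata

# (ANSEL byte, Unicode code point, name as listed) - hypothetical sample rows.
ROWS = [
    (0xA1, 0x0141, "Latin Capital Letter L with Stroke"),
    (0xB9, 0x00A3, "Pound Sign"),
    (0xBA, 0x00F0, "Latin Small Letter Edh"),  # deliberately wrong for the demo
]


def name_mismatches(rows):
    """Return the rows whose listed name differs from the official Unicode name."""
    mismatches = []
    for ansel, code_point, listed in rows:
        official = unicodedata.name(chr(code_point), "")
        # Official names are uppercase, so compare case-insensitively.
        if official != listed.upper():
            mismatches.append((f"{ansel:02X}", f"U+{code_point:04X}", listed, official))
    return mismatches


for row in name_mismatches(ROWS):
    print(row)
```

The same check does not apply to the unused positions, which all map to U+FFFD by design.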

legend

Grey text: code points not documented in the FamilySearch GEDCOM 5.5.1 specification.
Brown text: LDS extensions.

links