Normalise Tai Tham or not?
Richard Wordingham via Unicode
unicode at unicode.org
Tue Oct 10 14:00:12 CDT 2017
I'm preparing to share a spell-checker for Northern Thai in the Tai
Tham script, and I'm having difficulty deciding whether to
offer corrections in NFC/NFD or unnormalised.
The problem arises in closed syllables with tone marks. For example,
ᨠᩥ᩠᩵ᨶ /kin/ 'smell' has two canonically equivalent encodings that respect
the principle of phonetic ordering. The unnormalised form is <U+1A20 HIGH KA,
U+1A65 SIGN I, U+1A75 TONE-1, U+1A60 SAKOT, U+1A36 NA>, which matches
the glyph structure of four glyphs: <U+1A20>, <U+1A65>, <U+1A75> and
<U+1A60, U+1A36>. The NFC and NFD form (the two coincide here) is
<U+1A20, U+1A65, U+1A60, U+1A75, U+1A36>.
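
The reordering can be checked mechanically. The following is a minimal
sketch in Python using only the standard unicodedata module; the variable
and function names are illustrative, not part of any tool mentioned here.
SAKOT has canonical combining class 9 and TONE-1 has class 230, so
canonical reordering moves SAKOT in front of the tone mark.

```python
import unicodedata

# Typed order: HIGH KA, SIGN I, TONE-1, SAKOT, NA
unnormalised = "\u1A20\u1A65\u1A75\u1A60\u1A36"
# Order quoted as the NFC and NFD form: SAKOT precedes TONE-1
normalised = "\u1A20\u1A65\u1A60\u1A75\u1A36"


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


nfc = unicodedata.normalize("NFC", unnormalised)
nfd = unicodedata.normalize("NFD", unnormalised)

print("typed:", code_points(unnormalised))
print("NFC:  ", code_points(nfc))
print("NFD:  ", code_points(nfd))

# Show the canonical combining class of each typed character.
for c in unnormalised:
    print(f"U+{ord(c):04X} ccc={unicodedata.combining(c)}")
```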
The issues I see are:
1) The unnormalised form is a natural and easy form to type. Typing the
normalised form character by character does not come naturally, and
an input method that produced it would be more complex.
2) The unnormalised form is easier for a rendering engine. HarfBuzz
actually presents the font with a non-standard canonical form so that
the invisible stacker, SAKOT, is reordered to before the subscript
consonant. Microsoft's Universal Shaping Engine (USE) would more
naturally accommodate the unnormalised form, which would have a natural
unit of '<halant, consonant>' as an alternative to an indivisible final
consonant. The USE is not designed to respect canonical equivalence.
3) The normalised form is the form preferred for the Web, but the
pressure to use it has decreased.
4) The pressure on search tools to respect canonical equivalence is now
relatively low. Some editors do (e.g. LibreOffice); others don't (e.g.
Emacs, so far as I am aware). Therefore, the dictionary suggestions
should match what the input method produces.
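
To illustrate what respecting canonical equivalence in a search means for
the example word, here is a rough sketch in Python. The function name
find_canonical is hypothetical, and comparing NFD forms as plain
substrings is a simplification: it does not check that a match ends on a
sensible boundary, which a careful search tool would also do.

```python
import unicodedata


def find_canonical(haystack: str, needle: str) -> bool:
    """Report whether needle occurs in haystack when both are compared in NFD."""
    def nfd(s):
        return unicodedata.normalize("NFD", s)

    return nfd(needle) in nfd(haystack)


# Text as typed (TONE-1 before SAKOT) and a query in normalised order.
text = "\u1A20\u1A65\u1A75\u1A60\u1A36"
query = "\u1A20\u1A65\u1A60\u1A75\u1A36"

print("code point search:  ", query in text)
print("canonical search:   ", find_canonical(text, query))
```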
So, should I offer normalised corrections or unnormalised corrections?
Should the spell-checker accept spellings in the dispreferred form
(normalised v. unnormalised)?
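
For concreteness, one possible arrangement is sketched below in Python. It
does not settle either question. It assumes the lexicon is stored in NFC,
that lookups normalise the input so both forms are accepted, and that the
typing order differs from NFC only in placing tone marks (U+1A75..U+1A79)
before a following SAKOT, as in the example word. LEXICON, accepts,
to_typing_order and render are hypothetical names, and the one-word
lexicon is a placeholder.

```python
import re
import unicodedata

SAKOT = "\u1A60"
# Assumption: only the tone marks U+1A75..U+1A79 are moved back before SAKOT.
SAKOT_THEN_TONES = re.compile(SAKOT + "([\u1A75-\u1A79]+)")


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# Placeholder lexicon, stored in NFC.
LEXICON = {nfc("\u1A20\u1A65\u1A75\u1A60\u1A36")}


def accepts(word: str) -> bool:
    """Accept a spelling in either form by normalising before lookup."""
    return nfc(word) in LEXICON


def to_typing_order(word: str) -> str:
    """Turn <SAKOT, tone marks> into <tone marks, SAKOT> after normalising."""
    return SAKOT_THEN_TONES.sub(lambda m: m.group(1) + SAKOT, nfc(word))


def render(entry: str, normalised: bool) -> str:
    """Present a lexicon entry in the form chosen for suggestions."""
    return nfc(entry) if normalised else to_typing_order(entry)
```

Because to_typing_order only permutes adjacent combining marks, its output
remains canonically equivalent to the stored entry, so a suggestion in
either form would be accepted again by accepts.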
Richard.