Codepoint Support for Phonetically-Aware Collation

Sun Jan 5 18:11:03 CST 2014

Several languages with phonetically ambiguous spelling take
pronunciation into account when sorting words alphabetically.  Typical
examples are Welsh and Slovak, where contractions are not applied for
chance combinations of characters ('ng' in Welsh and 'ch' in Slovak).
Less typically, visually opaque syllable boundaries are taken into
account, e.g. in Lao and in some older Thai dictionaries (though the
Thai examples I know of were compiled by Europeans).

There are two approaches to resolving these ambiguities so that
automated collation is correct. One can either use a vocabulary-based
collation table (as is
done for Tibetan-script languages) or use mark-up characters such as
U+00AD SOFT HYPHEN, U+200B ZERO WIDTH SPACE or U+034F COMBINING
GRAPHEME JOINER (CGJ) as appropriate to prevent contractions in
collation.  In the latter case, it is reasonable to assume that such
characters will only be used when it is likely that the text will be
subject to culturally-sensitive sorting. For example, the 'search'
collation settings for Welsh in the CLDR do not use the contractions
used for sorting Welsh, so one does not have to worry about the encoding
of the town name 'Bangor' unless it will be presented in an index in
Welsh - in which case Welsh inflections will be a greater source of
trouble.
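
As a minimal sketch of the mark-up approach, the toy Python below
treats the Welsh digraphs as single collation elements and lets CGJ
block the match.  The names WELSH, DIGRAPHS and welsh_key are
hypothetical, and the key function is a simplification of my own, not
the CLDR tailoring or a UCA implementation; it handles only primary
(letter-level) ordering.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Welsh alphabet order; the digraphs count as single letters.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng",
         "h", "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh",
         "s", "t", "th", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(WELSH)}
DIGRAPHS = {letter for letter in WELSH if len(letter) == 2}


def welsh_key(word):
    """Toy primary sort key (assumption: not a real UCA key)."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            # CGJ carries no weight itself; it only separates letters.
            i += 1
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # Adjacent letters form a contraction.
            key.append(RANK[pair])
            i += 2
        else:
            # Letters outside the alphabet sort after everything else.
            key.append(RANK.get(text[i], len(WELSH)))
            i += 1
    return tuple(key)


words = ["ban" + CGJ + "gor", "banc", "bangor"]
print([w.replace(CGJ, "|") for w in sorted(words, key=welsh_key)])
```

Without the CGJ, 'bangor' is read as b-a-ng-o-r and sorts before
'banc'; with it, the word is read as b-a-n-g-o-r and sorts after.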

CGJ may also be used to distinguish umlaut and diaeresis (both usually
encoded U+0308) in German, by encoding the diaeresis as <U+034F,
U+0308>.
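
A short Python check, using only the standard unicodedata module,
shows why the distinction survives normalisation: CGJ has combining
class 0, so it stops U+0308 from composing with the preceding letter.
The variable names and the choice of 'e' as base letter are mine, for
illustration only.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

umlaut_like = "e" + DIAERESIS           # plain <e, U+0308>
diaeresis_like = "e" + CGJ + DIAERESIS  # marked <e, U+034F, U+0308>

# The plain sequence composes to the precomposed letter under NFC.
nfc_plain = unicodedata.normalize("NFC", umlaut_like)
# The marked sequence is left untouched, because CGJ is a starter.
nfc_marked = unicodedata.normalize("NFC", diaeresis_like)

print([f"U+{ord(c):04X}" for c in nfc_plain])
print([f"U+{ord(c):04X}" for c in nfc_marked])
```

Whether a collation tailoring then weights the two sequences
differently is a separate matter for the tailoring.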

In some SE Asian dictionaries, an ordering distinction is made
between two uses of the letter corresponding to Indic PA: to represent
a voiced sound similar to /b/, in native words, and to represent the
unvoiced sound /p/, in Indic loan words.  The examples I know of
are U+1794 KHMER LETTER BA and U+1A37 TAI THAM LETTER BA.  While it is
possible to represent the contrasting sound /p/ by <U+1794, U+17C9
KHMER SIGN MUUSIKATOAN> or U+1A38 TAI THAM LETTER HIGH PA respectively
instead, in many Indic loan words this is not done.  Is there any
encoding-level mark-up available to distinguish between the two
pronunciations of BA when necessary?  I had thought the problem had
been solved for Khmer, but I can now find no evidence of a solution.

The usage of the two scripts shares the feature that, as the first
element of what is or was a true consonant cluster, BA usually (always?)
has an unvoiced sound, not the voiced sound.  (Sound changes have
made the situation more complicated to describe in Tai Lue, Tai Khuen
and Northern Thai, but the principle remains unchanged.)  This
complicates what had seemed to me the obvious solution, namely to use
<BA, CGJ> to represent the unvoiced sound.  It would be more natural to
use <BA, CGJ, COENG/SAKOT> to indicate the voiced sound should it
appear in clusters in foreign loanwords.
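
To make the candidate sequences concrete, the Python sketch below
spells them out as codepoints.  The helper name describe and the
dictionary layout are hypothetical, and the sequences are only the
possibilities discussed above, not an established convention.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# (BA, cluster-forming sign) for each script mentioned above.
SCRIPTS = {
    "Khmer": ("\u1794", "\u17d2"),     # KHMER LETTER BA, KHMER SIGN COENG
    "Tai Tham": ("\u1a37", "\u1a60"),  # TAI THAM LETTER BA, TAI THAM SIGN SAKOT
}


def describe(sequence):
    """List each character of a sequence as 'U+XXXX NAME'."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in sequence]


for script, (ba, joiner) in SCRIPTS.items():
    print(script, "<BA, CGJ>:", describe(ba + CGJ))
    print(script, "<BA, CGJ, COENG/SAKOT>:", describe(ba + CGJ + joiner))
```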

Richard.