Long-Encoded Restricted Characters in High Frequency Modern Use

Richard Wordingham richard.wordingham at ntlworld.com
Thu May 29 17:39:56 CDT 2014


I am a little confused by the call for a review of UTS #39, Unicode
Security Mechanisms (PRI #273).  Are we being requested to
report long-encoded 'restricted' characters in high frequency modern
use?  'Restricted' refers to the classification in
xidmodifications.txt.

One linked pair of long-encoded restricted characters in high frequency
use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM,
which occur in the extremely common Thai and Lao words for 'water' or
'liquid in general', น้ำ ນ້ຳ.  Their NFKC decompositions are the
nonsensical forms น้ํา ນ້ໍາ, though the words may be faked by the
linguistically incorrect นํ้า ນໍ້າ.  In Thai the three encodings are
respectively <U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO,
U+0E33 THAI CHARACTER SARA AM>, <U+0E19, U+0E49, U+0E4D THAI CHARACTER
NIKHAHIT, U+0E32 THAI CHARACTER SARA AA> and <U+0E19, U+0E4D, U+0E49,
U+0E32>.  Now, U+0E4D THAI
CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
its main use is in writing Pali, which would suggest that it should be
'restricted; historic' or 'restricted; limited-use'.  The situation is
not so clear for Lao: U+0ECD LAO NIGGAHITA is a fairly common vowel in
the Lao language.
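
For anyone who wants to check the decompositions for themselves, here is a
minimal sketch using Python's standard unicodedata module.  The helper name
code_points is mine, purely for illustration; the two words are the usual
spellings with SARA AM and VOWEL SIGN AM given above.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Usual spellings of 'water', written with SARA AM / VOWEL SIGN AM.
thai_water = "\u0E19\u0E49\u0E33"   # NO NU, MAI THO, SARA AM
lao_water = "\u0E99\u0EC9\u0EB3"    # NO, TONE MAI THO, VOWEL SIGN AM

for word in (thai_water, lao_water):
    # NFKC replaces the AM vowel by nikhahit/niggahita + AA, leaving the
    # tone mark in front of the nikhahit/niggahita.
    nfkc = unicodedata.normalize("NFKC", word)
    print(code_points(word), "->", code_points(nfkc))
```

The tone mark ends up before the nikhahit (or niggahita) because the latter
has canonical combining class 0, so normalisation does not reorder the two.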

To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER
SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as
'restricted; technical'. They are all in use in the Khmer language.
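
As a quick way of seeing which characters fall in that range, the following
sketch lists them with their names and general categories as reported by
Python's unicodedata module.  The helper describe and the list khmer_signs
are illustrative names of my own; the output reflects whatever version of
the Unicode Character Database the Python build carries, and says nothing
about the 'restricted' classification itself.

```python
import unicodedata

def describe(cp):
    """Return (U+XXXX, character name, general category) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))

# The range discussed: U+17CB KHMER SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA.
khmer_signs = [describe(cp) for cp in range(0x17CB, 0x17D0 + 1)]

for row in khmer_signs:
    print(*row)
```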

U+17CB KHMER SIGN BANTOC is required for the main methods of writing
the Khmer vowels /a/ and /ɑ/.

U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn
that it has recently become little-used.  It is, however, readily
confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main
modern use is to show that a consonant is silent, rather like the Thai
letter U+0E4C THAI CHARACTER THANTHAKHAT.  (The names are the same.)
The confusion arises because Sanskrit -rCa was pronounced /-r/ in
Khmer, and final /r/ recently became silent in Khmer, so the effect of
the Sanskrit /r/ is now to silence the final consonant.

While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be
common, they are still in modern use.

Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in
frequency, it has not dropped out of use and is still a common enough 
way of writing the vowel /a/.

Richard.


