Long-Encoded Restricted Characters in High Frequency Modern Use

Sat May 31 14:27:55 CDT 2014

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Fri, May 30, 2014 at 12:39 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> I am a little confused by the call for a review of UTS #39, Unicode
> Security Mechanisms (PRI #273).  Are we being requested to
> report long-encoded 'restricted' characters in high frequency modern
> use?  'Restricted' refers to the classification in
> xidmodifications.txt.
>

First, the "restricted" classification does not concern everyday use of
text; it applies specifically to programming identifiers and similar sorts
of identifiers. Moreover, UTS #39 sets up a framework, but the only
conformance requirement is that any modification be declared.

http://www.unicode.org/reports/tr39/proposed.html#C1

You may know this all, but just to be sure.

>
> One linked pair of long-encoded restricted characters in high frequency
> use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM,
> which occurs in the extremely common Thai and Lao words for 'water' or
> 'liquid in general' น้ำ ນ້ຳ whose NFKC decompositions are the
> nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically
> incorrect นํ้า ນໍ້າ.  In Thai the encodings are <U+0E19 THAI CHARACTER
> NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI CHARACTER SARA AM>,
> <U+0E19, U+0E49, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER
> SARA AA> and <U+0E19, U+0E4D, U+0E49, U+0E32>.

The structure of the data is based on the use of NFKC characters in
identifiers. So SARA AM and the Lao equivalent are both not NFKC
characters, and are categorized as such, and would need to be represented
by their NFKC forms. The process is described in
http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection
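To make the NFKC point concrete, here is a minimal Python sketch using the
standard unicodedata module. It only shows that the SARA AM and Lao AM
spellings of the word for 'water' are not stable under NFKC; it does not
consult xidmodifications.txt, and the helper name codepoints is made up for
illustration. Results depend on the Unicode version bundled with the Python
build.

```python
import unicodedata

def codepoints(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

# The Thai and Lao words for 'water', spelled with SARA AM / VOWEL SIGN AM.
thai_water = "\u0E19\u0E49\u0E33"  # NO NU, MAI THO, SARA AM
lao_water = "\u0E99\u0EC9\u0EB3"   # LAO NO, TONE MAI THO, VOWEL SIGN AM

for word in (thai_water, lao_water):
    nfkc = unicodedata.normalize("NFKC", word)
    # A string made only of "NFKC characters" is unchanged by normalization.
    print(codepoints(word), "->", codepoints(nfkc),
          "| unchanged by NFKC:", word == nfkc)
```

The normalized Thai sequence should be the <U+0E19, U+0E49, U+0E4D, U+0E32>
form quoted above, which is what an identifier would have to use under this
scheme.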

You can see the categorization (for 6.3) for a whole script with a link
like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=thai}

(It only works for 6.3 right now, but these items haven't changed recently.)

> Now, U+0E4D THAI
> CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
> its main use is in writing Pali, which would suggest that it should be
> 'restricted; historic' or 'restricted; limited-use'.

For that, it would be best to submit via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html, just to be sure.

> The situation is
> not so clear for Lao
> - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language.
>

Based on your information, the following appear (at least to me) to be
caused by typos in the xidmodifications source files; they are all
marked as 'technical'.

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=khmer}

Again, best to submit this like above (via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html).
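When writing up such a report, it may help to list the Khmer signs in
question (U+17CB through U+17D0) with their formal names. A small Python
sketch using the standard unicodedata module follows; note that this module
does not expose the UTS #39 restriction data, so it only confirms names,
general categories, and NFKC stability. The variable name khmer_signs is
illustrative.

```python
import unicodedata

# Map each code point in the range U+17CB..U+17D0 to its character name.
khmer_signs = {cp: unicodedata.name(chr(cp))
               for cp in range(0x17CB, 0x17D0 + 1)}

for cp, name in khmer_signs.items():
    ch = chr(cp)
    # Report general category and whether NFKC leaves the character as-is.
    stable = unicodedata.normalize("NFKC", ch) == ch
    print(f"U+{cp:04X} {name} gc={unicodedata.category(ch)} "
          f"NFKC-stable={stable}")
```

Since these signs have no compatibility decompositions, their 'technical'
status cannot come from the NFKC step and has to come from the source
classification files.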

> To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER
> SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as
> 'restricted; technical'. They are all in use in the Khmer language.
>
> U+17CB KHMER SIGN BANTOC is required for the main methods of writing
> the Khmer vowels /a/ and /ɑ/.
>
> U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn
> that it has recently become little-used.  It is, however, readily
> confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main
> modern use is to show that a consonant is silent, rather like the Thai
> letter U+0E4C THAI CHARACTER THANTHAKHAT.  (The names are the same.)
> The confusion arises because Sanskrit -rCa was pronounced /-r/ in
> Khmer, and final /r/ recently became silent in Khmer, so the effect of
> the Sanskrit /r/ is now to silence the final consonant.
>
> While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be
> common, they are still in modern use.
>
> Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in
> frequency, it has not dropped out of use and is still a common enough
> way of writing the vowel /a/.
>

> Richard.