Combining Class of Thai Nonspacing_Marks

Richard Wordingham richard.wordingham at ntlworld.com
Tue Apr 4 23:23:55 CDT 2017


On Wed, 5 Apr 2017 10:00:25 +0700
"Gerriet M. Denkmann" <gerrietm at icloud.com> wrote:

> Any two strings which look (almost?) identical should be normalised
> into some canonical form. Reason: not to have identical looking
> filenames in a filesystem. With the current rules of normalisation
> there could be 8 different filenames all looking identical to
> “กินครึ่งทิ้งครึ่ง”.

> E.g. :
> - both NIKHAHIT + Sara Aa  and Sara Am should be normalised into the
> same string (whatever this is)

I think the answer to this is for renderers to insert a dotted circle
in the former.  I hope no-one is going to argue that NIKHAHIT + SARA AA
is appropriate for Sanskrit.

NFKC is not the answer; NFKC(น้ำ) = น้ํา.
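
The NFKC result above can be checked with a short Python sketch. It
assumes only the standard unicodedata module; the variable names are
illustrative.

```python
import unicodedata

# "nam" (water): NO NU, MAI THO, SARA AM
nam = "\u0e19\u0e49\u0e33"

nfc = unicodedata.normalize("NFC", nam)
nfkc = unicodedata.normalize("NFKC", nam)

# SARA AM has only a compatibility decomposition, so NFC leaves it alone.
print(unicodedata.decomposition("\u0e33"))  # <compat> 0E4D 0E32
print(nfc == nam)                           # True

# NFKC splits SARA AM into NIKHAHIT + SARA AA, which lands *after* the
# tone mark: NO NU, MAI THO, NIKHAHIT, SARA AA.
print([f"U+{ord(c):04X}" for c in nfkc])
# ['U+0E19', 'U+0E49', 'U+0E4D', 'U+0E32']
```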
 
> - both top-vowel + tone-mark and  tone-mark + top-vowel should be
> normalised into the same string (whatever this is). etc.

TUS (The Unicode Standard) declares that กิ่ (vowel then tone mark) and
ก่ิ (tone mark then vowel) should render differently.  Unfortunately,
mark-to-mark positioning, if employed at all, tends to be restricted to
combinations that actually occur in correctly spelt Thai.  A
particularly nasty consequence is that doubled vowels above can be
indistinguishable from single vowels above.  I got an angry response
when I suggested that mark-to-mark positioning should be used for all
combinations of marks above - allegedly it makes the GPOS tables 'too
big'.
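
Why normalisation does not merge the two orders can be seen from the
combining classes.  The sketch below uses Python's unicodedata module;
the helper names are hypothetical, and counting the order variants of
the quoted word is only one plausible way to arrive at eight strings.

```python
import itertools
import re
import unicodedata

SARA_I, MAI_EK = "\u0e34", "\u0e48"

# The vowel above has combining class 0, so canonical reordering never
# moves a tone mark (class 107) across it.
print(unicodedata.combining(SARA_I))  # 0
print(unicodedata.combining(MAI_EK))  # 107

vowel_first = "\u0e01" + SARA_I + MAI_EK
tone_first = "\u0e01" + MAI_EK + SARA_I

def canonically_equivalent(a, b):
    """Compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(canonically_equivalent(vowel_first, tone_first))  # False

# Vowel above (MAI HAN-AKAT, SARA I..SARA UEE) followed by a tone mark.
PAIR = re.compile("[\u0e31\u0e34-\u0e37][\u0e48-\u0e4b]")

def order_variants(text):
    """All strings obtained by optionally swapping each vowel+tone pair."""
    starts = [m.start() for m in PAIR.finditer(text)]
    variants = set()
    for flips in itertools.product([False, True], repeat=len(starts)):
        chars = list(text)
        for start, flip in zip(starts, flips):
            if flip:
                chars[start], chars[start + 1] = chars[start + 1], chars[start]
        variants.add(unicodedata.normalize("NFC", "".join(chars)))
    return variants

# The word from the quoted message, spelt vowel-then-tone throughout.
WORD = ("\u0e01\u0e34\u0e19\u0e04\u0e23\u0e36\u0e48\u0e07"
        "\u0e17\u0e34\u0e49\u0e07\u0e04\u0e23\u0e36\u0e48\u0e07")
print(len(order_variants(WORD)))  # 8
```

The word contains three vowel-plus-tone pairs, so swapping each one
independently gives 2**3 = 8 strings that all survive NFC unchanged.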

There's also the very high confusability of <SARA I, NIKHAHIT> and
<SARA UE>.  Traditionally, SARA UE is SARA I plus NIKHAHIT, and I
suspect this is the origin of the etymologically odd form of ลึงค์
'lingam'.  

> If, as Richard Wordingham wrote: "Unicode combining classes cannot be
> changed.  All that can be done is to enforce the order of characters
> in normalised text." then the Unicode Normalisation algorithms should
> be updated.

I think it will be a long time before canonical equivalence is replaced
by canonical equivalence Version 2, but we may not have to wait many
centuries.

In the meantime, you will have to work with your own folding.
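
A minimal sketch of what such a private folding might look like, in
Python.  Everything here is an assumption made for illustration: the
name fold_thai, the choice of vowel-then-tone as the preferred order,
and the choice of SARA AM as the target form are not defined by Unicode
or by this thread, and the result is not a Unicode normalisation form.

```python
import re
import unicodedata

ABOVE_VOWELS = "\u0e31\u0e34\u0e35\u0e36\u0e37"  # MAI HAN-AKAT, SARA I, II, UE, UEE
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"          # MAI EK, THO, TRI, CHATTAWA
NIKHAHIT, SARA_AA, SARA_AM = "\u0e4d", "\u0e32", "\u0e33"

# Tone mark typed before a vowel above.
_TONE_THEN_VOWEL = re.compile(f"([{TONE_MARKS}])([{ABOVE_VOWELS}])")

# NIKHAHIT + SARA AA, with an optional tone mark before or after NIKHAHIT.
_NIKHAHIT_AA = re.compile(
    f"(?:{NIKHAHIT}([{TONE_MARKS}]?)|([{TONE_MARKS}]){NIKHAHIT}){SARA_AA}"
)

def fold_thai(text):
    """Hypothetical folding for comparing names; not a standard algorithm."""
    text = unicodedata.normalize("NFC", text)
    # Assumed preference: vowel above first, then tone mark.
    text = _TONE_THEN_VOWEL.sub(r"\2\1", text)
    # Assumed preference: (tone mark) + SARA AM instead of the split form.
    text = _NIKHAHIT_AA.sub(
        lambda m: (m.group(1) or m.group(2) or "") + SARA_AM, text
    )
    return text

# Both split spellings fold to NO NU, MAI THO, SARA AM.
print(fold_thai("\u0e19\u0e49\u0e4d\u0e32") == "\u0e19\u0e49\u0e33")  # True
print(fold_thai("\u0e19\u0e4d\u0e49\u0e32") == "\u0e19\u0e49\u0e33")  # True
# Tone-then-vowel folds to vowel-then-tone.
print(fold_thai("\u0e01\u0e48\u0e34") == "\u0e01\u0e34\u0e48")        # True
```

The sketch deliberately leaves <SARA I, NIKHAHIT> versus <SARA UE> and
doubled vowels alone; whether to fold those is a further policy choice.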

Richard.


