Combining Class of Thai Nonspacing_Marks

Asmus Freytag (c) asmusf at ix.netcom.com
Wed Apr 5 04:49:52 CDT 2017


On 4/4/2017 8:00 PM, Gerriet M. Denkmann wrote:
>> On 4 Apr 2017, at 00:00, Asmus Freytag <asmusf at ix.netcom.com> wrote:
>>
>> It is not possible to construct a set of secure network identifiers based on simply
>> a) ensuring the string is in NFC
>> b) otherwise allowing all of the Thai characters (insofar as they are PVALID in IDNA 2008 [RFC5892]).
>>
>> Considerable attention to allowable contexts is required. There is a group in Thailand working on this, but their results have not yet been made public.
> Maybe this: Proposal for the Thai Script Root Zone Label Generation Rulesets <https://www.icann.org/en/system/files/files/proposal-thai-lgr-15dec16-en.pdf>

Just as long as you understand that it's not final, even for the problem 
domain it intends to address.
>
> But the rules for Root Zone Labels are (rightly) much more restricted than what I want:

One key difference is that the rules define a preferred ordering, and do 
not define a folding. Obviously, knowing a preferred ordering allows 
anyone to define a folding that results in that ordering.
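For illustration only, here is a rough Python sketch of how a folding 
can be derived from a preferred ordering. The code points and the 
"vowel before tone mark" rule below are my own simplification for the 
example, not the Generation Panel's actual ruleset:

ABOVE_BELOW_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39\u0e47")
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")

def fold_to_preferred_order(s: str) -> str:
    """Swap adjacent <tone mark, vowel> pairs into <vowel, tone mark>."""
    chars = list(s)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in TONE_MARKS and chars[i + 1] in ABOVE_BELOW_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i = max(i - 1, 0)   # re-check: the swap may expose a new pair
        else:
            i += 1
    return "".join(chars)

# Both mark orders for "กิ่ง" fold to the same string; NFC alone keeps
# them distinct because the above vowels have combining class 0.
assert fold_to_preferred_order("ก\u0e48\u0e34ง") == "ก\u0e34\u0e48ง"
assert fold_to_preferred_order("ก\u0e34\u0e48ง") == "ก\u0e34\u0e48ง"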

Another generic difference between an LGR for network identifiers (Root 
Zone or otherwise) and filenames is that an LGR will tend to disallow 
pathological combinations, even if they are in an unambiguous order. 
"Pathological" combinations are those that result in unpredictable 
rendering - not just for a few isolated fonts, but across the board.

I would argue that for complex scripts, there may be a case for 
restricting filenames in a similar manner: expecting that any random 
combining sequence of unbounded length (up to the full filename) must 
be supported will surely lead to filenames that are impossible to tell 
apart, usually because they either do not get rendered in a sensible 
way or get clipped.

This may even be the case for combining sequences in general.
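To make that concrete, a crude screen might simply cap how many 
nonspacing marks may follow a base character. A Python sketch (the cap 
of three is an arbitrary number picked for illustration, not a value 
taken from any standard):

import unicodedata

MAX_MARKS = 3   # illustrative threshold, not a sanctioned limit

def has_pathological_sequence(name: str) -> bool:
    run = 0
    for ch in unicodedata.normalize("NFC", name):
        if unicodedata.category(ch) == "Mn":   # nonspacing mark
            run += 1
            if run > MAX_MARKS:
                return True
        else:
            run = 0
    return False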

LGRs, and the Root Zone LGR in particular, go one step further: they 
tend to explicitly exclude characters that are obsolete, rare, 
historic, special use, and so on; this is done for two main reasons: to 
keep the resulting names recognizable to the majority of users and to 
avoid the kinds of problems introduced by these characters.

For example, for Arabic, the consensus seems to be that for domain 
names, one really doesn't want to support the combining marks. They are 
not needed there, unlike general text, and only lead to a bewildering 
host of non-normalizable dual representations, for which otherwise a 
folding would have to be defined.

Finally, LGRs have some features that go beyond having a clean and 
focused repertoire and a defined ordering: those are the cases where two 
strings look identical, but neither can be construed as "preferred". In 
an LGR these strings can be made "mutually exclusive" using the blocked 
variant mechanism (see RFC 7940). Some file systems have rudimentary 
forms of this, for example those that are case-preserving but not 
case-sensitive. Once a filename is used, its "variant" can no longer be 
added, but there's no a-priori folding into a preferred form.
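In rough Python terms, that rudimentary behavior looks something like 
the following, with casefold() merely standing in for whatever 
blocked-variant key one would actually define:

class Folder:
    # Case-preserving but not case-sensitive: the folded key only blocks
    # collisions; the name is stored exactly as the user gave it.
    def __init__(self):
        self._by_key = {}

    def add(self, name: str) -> None:
        key = name.casefold()          # stand-in for a blocked-variant key
        if key in self._by_key:
            raise FileExistsError("collides with " + repr(self._by_key[key]))
        self._by_key[key] = name       # original spelling preserved

folder = Folder()
folder.add("Readme.txt")
# folder.add("README.TXT")   # would raise: blocked, but never folded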

Other than performance, perhaps, there's no reason a file system's valid 
file name space couldn't be described via RFC 7940. (Even with the full 
features of RFC 7940, collision checking can be implemented as an O(1) 
process for each new file name to be added to a folder). In addition to 
NFC, some additional foldings might be supplied to transform user input 
to valid file names (from case folding to some more complex folding like 
the one you are discussing). As with case-insensitive, non-preserving 
file systems, applying such foldings means the returned file names may 
differ from the ones the user specified.
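A sketch of that pipeline, again with casefold() only standing in for 
the real, script-specific folding:

import unicodedata

def make_valid(name: str) -> str:
    # NFC plus an extra folding applied to user input.
    return unicodedata.normalize("NFC", name).casefold()

names = set()

def create(name: str) -> str:
    stored = make_valid(name)       # one set lookup per new name: O(1)
    if stored in names:
        raise FileExistsError(stored)
    names.add(stored)
    return stored                   # may differ from what the user typed

print(create("README.TXT"))         # -> 'readme.txt', not the name as given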

Again, whether or not you supply a folding is separate from defining a 
preferred ordering. For the latter, you might start with the work the 
Thai Generation Panel has been doing, so that valid network identifiers 
can immediately be valid file names.

A./
>
> Any two strings which look (almost?) identical should be normalised into some canonical form.
> Reason: not to have identical looking filenames in a filesystem.
> With the current rules of normalisation there could be 8 different filenames all looking identical to “กินครึ่งทิ้งครึ่ง”.
>
> E.g. :
> - both NIKHAHIT + Sara Aa  and Sara Am should be normalised into the same string (whatever this is)
> - both top-vowel + tone-mark and  tone-mark + top-vowel should be normalised into the same string (whatever this is).
> etc.
>
> If, as Richard Wordingham wrote: "Unicode combining classes cannot be changed.  All that can be done is
> to enforce the order of characters in normalised text.” then the Unicode Normalisation algorithms should be updated.
>
>
> Kind regards,
>
> Gerriet.
>
>


