Combining Class of Thai Nonspacing_Marks
Asmus Freytag (c)
asmusf at ix.netcom.com
Wed Apr 5 04:49:52 CDT 2017
On 4/4/2017 8:00 PM, Gerriet M. Denkmann wrote:
>> On 4 Apr 2017, at 00:00, Asmus Freytag <asmusf at ix.netcom.com> wrote:
>> It is not possible to construct a set of secure network identifiers based on simply
>> a) ensuring the string is in NFC
>> b) otherwise allowing all of the Thai characters (insofar as they are PVALID in IDNA 2008 [RFC5892]).
>> Considerable attention to allowable contexts is required. There is a group in Thailand working on this, but their results have not yet been made public.
> Maybe this: Proposal for the Thai Script Root Zone Label Generation Rulesets <https://www.icann.org/en/system/files/files/proposal-thai-lgr-15dec16-en.pdf>
Just as long as you understand that it's not final, even for the problem
domain it intends to address.
> But the rules for Root Zone Labels are (rightly) much more restricted than what I want:
One key difference is that the rules define a preferred ordering, and do
not define a folding. Obviously, knowing a preferred ordering allows
anyone to define a folding that results in that ordering.
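To illustrate how an ordering yields a folding, here is a minimal Python
sketch. It is not taken from the Thai proposal. It assumes, purely for
illustration, that the preferred order places an above or below vowel
before a tone mark. The names VOWELS, TONES and fold_to_preferred_order
are hypothetical, and the character sets are deliberately incomplete
(NIKHAHIT and SARA AM are not handled).

```python
import unicodedata

# Assumption for illustration only: vowels above/below sort before tone marks.
VOWELS = set("\u0E31\u0E34\u0E35\u0E36\u0E37\u0E38\u0E39")
TONES = set("\u0E48\u0E49\u0E4A\u0E4B")


def _rank(ch: str) -> int:
    """Position of a mark in the assumed preferred order."""
    return 0 if ch in VOWELS else 1


def fold_to_preferred_order(name: str) -> str:
    """Normalize to NFC, then reorder each run of Thai marks."""
    text = unicodedata.normalize("NFC", name)
    out, run = [], []
    for ch in text:
        if ch in VOWELS or ch in TONES:
            run.append(ch)  # collect a run of marks on one base
            continue
        out.extend(sorted(run, key=_rank))  # stable sort keeps ties in place
        run = []
        out.append(ch)
    out.extend(sorted(run, key=_rank))  # flush a run at end of string
    return "".join(out)
```

NFC alone leaves a tone mark followed by an above vowel untouched,
because the above vowels have combining class 0; the sketch reorders
such a pair only after normalizing.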
Another generic difference between an LGR for network identifiers (Root
Zone or otherwise) and filenames is that an LGR will tend to disallow
pathological combinations, even if they are in an unambiguous order.
"Pathological" combinations are those that result in unpredictable
rendering - not just for a few isolated fonts, but across the board.
I would argue that for complex scripts, there may be a case for
restricting filenames in a similar manner: expecting that any random
combining sequence of unbounded length (up to the full filename) should
be supported will surely lead to filenames that are impossible to tell
apart; usually because they either do not get rendered in a sensible
way, or things get clipped.
This may even be the case for combining sequences in general.
LGRs, and the Root Zone LGR in particular, go one step further: they
tend to explicitly exclude characters that are obsolete, rare,
historic, special use, and so on; this is done for two main reasons: to
keep the resulting names recognizable to the majority of users and to
avoid the kinds of problems introduced by these characters.
For example, for Arabic, the consensus seems to be that for domain
names, one really doesn't want to support the combining marks. They are
not needed there, unlike in general text, and only lead to a bewildering
host of non-normalizable dual representations, for which otherwise a
folding would have to be defined.
Finally, LGRs have some features that go beyond having a clean and
focused repertoire and a defined ordering: they also cover the cases
where two strings look identical, but neither can be construed as
"preferred". In
an LGR these strings can be made "mutually exclusive" using the blocked
variant mechanism (see RFC 7940). Some file systems have rudimentary
forms of this, for example those that are case-preserving but not
case-sensitive. Once a filename is used, its "variant" can no longer be
added, but there's no a-priori folding into a preferred form.
Other than performance, perhaps, there's no reason a file system's valid
file name space couldn't be described via RFC 7940. (Even with the full
features of RFC 7940, collision checking can be implemented as an O(1)
process for each new file name to be added to a folder). In addition to
NFC, some additional foldings might be supplied to transform user input
to valid file names (from case folding to some more complex folding like
the one you are discussing). As with case-insensitive, non-preserving
file systems, adding such foldings would return file names that can
differ from the ones the user specified.
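The two behaviours contrasted above (blocking a variant versus folding
to a preferred form) can be sketched in Python as follows. This is a
rough illustration, not an implementation of RFC 7940: it assumes the
variant relation is an equivalence, so that one comparison key per class
suffices, and it uses NFC plus case folding as a stand-in for a real
rule set. comparison_key, BlockingFolder and FoldingFolder are
hypothetical names.

```python
import unicodedata


def comparison_key(name: str) -> str:
    """Stand-in for a variant rule set: NFC, case fold, then NFC again."""
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


class BlockingFolder:
    """Case-preserving: keeps the first spelling, refuses later variants."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def add(self, name: str) -> str:
        key = comparison_key(name)
        if key in self._by_key:  # hash lookup, independent of folder size
            raise FileExistsError(f"{name!r} collides with {self._by_key[key]!r}")
        self._by_key[key] = name
        return name  # the name is stored exactly as given


class FoldingFolder:
    """Non-preserving: stores the folded form, which may differ from input."""

    def __init__(self) -> None:
        self._names: set[str] = set()

    def add(self, name: str) -> str:
        folded = comparison_key(name)
        if folded in self._names:
            raise FileExistsError(f"{folded!r} already exists")
        self._names.add(folded)
        return folded  # caller sees the preferred form, not the input
```

With a hash table keyed this way, the collision check costs on average
the same regardless of how many names the folder already holds, although
computing the key is still linear in the length of the name.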
Again, whether or not you supply a folding is separate from defining a
preferred ordering. For the latter, you might start with the work the
Thai Generation Panel has been doing, so that valid network identifiers
can immediately be valid file names.
> Any two strings which look (almost?) identical should be normalised into some canonical form.
> Reason: not to have identical looking filenames in a filesystem.
> With the current rules of normalisation there could be 8 different filenames all looking identical to “กินครึ่งทิ้งครึ่ง”.
> E.g. :
> - both NIKHAHIT + Sara Aa and Sara Am should be normalised into the same string (whatever this is)
> - both top-vowel + tone-mark and tone-mark + top-vowel should be normalised into the same string (whatever this is).
> If, as Richard Wordingham wrote: "Unicode combining classes cannot be changed. All that can be done is
> to enforce the order of characters in normalised text.” then the Unicode Normalisation algorithms should be updated.
> Kind regards,