Fwd: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

Mon Dec 12 11:30:31 CST 2016

-------- Forwarded Message --------
Subject: 	Re: Should unassigned code points in blocks reserved for 
combining marks, etc be GCB extended?
Date: 	Mon, 12 Dec 2016 08:26:45 -0800
From: 	Ken Whistler <kenwhistler at att.net>
To: 	Karl Williamson <public at khwilliamson.com>

On 12/12/2016 6:59 AM, Karl Williamson wrote:
> These are currently GCB Other, but when assigned, don't we know that
> they will be Extended?  So this could be done now.
>

Short answer: No.

Long answer:

Every proposal to pre-assign a non-default character property value to
some range of unassigned code points has a bunch of hidden costs. This
proposal would be particularly costly, because it would be smack in the
middle of some of the properties with the hairiest dependency chains.

GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
particular change for GCB=Extend would also get reflected into WB=Extend
and SB=Extend, which are also dependent on Grapheme_Extend=Yes.

Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
gc=Me, which would seem to be a natural match for the blocks "reserved"
for combining marks, but it is actually also dependent on
Other_Grapheme_Extend, which is a mixed bag of various spacing combining
marks for normalization closure, plus ZWNJ, plus tag characters, plus
spacing halfwidth dakuten.
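
As a rough illustration of the derivation just described, here is a
minimal Python sketch. It assumes the standard unicodedata module for
gc, and it uses a small hand-picked sample of Other_Grapheme_Extend
(ZWNJ, the halfwidth sound marks, the tag characters) rather than the
full list from PropList.txt. The names OTHER_GRAPHEME_EXTEND_SAMPLE and
is_grapheme_extend are made up for this example.

```python
import unicodedata

# Assumption: an illustrative SAMPLE only, not the full contributory
# property. The authoritative list is in PropList.txt.
OTHER_GRAPHEME_EXTEND_SAMPLE = (
    {0x200C}                        # ZERO WIDTH NON-JOINER
    | {0xFF9E, 0xFF9F}              # halfwidth (semi-)voiced sound marks
    | set(range(0xE0020, 0xE0080))  # tag characters E0020..E007F
)

def is_grapheme_extend(cp: int) -> bool:
    """Approximate Grapheme_Extend=Yes: gc=Mn or gc=Me, plus the sample."""
    gc = unicodedata.category(chr(cp))
    return gc in ("Mn", "Me") or cp in OTHER_GRAPHEME_EXTEND_SAMPLE

if __name__ == "__main__":
    for cp in (0x0301, 0x20DD, 0x200C, 0xFF9E, 0x0041):
        print(f"U+{cp:04X}", unicodedata.category(chr(cp)),
              is_grapheme_extend(cp))
```

An unassigned code point has gc=Cn, so under this derivation it falls
out as Grapheme_Extend=No unless it is listed explicitly, which is the
complication discussed next.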

So that would raise complicated questions about *how* GCB=Extend would
itself be extended to include certain set ranges of unassigned code
points. Would those simply be assigned directly to Grapheme_Extend=Yes
(which would create a complicated default assignment for that derived
property, and complicate both its documentation *and* its derivation)?
Or would they be assigned directly to Other_Grapheme_Extend (which would
create a new animal in the zoo of properties -- a contributory property
which itself has ranges of unassigned code points given non-default
values)? And once decided, what would be the implications for all the
documentation and the tooling?

Any proposal like this then also has hidden costs on the committees,
because it sets up implied requirements for what can be encoded where
and what properties it has to have. Every time such defaults are set up,
it makes the documentation of what is already "pre-assigned" more
complicated and fragile. Already, a large proportion of the participants
in the maintenance committees have very murky understandings about what
can and cannot be put where in the future, and why. And that is a recipe
for mistakes in encoding.

Finally, like it or not, there currently is no actual contract
guaranteeing that the remaining open ranges in blocks "reserved" for
combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
prevent the committees from deciding that one (or more) spacing
combining marks might be appropriate to encode there, or possibly even
spacing non-combining marks of some strange sort, like the spacing
Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
those ranges free of characters that would not be Grapheme_Extend=Yes
would require some guy on the committee to be aware of the arcane
dependencies for segmentation properties, and then to police such
decisions in perpetuity -- or at least until the blocks in question
filled up with non-problematical characters.
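
For anyone who wants to see what a given implementation currently
reports for those three ranges, a minimal Python sketch follows. It
assumes the standard unicodedata module; the result depends entirely on
which Unicode version that module was built against, and some of these
code points have been assigned in versions later than the one under
discussion. The names OPEN_RANGES and category_counts are made up for
this example.

```python
import unicodedata
from collections import Counter

# The ranges named in the message, as inclusive (first, last) pairs.
OPEN_RANGES = [(0x1ABF, 0x1AFF), (0x1DF6, 0x1DFA), (0x20F1, 0x20FF)]

def category_counts(first: int, last: int) -> Counter:
    """Count General_Category values over first..last inclusive."""
    return Counter(
        unicodedata.category(chr(cp)) for cp in range(first, last + 1)
    )

if __name__ == "__main__":
    print("Unicode version:", unicodedata.unidata_version)
    for first, last in OPEN_RANGES:
        # gc=Cn means the code point is unassigned in this version.
        print(f"{first:04X}..{last:04X}", dict(category_counts(first, last)))
```

Anything that shows up as other than Cn, Mn or Me in a later version
would be a case of the kind described above.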

So the long answer is also: No.

--Ken
