Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
verdy_p at wanadoo.fr
Mon Dec 12 16:25:48 CST 2016
I agree. The remaining slots could very well be allocated to some
notational "superscript" (spacing) marks, more or less formed from ligatures,
without really being "extenders" for graphemes, since they could just as well
be used in isolation (I can think of special marks that could be used for
measurement units, or some currencies, or honorific marks, or some
localized variants of symbols like "trademark" or "registered", or some
localized "ampersand" or similar, or some symbol meaning "birth/death"
after or before a date, or simply the encoding of superscript digits for
Western Arabic, or Eastern Arabic for Persian/Urdu, which won't be
"extenders" for any grapheme but will be used in isolation).
The only useful default property is the assignment of a range for strong
RTL letters/digits/punctuation/symbols, because of the complexity and
stability of the BiDi algorithm, the security issues related
to it, and the difficulties for the UI. By contrast, each assigned block
can contain smaller subranges (sometimes smaller than a full column) for
combining marks, which are spread over various positions (but without huge
complexity for handling them in algorithms like normalization, even though
they are necessarily stabilized: the default combining class for all
unencoded characters is simply 0, which blocks any canonical reordering that
would break later encoded documents using the newly assigned code points.
Normalization will reorder or recombine them only once these
code points are assigned to known characters with a known, possibly non-zero,
combining class, but past versions of normalizers will keep them unchanged,
preserving at least the canonical equivalences).
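
To illustrate the point about the default combining class, here is a minimal
sketch using Python's standard unicodedata module. The helper name
first_unassigned and the choice of search range are my own assumptions for
illustration, and which code points count as unassigned depends on the Unicode
version bundled with the Python build. It shows that an unassigned code point
reports combining class 0 and therefore acts as a barrier that canonical
reordering does not cross.

```python
import unicodedata


def first_unassigned(start, end):
    """Return the first code point in [start, end] with gc=Cn, or None.

    Hypothetical helper: the result depends on the Unicode version
    shipped with this Python build.
    """
    for cp in range(start, end + 1):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    return None


# Look in the Combining Diacritical Marks Extended block first, then fall
# back to the whole BMP (which always contains gc=Cn code points).
cp = first_unassigned(0x1AB0, 0x1AFF)
if cp is None:
    cp = first_unassigned(0x0000, 0xFFFF)
unassigned = chr(cp)

# Unassigned code points report the default combining class, 0.
print(f"U+{cp:04X} ccc =", unicodedata.combining(unassigned))

# Two adjacent marks (ccc 230 then ccc 220) get canonically reordered...
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")

# ...but a ccc=0 code point between them prevents the reordering.
blocked = unicodedata.normalize("NFD", "a\u0301" + unassigned + "\u0323")

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in blocked])
```

A normalizer built on an older version of the data sees the later-assigned
code point the same way, as a starter with class 0, so it leaves the
sequence around it untouched.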
2016-12-12 18:30 GMT+01:00 Ken Whistler <kenwhistler at att.net>:
> -------- Forwarded Message --------
> Subject: Re: Should unassigned code points in blocks reserved for
> combining marks, etc be GCB extended?
> Date: Mon, 12 Dec 2016 08:26:45 -0800
> From: Ken Whistler <kenwhistler at att.net>
> To: Karl Williamson <public at khwilliamson.com>
> On 12/12/2016 6:59 AM, Karl Williamson wrote:
> > These are currently GCB Other, but when assigned, don't we know that
> > they will be Extended? So this could be done now.
> Short answer: No.
> Long answer:
> Every proposal to pre-assign some range of unassigned code points a
> non-default character property value for that range has a bunch of
> hidden costs. This proposal would be particularly costly, because it
> would be smack in the middle of some of the properties with the hairiest
> dependency chains.
> GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
> particular change for GCB=Extend would also get reflected into WB=Extend
> and SB=Extend, which are also dependent on Grapheme_Extend=Yes.
> Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
> gc=Me, which would seem to be a natural match for the blocks "reserved"
> for combining marks, but it is actually also dependent on
> Other_Grapheme_Extend, which is a mixed bag of various spacing combining
> marks for normalization closure, plus ZWNJ, plus tag characters, plus
> spacing halfwidth dakuten.
> So that would raise complicated questions about *how* GCB=Extend would
> itself be extended to include certain set ranges of unassigned code
> points. Would those simply be assigned directly to Grapheme_Extend=Yes
> (which would create a complicated default assignment for that derived
> property, and complicate both its documentation *and* its derivation)?
> Or would they be assigned directly to Other_Grapheme_Extend (which would
> create a new animal in the zoo of properties -- a contributory property
> which itself has ranges of unassigned code points given non-default
> values). And once decided, what would be the implications for all the
> documentation and the tooling?
> Any proposal like this then also has hidden costs on the committees,
> because it sets up implied requirements for what can be encoded where
> and what properties it has to have. Every time such defaults are set up,
> it makes the documentation of what is already "pre-assigned" more
> complicated and fragile. Already, a large proportion of the participants
> in the maintenance committees have very murky understandings about what
> can and cannot be put where in the future, and why. And that is a recipe
> for mistakes in encoding.
> Finally, like it or not, there currently is no actual contract
> guaranteeing that the remaining open ranges in blocks "reserved" for
> combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
> ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
> prevent the committees from deciding that one (or more) spacing
> combining marks might be appropriate to encode there, or possibly even
> spacing non-combining marks of some strange sort, like the spacing
> Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
> those ranges free of characters that would not be Grapheme_Extend=Yes
> would require some guy on the committee to be aware of the arcane
> dependencies for segmentation properties, and then to police such
> decisions in perpetuity -- or at least until the blocks in question
> filled up with non-problematical characters.
> So the long answer is also: No.
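
The derivation Ken describes above (Grapheme_Extend=Yes from gc=Mn or gc=Me,
plus Other_Grapheme_Extend) can be sketched roughly as follows. This is only
an illustration under stated assumptions: Python's unicodedata module does not
expose Grapheme_Extend, so the function is_grapheme_extend is a hypothetical
approximation, and OTHER_GRAPHEME_EXTEND_SAMPLE is a deliberately partial list
covering only ZWNJ, the halfwidth dakuten pair, and the tag characters (it
omits the spacing marks included for normalization closure). The three open
ranges are the ones quoted in the message, as of 2016; some may have been
filled since, depending on the Unicode version in use.

```python
import unicodedata

# Partial, illustrative sample of Other_Grapheme_Extend (assumption: not
# the full list from PropList.txt).
OTHER_GRAPHEME_EXTEND_SAMPLE = (
    (0x200C, 0x200C),    # ZERO WIDTH NON-JOINER
    (0xFF9E, 0xFF9F),    # halfwidth katakana voiced/semi-voiced sound marks
    (0xE0020, 0xE007F),  # tag characters
)

# Open ranges in the combining-mark blocks, as listed in the message above.
OPEN_RANGES_2016 = ((0x1ABF, 0x1AFF), (0x1DF6, 0x1DFA), (0x20F1, 0x20FF))


def is_grapheme_extend(cp):
    """Approximate Grapheme_Extend: gc in {Mn, Me} or in the sample list."""
    if unicodedata.category(chr(cp)) in ("Mn", "Me"):
        return True
    return any(lo <= cp <= hi for lo, hi in OTHER_GRAPHEME_EXTEND_SAMPLE)


def still_unassigned(ranges):
    """List code points in the given ranges that are gc=Cn in this build."""
    return [
        cp
        for lo, hi in ranges
        for cp in range(lo, hi + 1)
        if unicodedata.category(chr(cp)) == "Cn"
    ]


open_slots = still_unassigned(OPEN_RANGES_2016)
print("Unicode data version:", unicodedata.unidata_version)
print("open slots remaining:", len(open_slots))
# Unassigned code points are gc=Cn, so the derivation gives them the default.
print("any treated as extending:", any(is_grapheme_extend(cp) for cp in open_slots))
```

Pre-assigning those slots would mean special-casing either the derived
property or the contributory list, which is the complication the message
describes.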