New Character Property for Prepended Concatenation Marks

Philippe Verdy verdy_p at wanadoo.fr
Thu Nov 26 06:29:41 CST 2015


2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-inc at ix.netcom.com>:

> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>
> The related definition for extended grapheme clusters says:
>
> ( CRLF
> | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>            ( Grapheme_Extend | SpacingMark )*
> | . )
>
> However, I do not understand why it may include only one Hangul-Syllable
> when applying prepended concatenation marks. And if the definition excludes
> whitespace, nothing prevents it from extending to arbitrary sequences of
> letters/digits/symbols/punctuation (this could span very long sequences of
> sinograms, or other letters from scripts that do not use whitespace as a
> word separator). Even in the Latin script it would extend to the
> punctuation signs that may follow any word, or to an entire mathematical
> formula such as "1+2*3" but not "sin x"...
>
>
> White space is clearly NOT part of a grapheme cluster, so I don't see what
> your issue is?
>

No, whitespace is a grapheme cluster on its own, matching ( . )

The issue is that any overlong extended grapheme cluster after a Prepend
can only come from the trailing ( Grapheme_Extend | SpacingMark )* part.
The base ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we
ignore the rare RI-sequences, which are still short), so it will not match
the whole sequence of digits or letters intended by the prepended
concatenation marks, but only the first one.
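
To make this concrete, here is a rough Python sketch of the quoted rule. It
is only an approximation under my own assumptions: the PREPEND set is a
hand-picked list of prepended concatenation marks treated as Prepend, the
Control, Grapheme_Extend and SpacingMark classes are approximated with
general categories, and the RI-sequence and Hangul-Syllable alternatives
are left out, so the base is always a single !Control character. The
function name clusters() is mine.

```python
import unicodedata

# Assumption for this sketch: these prepended concatenation marks are
# treated as Grapheme_Cluster_Break=Prepend.
PREPEND = {0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605, 0x06DD, 0x070F}

def is_prepend(ch):
    return ord(ch) in PREPEND

def is_control(ch):
    # Rough stand-in for Control: Cc, Cf, Zl, Zp, minus the Prepend set.
    return (not is_prepend(ch)
            and unicodedata.category(ch) in ("Cc", "Cf", "Zl", "Zp"))

def is_extend_or_spacing(ch):
    # Rough stand-in for ( Grapheme_Extend | SpacingMark ): Mn, Me, Mc.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def clusters(text):
    """Split text following the simplified rule:
    CRLF | Prepend* !Control ( Grapheme_Extend | SpacingMark )* | ."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith("\r\n", i):
            j = i + 2
        else:
            j = i
            while j < n and is_prepend(text[j]):   # Prepend*
                j += 1
            if j < n and not is_control(text[j]):  # exactly ONE base
                j += 1
                while j < n and is_extend_or_spacing(text[j]):
                    j += 1
            elif j == i:                           # "." : lone control
                j += 1
        out.append(text[i:j])
        i = j
    return out
```

With this sketch, clusters("a b") gives three clusters, the space being one
of them, and a Prepend followed by two digits gives two clusters: the mark
with the first digit, then the second digit alone.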


> BTW, if after careful analysis you think there is a mistake, you should
> probably raise a bug on this.
>

For now the proposal only speaks about enumerating the prepended characters
with a newly defined property; it still does not address which sequences of
graphemes they apply to. As these sequences are specific to each prepended
character, I don't see how the new property will help if we need to
special-case each one of these characters: we still need a custom algorithm
(possibly tailored by locale) for breaking clusters that use them.

With the definition given above, the extended grapheme clusters will break
after each letter/digit/punctuation, and
 <ARABIC NUMBER SIGN, ARABIC-INDIC DIGIT ONE, ARABIC-INDIC DIGIT TWO>
will still break into
  <ARABIC NUMBER SIGN, ARABIC-INDIC DIGIT ONE> separated from <ARABIC-INDIC DIGIT TWO>
The proposed new property does not change this: how can we really extend
the sequence of digits so that the number sign will span all of them? Use
CGJ or explicit sequence delimiters?
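
One possible shape for such a custom algorithm, purely as an illustration
and not something the proposal defines: a per-character table saying what
each prepended mark is allowed to span. Everything here is hypothetical
(the SCOPE table, the choice of "run of decimal digits" as the scope, and
the function name tailored_units); combining marks and the other cluster
rules are ignored to keep it short.

```python
import unicodedata

def is_decimal_digit(ch):
    # General category Nd covers ASCII and Arabic-Indic digits alike.
    return unicodedata.category(ch) == "Nd"

# Hypothetical tailoring table: each prepended mark maps to a predicate
# describing the characters it may span. The entries are assumptions.
SCOPE = {
    "\u0600": is_decimal_digit,  # ARABIC NUMBER SIGN
    "\u06DD": is_decimal_digit,  # ARABIC END OF AYAH
}

def tailored_units(text):
    """Split text into units where a prepended mark absorbs the whole
    run of characters in its scope; every other character stands alone."""
    units = []
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        in_scope = SCOPE.get(text[i])
        if in_scope is not None:
            while j < n and in_scope(text[j]):
                j += 1
        units.append(text[i:j])
        i = j
    return units
```

Under these assumptions, U+0600 U+0661 U+0662 (number sign, digit one,
digit two) stays together as a single unit, which is the behavior the
plain grammar above cannot give.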