New Character Property for Prepended Concatenation Marks

Asmus Freytag (t) asmus-inc at
Thu Nov 26 06:58:44 CST 2015

On 11/26/2015 4:29 AM, Philippe Verdy wrote:
> 2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-inc at>:
>     On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>>     The related definition for extended grapheme clusters says:
>>     ( CRLF
>>     | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>>              ( Grapheme_Extend | SpacingMark )*
>>     | . )
>>     However I do not understand why it may include only one
>>     Hangul-Syllable when applying prepended concatenation marks. And
>>     if the definition excludes whitespace, nothing prevents it from
>>     extending to arbitrary sequences of
>>     letters/digits/symbols/punctuation (this could span very long
>>     sequences of sinograms, or other letters from scripts that do not
>>     use whitespace as a word separator). Even in the Latin script it
>>     would extend to the punctuation signs that may follow any word,
>>     or to an entire mathematical formula such as "1+2*3" but not "sin
>>     x"...
>     White space is clearly NOT part of a grapheme cluster, so I don't see
>     what your issue is?
> No, whitespace is a grapheme cluster on its own, matching (.)
> The issue is the overlong extended grapheme cluster that occurs after 
> any Prepend because of ( Grapheme_Extend | SpacingMark )*
> But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we 
> ignore the rare RI-sequences, which are still short) and will 
> not match the sequences of digits or letters intended by the prepended 
> concatenation marks, but only one.

Prepend in front of an RI-Sequence is really a "defective" cluster in 
terms of the Arabic number sign's definition. So, one thing the Grapheme 
cluster specification should be clear about is that it does not describe 
the breaks in formatting runs needed to implement these characters.
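
To make the boundary behavior concrete, here is a minimal Python sketch of 
the rule quoted above. It is not a conformant UAX #29 implementation. It 
assumes, purely for illustration, that U+0600 ARABIC NUMBER SIGN is the only 
Prepend character, it approximates Grapheme_Extend and SpacingMark by the 
general categories Mn/Me/Mc and Control by Cc, and it leaves out RI-sequences 
and Hangul syllables. The names PREPEND and clusters are made up for this 
sketch.

```python
import unicodedata

# Assumption for illustration only: treat ARABIC NUMBER SIGN as the sole Prepend.
PREPEND = {"\u0600"}

def is_extend(ch):
    # Rough stand-in for Grapheme_Extend | SpacingMark: any combining mark.
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def is_control(ch):
    # Rough stand-in for Control: general category Cc only.
    return unicodedata.category(ch) == "Cc"

def clusters(text):
    """Split text with the simplified rule
    CRLF | Prepend* !Control (Extend | SpacingMark)* | .
    """
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2
        else:
            j = i
            while j < n and text[j] in PREPEND:
                j += 1
            if j < n and not is_control(text[j]):
                j += 1  # exactly one base character
                while j < n and is_extend(text[j]):
                    j += 1
            elif j == i:
                j += 1  # a lone control character matches '.'
            # otherwise: a run of Prepend with no base forms a cluster by itself
            i = j
        out.append(text[start:i])
    return out
```

With these assumptions, clusters("\u0600123") yields three clusters: the 
number sign together with "1", then "2", then "3". The sign picks up one 
following character, not the digit run it is meant to span.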

Also, for editing (a common use of grapheme clusters) running these 
together with any following characters is not very useful in my opinion. 
So, perhaps much of the "Prepend" is a bug after all?
>     BTW, if after careful analysis you think there is a mistake, you
>     should probably raise a bug on this.
> For now the proposal only speaks about enumerating the prepended 
> characters with a newly defined property; it still does not 
> address what sequences of graphemes they apply to. As 
> these sequences are specific to each prepended character, I don't see 
> how the new property will help if we need to specialize each one of 
> these characters: we still need a custom algorithm (possibly tailored by 
> locale) for breaking clusters using them.

correct - I wouldn't call that an "algorithm" -- it's the formatting 
behavior for that code point (some of them are similar; as I said, I see 
three patterns: following digit, digit run and word run).
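
As a rough sketch of what those three patterns could mean for the extent of 
a format run, the Python below computes where the run after a prepended mark 
would end. The pattern names, the function run_end, and the character tests 
(general category Nd for digits, L* and M* for word characters) are 
assumptions made for this sketch. Nothing here says which mark uses which 
pattern.

```python
import unicodedata

def is_digit(ch):
    # Assumption: "digit" means general category Nd (decimal digit).
    return unicodedata.category(ch) == "Nd"

def is_word_char(ch):
    # Assumption: "word" characters are letters and combining marks.
    return unicodedata.category(ch)[0] in ("L", "M")

def run_end(text, mark_index, pattern):
    """Return the index just past the characters spanned by the mark
    at mark_index, for one of three hypothetical patterns."""
    i = mark_index + 1
    if pattern == "following_digit":
        # At most one digit directly after the mark.
        return i + 1 if i < len(text) and is_digit(text[i]) else i
    if pattern == "digit_run":
        accepts = is_digit
    elif pattern == "word_run":
        accepts = is_word_char
    else:
        raise ValueError("unknown pattern: " + pattern)
    while i < len(text) and accepts(text[i]):
        i += 1
    return i
```

For the text "\u0600123 abc" with the mark at index 0, "following_digit" 
ends the run at index 2 and "digit_run" at index 4. Neither extent is what 
the grapheme cluster rule produces for every mark, which is why the format 
runs need their own description.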
> With the definition given above, the extended grapheme clusters will 
> break after each letter/digit/punctuation and
> will still break into
> The proposed new property does not change this: how can we really 
> extend the sequence of digits so that the number sign will span all of 
> them? Use CGJ or explicit sequence delimiters?
correct, the cluster definition gives an incorrect specification - we need 
an actual specification for the format runs.

