New Character Property for Prepended Concatenation Marks

Philippe Verdy verdy_p at
Thu Nov 26 07:04:27 CST 2015

Also, for Kaithi (TUS-15.2 pages 570-571) I note this paragraph:

The character U+110BD kaithi number sign is a format control character that
interacts with digits, occurring either above or below a digit. The
position of the kaithi number sign indicates its usage: when the mark
occurs above a digit, it indicates
a number in an itemized list, similar to U+2116 numero sign. If it occurs
below a digit, it indicates a numerical reference. Like U+0600 arabic
number sign and the other Arabic signs that span numbers (see Section 9.2,
Arabic), the kaithi number sign precedes the numbers they graphically
interact with, rather than following them, as would combining characters.
The U+110BC kaithi enumeration sign is the spacing version of the kaithi
number sign, and is used for inline usage.

However, there is absolutely no indication of how to disambiguate the two
usages and presentations if these are unified within the same U+110BD
character. In both cases it will be encoded before the Kaithi digits. Note
that U+110BC is a separate standalone usage (as a symbol without any
number) which is a priori much more limited. Possibly something was
forgotten there:
- add an additional (joiner) control between it and the digits for the
numeric reference (e.g. note calls), and none for itemized lists (including
when numbering section headings)?
- or encode a separate character for its usage in numeric reference (below

In the Latin script, the two usages are generally distinguished, but no
specific mark is used (with the exception of the legacy Numero symbol), and
there's no need to tweak the default presentation of clusters:
- the "numero" symbol or abbreviation (N or n + superscript o) is used for
references, or the number itself is put in superscript or between
- but for itemized lists, the indicator is typically a suffix after the
number (e.g. a dot or hyphen punctuation before the item itself, or within
the item itself a superscript "o" or "a", or superscripted final
abbreviation, such as "e", "er" in French, "st", "nd", "rd" in English...)

2015-11-26 13:29 GMT+01:00 Philippe Verdy <verdy_p at>:

> 2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-inc at>:
>> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>> The related definition for extended grapheme clusters says:
>> ( CRLF
>> | *Prepend* *( RI-sequence | Hangul-Syllable | !Control )
>>            ( Grapheme_Extend | *SpacingMark* )*
>> | . )
>> However I do not understand why it may include only one Hangul-Syllable
>> when applying prepended concatenation marks. And if the definition excludes
>> whitespace, nothing prevents it from extending to arbitrary sequences of
>> letters/digits/symbols/punctuation (this could span very long sequences of
>> sinograms, or other letters from scripts that do not use whitespace as
>> word separators). Even in the Latin script it would extend to the
>> punctuation signs that may follow any word, or to an entire mathematical
>> formula such as "1+2*3" but not "sin x"...
>> White space is clearly NOT part of a grapheme cluster, so I don't see what
>> your issue is?
> No, whitespace is a grapheme cluster on its own, matching (.)
> The issue is the overlong extended grapheme cluster after any Prepend
> occurs because ( Grapheme_Extend | *SpacingMark* )*
> But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore
> the rare RI-sequences, which are still short) and will not match
> the sequences of digits or letters intended by the prepended concatenation
> marks, but only one.
>> BTW, if after careful analysis you think there is a mistake, you should
>> probably raise a bug on this.
> For now the proposal only speaks about enumerating the prepended
> characters with a newly defined property; it still does not address what
> are the sequences of graphemes over which they apply. As these sequences
> are specific to each prepended character, I don't see how the new property
> will help if we need to specialize each one of these characters: we still
> need a custom algorithm (possibly tailored by locale) for breaking clusters
> using them.
> With the definition given above, the extended grapheme clusters will break
> after each letter/digit/punctuation and
> will still break into
> The proposed new property does not change this: how can we really extend
> the sequence of digits so that the number sign will span all of them? Use
> CGJ or explicit sequence delimiters?
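
To make the point about the bounded base concrete, here is a rough Python
sketch of the rule quoted above. It is only an illustration, not the
reference segmentation algorithm. It assumes that U+0600 and U+110BD carry
Grapheme_Cluster_Break=Prepend (the PREPEND set below is my own hand-made
table), it approximates Control, Grapheme_Extend and SpacingMark by general
category, and it leaves out RI-sequences and Hangul syllables entirely. The
function name clusters() is hypothetical.

```python
import unicodedata

# Assumption: characters treated as Prepend for this illustration only.
PREPEND = {"\u0600", "\U000110BD"}

def is_control(ch):
    # Rough stand-in for Control: Cc, Cf, Zl, Zp, minus the Prepend set.
    return ch not in PREPEND and unicodedata.category(ch) in {"Cc", "Cf", "Zl", "Zp"}

def is_extend(ch):
    # Rough stand-in for ( Grapheme_Extend | SpacingMark ).
    return unicodedata.category(ch) in {"Mn", "Me", "Mc"}

def clusters(text):
    """Split text following the quoted rule (without RI and Hangul)."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2                      # CRLF
        else:
            while i < n and text[i] in PREPEND:
                i += 1                  # Prepend*
            if i < n and not is_control(text[i]):
                i += 1                  # exactly ONE base: !Control
                while i < n and is_extend(text[i]):
                    i += 1              # ( Grapheme_Extend | SpacingMark )*
            elif i == start:
                i += 1                  # fallback: any single character
        out.append(text[start:i])
    return out

# The number sign attaches to the first digit only, not to the whole number.
assert clusters("\u0600123") == ["\u06001", "2", "3"]
# Whitespace forms a cluster of its own.
assert clusters("a b") == ["a", " ", "b"]
```

Under these assumptions the sign clusters with a single digit, and the
remaining digits break off as separate clusters, which is the behaviour
described above: the property alone does not say how far the sign spans.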