Extended grapheme cluster stability
Asmus Freytag via Unicode
unicode at unicode.org
Tue May 22 16:48:56 CDT 2018
One thing to bear in mind about breaks: Unicode is plain-text and not
"final rendered text".
Many types of breaks depend on things like actual font selection, column
width and other factors determined by styling. They are therefore not
necessarily stable from a plain text perspective (the same goes for
things not specified by Unicode, like hyphenation, because hyphenation,
for example, depends on the actual language associated with a text,
something not part of the plain text back-bone).
The moral is that if you need a frozen representation of text that does
not behave differently if accessed, iterated, viewed etc. at different
times, you need to have some kind of rich-text format that can represent
all segmentation choices. If, on the other hand, you are doing a live
interaction with the text, then Unicode segmentation gives you the "best
available" algorithm - which may change over time as new information
becomes available about what constitutes best practice.
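As a rough illustration of the "frozen representation" idea, here is a
minimal Python sketch that stores the break positions alongside the
text instead of recomputing them later. Everything in it is
hypothetical: the FrozenSegmentation class, the freeze() helper and the
segmenter argument are names invented for this example, and
unicodedata.unidata_version is only a stand-in for whatever Unicode
version the real segmenter implements.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple
import unicodedata


@dataclass(frozen=True)
class FrozenSegmentation:
    """Text plus the break offsets that were chosen when it was frozen."""
    text: str
    breaks: Tuple[int, ...]   # interior break offsets, in code points
    unicode_version: str      # version in effect when the breaks were computed

    def clusters(self) -> List[str]:
        # Slice using the stored offsets only; no segmentation rules are
        # consulted here, so the result cannot change with a later update.
        bounds = (0, *self.breaks, len(self.text))
        return [self.text[a:b] for a, b in zip(bounds, bounds[1:])]


def freeze(text: str,
           segmenter: Callable[[str], Iterable[int]]) -> FrozenSegmentation:
    # `segmenter` is a hypothetical callable that returns interior break
    # offsets. unidata_version is the version of Python's own character
    # database, used here only as an assumed proxy.
    return FrozenSegmentation(text, tuple(segmenter(text)),
                              unicodedata.unidata_version)


frozen = freeze("abcd", lambda t: [2])
print(frozen.clusters())
```

A live editor would instead call the segmenter each time and accept
whatever the current rules produce.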
For many writing systems, the understanding of best practice is still
quite limited at this point - in the sense that even if it is known, it
is not widely available and therefore there has not yet been a chance to
validate and standardize it. (Setting aside areas of actual innovation,
like emoji.) For these reasons, it would be outright detrimental if any
of these algorithms were "frozen" -- however, the hope is that updates
are handled with some sensitivity to avoid unnecessary disruption of
On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote:
> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
>> None of the *_Break properties are stable, as far as I can see in
>> https://www.unicode.org/policies/stability_policy.html. If I understand
>> correctly, this means that, at least in theory, it is possible that in
>> Unicode version X a sequence of characters AB forms an extended grapheme
>> cluster, i.e. A × B in the notation used in the algorithm description
>> and in the test data, but then in Unicode version X+1, that changes to A
>> ÷ B.
>> Am I reading this correctly or is this not possible? Or is it possible
>> in theory but not in practice? Or maybe it has happened before?
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In Unicode 9
> now there are no breaks "between regional indicator (RI) symbols if
> there is an odd number of RI characters before the break point". It has
> also happened in the direction break=>no break, when emoji ZWJ
> sequences were introduced.
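
The regional indicator change quoted above can be sketched in Python.
This is a simplified model, not a grapheme cluster implementation: it
applies only the RI rule and treats every other position as a break.
The function names are invented for this example.

```python
from typing import List


def is_ri(ch: str) -> bool:
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_breaks_unicode8(text: str) -> List[int]:
    """Unicode 8 behaviour as described above: never break between two RI."""
    return [i for i in range(1, len(text))
            if not (is_ri(text[i - 1]) and is_ri(text[i]))]


def ri_breaks_unicode9(text: str) -> List[int]:
    """Unicode 9 behaviour: no break between two RI only when an odd
    number of RI characters precedes the break point."""
    breaks = []
    ri_run = 0  # consecutive RI characters immediately before position i
    for i in range(1, len(text)):
        ri_run = ri_run + 1 if is_ri(text[i - 1]) else 0
        if is_ri(text[i]) and ri_run % 2 == 1:
            continue  # A x B: the two RI stay together as one pair
        breaks.append(i)  # A / B: break allowed by this simplified model
    return breaks


# Four regional indicators: D, E, F, R (the flags DE and FR).
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
print(ri_breaks_unicode8(flags))  # []
print(ri_breaks_unicode9(flags))  # [2]
```

The same four code points form one cluster under the older rule and two
under the newer one, which is the A × B to A ÷ B change asked about.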