Extended grapheme cluster stability

Asmus Freytag via Unicode unicode at unicode.org
Tue May 22 16:48:56 CDT 2018

One thing to bear in mind about breaks: Unicode is plain-text and not 
"final rendered text".

Many types of breaks depend on things like actual font selection, column 
width and other factors determined by styling. They are therefore not 
necessarily stable from a plain text perspective. The same goes for 
things not specified by Unicode, like hyphenation, which depends on the 
actual language associated with a text, something that is not part of 
the plain text backbone.

The moral is that if you need a frozen representation of text that does 
not behave differently if accessed, iterated, viewed etc. at different 
times, you need to have some kind of rich-text format that can represent 
all segmentation choices. If, on the other hand, you are doing a live 
interaction with the text, then Unicode segmentation gives you the "best 
available" algorithm - which may change over time as new information 
becomes available about what constitutes best practice.

For many writing systems, the understanding of best practice is still 
quite limited at this point - in the sense that even if it is known, it 
is not widely available and therefore there has not yet been a chance to 
validate and standardize it. (This sets aside areas of actual innovation, 
like emoji.) For these reasons, it would be outright detrimental if any 
of these algorithms were "frozen". The hope, however, is that updates 
are handled with some sensitivity to avoid unnecessary disruption of 
settled practice.


On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote:
> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
>> Hello,
>> None of the *_Break properties are stable, as far as I can see in
>> https://www.unicode.org/policies/stability_policy.html. If I understand
>> correctly, this means that, at least in theory, it is possible that in
>> Unicode version X a sequence of characters AB forms an extended grapheme
>> cluster, i.e. A × B in the notation used in the algorithm description
>> and in the test data, but then in Unicode version X+1, that changes to A
>> ÷ B.
>> Am I reading this correctly or is this not possible? Or is it possible
>> in theory but not in practice? Or maybe it has happened before?
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In Unicode 9
> now there are no breaks "between regional indicator (RI) symbols if
> there is an odd number of RI characters before the break point". It has
> also happened in the direction break=>no break, when emoji ZWJ
> sequences were introduced.
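
As a rough illustration of the regional indicator change described in the
quoted message, the following Python sketch computes break positions inside
a run of regional indicator (RI) characters under both behaviours. The
function name ri_breaks and its pair_rule flag are invented for this
example; the sketch handles only strings made up entirely of RI characters
and is not an implementation of the full segmentation algorithm.

```python
# Regional indicator symbols occupy the range U+1F1E6..U+1F1FF.
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF


def is_ri(ch):
    """Return True if ch is a regional indicator symbol."""
    return RI_FIRST <= ord(ch) <= RI_LAST


def ri_breaks(text, pair_rule):
    """Return the interior break offsets (in code points) for an RI-only string.

    pair_rule=False models the older behaviour: no breaks between RIs.
    pair_rule=True models the newer behaviour: no break when an odd
    number of RI characters precedes the candidate break point.
    (Hypothetical helper, for illustration only.)
    """
    if not all(is_ri(ch) for ch in text):
        raise ValueError("this sketch only handles regional indicators")
    if not pair_rule:
        return []  # A x B everywhere: the whole run is one cluster
    breaks = []
    for i in range(1, len(text)):
        # Every preceding character is an RI, so i is the RI count
        # before the candidate break point.
        if i % 2 == 0:
            breaks.append(i)  # even count before: A / B (break allowed)
    return breaks


# Two flags back to back: RI D, RI E, RI F, RI R
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
print(ri_breaks(flags, pair_rule=False))
print(ri_breaks(flags, pair_rule=True))
```

With the older behaviour the four characters form a single cluster; with
the pairing rule a break appears between the second and third character,
so the same plain text segments differently depending on the version of
the rules in use.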
