Extended grapheme cluster stability
Richard Wordingham via Unicode
unicode at unicode.org
Tue May 22 13:27:17 CDT 2018
On Tue, 22 May 2018 14:43:23 +0200
Martinho Fernandes via Unicode <unicode at unicode.org> wrote:
> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
> > Hello,
> > None of the *_Break properties are stable, as far as I can see in
> > https://www.unicode.org/policies/stability_policy.html. If I
> > understand correctly, this means that, at least in theory, it is
> > possible that in Unicode version X a sequence of characters AB
> > forms an extended grapheme cluster, i.e. A × B in the notation used
> > in the algorithm description and in the test data, but then in
> > Unicode version X+1, that changes to A ÷ B.
> > Am I reading this correctly or is this not possible? Or is it
> > possible in theory but not in practice? Or maybe it has happened
> > before?
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In
> Unicode 9 now there are no breaks "between regional indicator (RI)
> symbols if there is an odd number of RI characters before the break
> point". It has also happened in the direction break=>no break,
> when emoji ZWJ sequences were introduced.
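The regional indicator change quoted above can be illustrated with a small sketch. It is a toy, not an implementation of the full grapheme cluster algorithm: it applies only the RI rule and puts every other character in a cluster of its own. The names is_regional_indicator, ri_clusters and the pair_flags switch are made up for this illustration.

```python
def is_regional_indicator(ch: str) -> bool:
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_clusters(text: str, pair_flags: bool = True) -> list[str]:
    """Group regional indicators (RI) into clusters.

    pair_flags=False: the Unicode 8 behaviour described above
                      (never break between two RIs).
    pair_flags=True:  the Unicode 9 behaviour (do not break between two
                      RIs only if an odd number of RIs precedes the
                      break point).
    Every non-RI character becomes its own cluster (toy simplification).
    """
    clusters: list[str] = []
    ri_before = 0  # consecutive RIs immediately before the break point
    for ch in text:
        ri = is_regional_indicator(ch)
        if ri and ri_before > 0 and (not pair_flags or ri_before % 2 == 1):
            clusters[-1] += ch      # A x B: no break
        else:
            clusters.append(ch)     # A / B: break
        ri_before = ri_before + 1 if ri else 0
    return clusters


# Four RIs spelling "FR" then "DE".
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"
old_style = ri_clusters(flags, pair_flags=False)
new_style = ri_clusters(flags, pair_flags=True)
```

With the older rule the four indicators form one cluster; with the newer rule they form two, so a sequence that was A × B in one version becomes A ÷ B in the next.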
These are more refinements of the algorithm than fundamental changes.
However, many of the breaks are inherently uncertain and may therefore
vacillate.
English has uncertainties as to word boundaries, but the author's
decision is represented in writing, e.g. 'beam width' v. 'beamwidth'.
In writing systems without visible boundaries between words, such as
Thai, such vacillation could occur between software versions rather
than between versions of Unicode.
Line break opportunities can in practice vacillate in such writing
systems, e.g. between breaks at syllable boundaries and breaks at word
boundaries.
Formal extended grapheme cluster boundaries have varied in normal,
well established text. In Thai, left matras and consonants were
briefly part of the same grapheme cluster. When that formal property
was implemented in editors, there were howls of pain from Thailand,
and the change was promptly reversed.
I do not believe one rule suits all Indic consonant clusters. While
X virama | Y
makes sense for Devanagari with its half-forms,
X | coeng Y
makes no sense for scripts where it is the second consonant that
changes shape. It makes even less sense when some combinations of
'coeng Y' are encoded separately, as in mainland SE Asia. These
combinations are categorised as marks.
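As a rough illustration of the two boundary placements contrasted above, the sketch below splits a consonant cluster either after the joining sign (X virama | Y) or before it (X | coeng Y). The function names split_after and split_before are hypothetical, and the code shows only the two candidate positions, not what any version of the Unicode segmentation rules prescribes.

```python
DEVANAGARI_VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
KHMER_COENG = "\u17D2"        # KHMER SIGN COENG


def split_after(text: str, joiner: str) -> list[str]:
    """Place a boundary after each joiner: X joiner | Y."""
    parts: list[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch == joiner:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def split_before(text: str, joiner: str) -> list[str]:
    """Place a boundary before each joiner: X | joiner Y."""
    parts: list[str] = []
    current = ""
    for ch in text:
        if ch == joiner and current:
            parts.append(current)
            current = ""
        current += ch
    if current:
        parts.append(current)
    return parts


# Devanagari KA + VIRAMA + SSA: the first consonant takes a half-form,
# so the joiner stays with the first consonant.
devanagari = split_after("\u0915\u094D\u0937", DEVANAGARI_VIRAMA)

# Khmer KA + COENG + KHA: the second consonant changes shape,
# so the joiner stays with the second consonant.
khmer = split_before("\u1780\u17D2\u1781", KHMER_COENG)
```

Neither function is right for every script, which is the point: the side of the joiner on which a boundary is sensible depends on which consonant changes shape.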
In Burma, the syllable boundary comes after U+1A58 TAI THAM SIGN MAI
KANG LAI. In Laos, it comes before it.
We came very close to extended grapheme clusters being extended to
whole aksharas in Unicode 11.0. My view is that Unicode has
attempted to conflate several concepts in grapheme cluster, and it
just doesn't work.