Atomicity of Grapheme Clusters

Richard Wordingham via Unicode unicode at unicode.org
Wed Dec 13 12:36:35 CST 2017


I have been reviewing UAX#29 Unicode Text Segmentation because I have a
feeling we will be trying to do too much with the concept of grapheme
clusters, even with tailoring, when we extend it to include whole
aksharas.

What is the meaning of "Word boundaries, line boundaries, and sentence
boundaries should not occur within a grapheme cluster: in other words,
a grapheme cluster should be an atomic unit with respect to the process
of determining these other boundaries"?  In particular, whom is it
directed to?

Now, once quadrate support is added and we are able to write Ancient
Egyptian in Unicode, we will probably have two very significant
languages that regularly breach parts of that rule.  (At least, I
assume a whole Egyptian quadrate would be included in a dropped
capital.) Sanskrit word boundaries frequently occur within *legacy*
grapheme clusters, and sentence boundaries may occur within quadrates.
I presume UAX#29 does not intend that we should use means other than
Unicode to write samhita Sanskrit and Ancient Egyptian.
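
To illustrate the Sanskrit case, here is a minimal Python sketch. It
makes two assumptions that are mine, not UAX#29's: it approximates a
legacy grapheme cluster as a base character followed by any characters
of general category Mn or Me (the real Grapheme_Extend property covers
a little more), and the helper names are hypothetical. The sample is
"tat" + "eva", which sandhi joins as "tadeva"; the word boundary then
falls between DA and the following VOWEL SIGN E, inside one cluster.

```python
import unicodedata


def legacy_clusters(text):
    """Approximate legacy grapheme clusters: base + trailing Mn/Me marks.

    Assumption: Mn/Me stands in for Grapheme_Extend, which is close
    enough for this Devanagari sample but is not the full UAX#29 rule set.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


def boundary_inside_cluster(text, offset):
    """Return True if a boundary at code point `offset` splits a cluster."""
    pos = 0
    for cluster in legacy_clusters(text):
        if pos < offset < pos + len(cluster):
            return True
        pos += len(cluster)
    return False


# "tadeva" = "tat" + "eva": TA, DA, VOWEL SIGN E, VA
samhita = "\u0924\u0926\u0947\u0935"

# The word boundary lies after DA (end of "tad"), before VOWEL SIGN E.
word_boundary = 2

print(legacy_clusters(samhita))
print(boundary_inside_cluster(samhita, word_boundary))
```

Under these assumptions the middle cluster is DA plus VOWEL SIGN E, so
a word breaker that treats clusters as atomic cannot report the
boundary at offset 2.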

Richard.

