UAX#29 Word-Breaking Interface for Complex Context

Richard Wordingham richard.wordingham at ntlworld.com
Sun Aug 23 09:15:39 CDT 2015


The word-breaking algorithm defines an apparently innocuous interface
for word breaking of 'complex context' scripts such as Thai, Lao and
Myanmar.  The complex context part, whose internals are deliberately
and reasonably not defined by Unicode, assigns word break property
values to the characters.  Are there any implementations that work that
way? Negative answers such as 'xxx does not work that way' would also be
useful.

For example, ICU does not work this way.  Instead, the complex context
parts deliver word boundaries rather than character properties to the
part of the algorithm working in accordance with a tailoring of the
algorithm in UAX#29.
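
As a rough illustration of the boundary-delivering design described above, here is a minimal sketch in which the complex context part returns offsets instead of property values. It is not ICU's API: the names `THAI_RUN`, `LEXICON`, `segment_run` and `complex_context_boundaries` are hypothetical, the lexicon is a two-word toy, and greedy longest match merely stands in for a real dictionary breaker.

```python
import re

# Crude stand-in for "complex context": any run of Thai-block code points.
THAI_RUN = re.compile(r"[\u0E00-\u0E7F]+")
# Toy dictionary (hypothetical); a real breaker would use a large lexicon.
LEXICON = ("ภาษา", "ไทย")

def segment_run(run):
    """Greedy longest-match segmentation; returns word-end offsets within run."""
    ends, pos = [], 0
    while pos < len(run):
        # Longest lexicon word starting at pos, else fall back to one character.
        match = max((w for w in LEXICON if run.startswith(w, pos)),
                    key=len, default=run[pos])
        pos += len(match)
        ends.append(pos)
    return ends

def complex_context_boundaries(text):
    """Absolute offsets of the boundaries found inside complex-context runs."""
    found = set()
    for m in THAI_RUN.finditer(text):
        for end in segment_run(m.group()):
            found.add(m.start() + end)
    return sorted(found)

result = complex_context_boundaries("a ภาษาไทย b")
print(result)
```

In this design the rule-based part would merge these offsets with the boundaries it finds itself; no word break property value is ever assigned to a Thai character.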

In general, the assignments may be a little complicated.
For example, in the usual case of interest, a run of Thai-script word
characters delimited by white space, it seems to me that the characters
of alternate words should be assigned to 'ALetter' and 'Katakana', so
that a boundary falls between consecutive words but not within a word.
Have I missed a trick here?  'RI' (Regional_Indicator) is a new
alternative to 'ALetter' and 'Katakana', but that seems even more
bizarre, and I'd worry about its stability.
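
The alternating assignment can be sketched as follows. This is a simplification under stated assumptions: the words are taken as already segmented (the list `words` is hypothetical input), every character of a word receives the same class, and only the two rules that keep 'ALetter' next to 'ALetter' and 'Katakana' next to 'Katakana' are modelled. Marks that would normally be 'Extend' are not treated specially.

```python
def assign_properties(words):
    """Give every character of word i the class ALetter (even i) or Katakana (odd i)."""
    props = []
    for i, word in enumerate(words):
        cls = "ALetter" if i % 2 == 0 else "Katakana"
        props.extend((ch, cls) for ch in word)
    return props

def boundaries(props):
    """Character offsets inside the run where the simplified rules allow a break."""
    # Only these two pairs are joined in this sketch; any other pair breaks.
    no_break = {("ALetter", "ALetter"), ("Katakana", "Katakana")}
    return [i for i in range(1, len(props))
            if (props[i - 1][1], props[i][1]) not in no_break]

words = ["ภาษา", "ไทย"]           # hypothetical output of a Thai segmenter
props = assign_properties(words)
breaks = boundaries(props)
print(breaks)
```

Feeding the assigned classes back through the rules recovers exactly the word boundary the segmenter chose, which is the point of alternating the two classes.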

I'm finding some interesting constraints arising from the interface.
For example, *within* xกy (that's a Thai letter flanked by two English
letters), there are either no or two word boundaries.  By contrast,
there may be no, one or two linebreak opportunities *within* the string.
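
The 'none or two' constraint can be checked by brute force: fix the outer letters as 'ALetter', try each word break property value for the middle character, and count the boundaries inside the string. The sketch below assumes the property values and a simplified subset of the rules as they stood around Unicode 8.0; the rule numbers in the comments follow that text, and the function names are hypothetical.

```python
VALUES = ["Other", "CR", "LF", "Newline", "Extend", "Regional_Indicator",
          "Format", "Katakana", "Hebrew_Letter", "ALetter", "Single_Quote",
          "Double_Quote", "MidNumLet", "MidLetter", "MidNum", "Numeric",
          "ExtendNumLet"]

AH = {"ALetter", "Hebrew_Letter"}
MID_LETTER = {"MidLetter", "MidNumLet", "Single_Quote"}
HARD = {"CR", "LF", "Newline"}
IGNORED = {"Extend", "Format"}
JOINABLE = AH | {"Numeric", "Katakana"}

def pair_joins(a, b):
    """True if the simplified pairwise rules forbid a break between a and b."""
    if a in HARD or b in HARD:
        return False                      # WB3a, WB3b
    if a in AH and b in AH:
        return True                       # WB5
    if (a in AH and b == "Numeric") or (a == "Numeric" and b in AH):
        return True                       # WB9, WB10
    if a == b and a in {"Numeric", "Katakana", "Regional_Indicator",
                        "ExtendNumLet"}:
        return True                       # WB8, WB13, WB13c, WB13a
    if (a in JOINABLE and b == "ExtendNumLet") or \
       (a == "ExtendNumLet" and b in JOINABLE):
        return True                       # WB13a, WB13b
    return False

def breaks_within(mid):
    """Break offsets strictly inside the sequence ALetter, mid, ALetter."""
    if mid in IGNORED:
        return []                         # WB4 skips mid; WB5 then joins the letters
    if mid in MID_LETTER:
        return []                         # WB6, WB7
    out = []
    if not pair_joins("ALetter", mid):
        out.append(1)
    if not pair_joins(mid, "ALetter"):
        out.append(2)
    return out

counts = {len(breaks_within(v)) for v in VALUES}
print(counts)
```

Under these assumptions no property value yields exactly one boundary, which matches the observation about xกy.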

Richard.

 


