UAX #29: Ambiguities in WB4, and contributing back testcases

Richard Wordingham richard.wordingham at ntlworld.com
Thu Dec 22 16:58:10 CST 2016


On Thu, 22 Dec 2016 14:05:18 -0800
Manish Goregaokar <manish at mozilla.com> wrote:

> I guess the confusion is, with → rules, do we apply them globally, or
> only apply them when considering subsequent rules?

I would say the latter.  The logic is that you apply the whole set of
rules on either side of each character.

> I suspect the answer here is that you only apply them in order. The
> list of rules is not a list of precedences, but rather a list with the
> order in which the rules are applied. So a → rule means "Treat the
> left side as if it were the right side in the context of all
> subsequent rules"

I would indeed say that you apply them in order.  The relevant example
in the test suite (file auxiliary/WordBreakTest.txt in the UCD) is:

÷ 000D ÷ 0308 ÷ 000A ÷
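
To illustrate why that test line distinguishes the two readings, here is
a minimal Python sketch. It is not ICU's or anyone else's implementation:
the property table is hand-written and hypothetical, covering only the
code points in the example, and the function names are invented for
illustration. Only WB1 to WB4 are modelled; WB5 onwards are collapsed
into the default "break everywhere" rule.

```python
# Hypothetical, hand-written property table: just enough for the example.
WB_PROP = {0x000D: "CR", 0x000A: "LF", 0x0308: "Extend"}
NEWLINES = {"CR", "LF", "Newline"}
IGNORABLE = {"Extend", "Format", "ZWJ"}


def wb(cp):
    """Word_Break property of a code point (assumed 'Other' if unlisted)."""
    return WB_PROP.get(cp, "Other")


def is_break_in_order(cps, i):
    """Rules tried in their listed order at the position before cps[i]."""
    if i == 0 or i == len(cps):
        return True                       # WB1, WB2
    left, right = wb(cps[i - 1]), wb(cps[i])
    if left == "CR" and right == "LF":
        return False                      # WB3
    if left in NEWLINES:
        return True                       # WB3a
    if right in NEWLINES:
        return True                       # WB3b
    if right in IGNORABLE:
        return False                      # WB4: X (Extend | Format | ZWJ)* → X
    return True                           # WB999 (WB5..WB17 omitted)


def is_break_global(cps, i):
    """The other reading: WB4 applied everywhere before any other rule."""
    if i == 0 or i == len(cps):
        return True                       # WB1, WB2
    right = wb(cps[i])
    if right in IGNORABLE:
        return False                      # ignorable always joins what precedes
    j = i - 1
    while j > 0 and wb(cps[j]) in IGNORABLE:
        j -= 1                            # look through the absorbed characters
    left = wb(cps[j])
    if left == "CR" and right == "LF":
        return False                      # WB3 now sees CR × LF
    return True                           # WB3a, WB3b, WB999 all break


def boundaries(cps, rule):
    """All offsets at which `rule` reports a break."""
    return [i for i in range(len(cps) + 1) if rule(cps, i)]


sample = [0x000D, 0x0308, 0x000A]
print(boundaries(sample, is_break_in_order))  # [0, 1, 2, 3]
print(boundaries(sample, is_break_global))    # [0, 3]
```

The in-order reading reproduces the four ÷ marks of the test line,
because WB3a and WB3b fire before WB4 gets a chance to attach U+0308 to
the CR.  The global reading would leave the whole sequence unbroken.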

Now, I am not sure whether it is possible to turn the rules
mechanically into a break iterator based on regular expressions.  The
last time I looked, ICU was doing this by manual conversion.  I would
therefore deduce that an automatic conversion is impossible, difficult,
or produces highly inefficient code.  ICU has the added complication
that it also needs to invoke real Southeast Asian break iterators.  When
I looked, their interface did not return appropriate word-break
properties for the characters; it was itself a break iterator.

Richard.


