Potential contradiction between the WordBreak test data and UAX #29

Tom Hacohen tom at osg.samsung.com
Wed Nov 23 04:22:59 CST 2016


On 23/11/16 10:01, Daniel Bünzli wrote:
> On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
>> However, looking at the test case and the UAX[2], this does not look
>> correct. More specifically, because of rule 4:
>> ZWJ Extend GAZ -> ZWJ GAZ
>> And then according to rule 3c, there should be no break opportunity
>> between them.
>
> I'd say this is not the right operational model. From [1]:
>
> "The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated."
>
> So in this case, between COMBINING DIAERESIS and HEAVY BLACK HEART, rule WB4 kicks in. It does not produce a boundary status; it only changes your offset context to ZWJ GAZ, as you mention. Now you continue applying the rules sequentially with WB6, which does not match, with WB7, which does not match, ... and you'll get to WB999, which matches and produces a boundary status.
>
> After WB4 you do not restart the matching process from the beginning, as you do, leading you to say that WB3c should apply.

Hey Daniel,

Thank you for your reply, but I don't think the UAX, and specifically
the line you quoted, implies that. The line you quoted says that the
process is terminated when a rule matches and produces a boundary
status. In Table 1[1], the right-arrow (which is used in rule 4) is
listed as a boundary symbol, so I would argue that one should stop the
process and restart it from the first rule.

Furthermore, in the clarification to rule 4[2] it clearly states: "The 
main purpose of this rule is to always treat a grapheme cluster as a 
single character—that is, as if it were simply the first character of 
the cluster".
This again supports my understanding that:
X Extend Y
should behave exactly the same as
X Y
at the offset following the Extend character, which is exactly what
I'm arguing for.
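
The two readings can be made concrete with a minimal Python sketch. It is
only an illustration under stated assumptions, not code from UAX #29: the
property table is hand-written for the three code points in this thread,
only rules 3c and 4 are modelled (everything else is assumed to fall
through to WB999), and the names WB, collapse, wb3c_restart and
wb3c_single_pass are hypothetical.

```python
# Hypothetical, hand-written property table covering only the three
# code points discussed in this thread (not the real WordBreakProperty.txt).
WB = {
    "\u200D": "ZWJ",             # ZERO WIDTH JOINER
    "\u0308": "Extend",          # COMBINING DIAERESIS
    "\u2764": "Glue_After_Zwj",  # HEAVY BLACK HEART
}

SKIPPED = {"Extend", "Format", "ZWJ"}
AFTER_ZWJ = {"Glue_After_Zwj", "EBG"}


def collapse(props):
    """Rule 4 read as a rewrite: X (Extend | Format | ZWJ)* -> X."""
    out = []
    for p in props:
        if out and p in SKIPPED:
            continue  # absorbed into the preceding X
        out.append(p)
    return out


def wb3c_restart(props, i):
    """Reading argued for above: rewrite with rule 4 first, then test
    rule 3c against the collapsed left context. True means 'no break'."""
    left = collapse(props[:i])[-1]
    return left == "ZWJ" and props[i] in AFTER_ZWJ


def wb3c_single_pass(props, i):
    """Reading from the quoted reply: rule 3c is tested against the raw
    neighbour, before rule 4 has changed the context. True means 'no break'."""
    return props[i - 1] == "ZWJ" and props[i] in AFTER_ZWJ


props = [WB[c] for c in "\u200D\u0308\u2764"]  # ZWJ, Extend, Glue_After_Zwj
print(wb3c_restart(props, 2))      # True: no break before HEAVY BLACK HEART
print(wb3c_single_pass(props, 2))  # False: falls through to a break
```

For the plain sequence ZWJ GAZ (no Extend in between) both functions
agree that there is no break, so the disagreement is confined to the
case where rule 4 has to rewrite the context first.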

--
Tom

[1] http://www.unicode.org/reports/tr29/#Table_Boundary_Symbols
[2] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules

