Question regarding TR-29

prospero prospero at cyber-wizard.com
Wed Dec 6 17:44:17 CST 2023


unicode.org/reports/tr29
 
The WB4 rule for word breaks:
 
> Ignore Format and Extend characters, except after sot, CR, LF, and Newline. (See Section 6.2, Replacing Ignore Rules[https://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules].)
> This also has the effect of: Any × (Format | Extend | ZWJ)

seems incomplete and ambiguous. First, the "except after" exception also needs to cover WSegSpace; otherwise the tests fail. Second, the interaction with WB3c seems to be contradicted by the tests, e.g., the one on line 1158:

÷ 200D × 0308 ÷ 231A ÷	#  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] WATCH (ExtPict) ÷ [0.3]

Ignoring the 0308 (Extend_FE) should yield ZWJ_FE + ExtPict, which WB3c says should not break, yet the test requires a break. If the tests are dispositive, could TR-29 be clarified to reflect them?
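
To make the two readings concrete, here is a minimal Python sketch. Everything in it is my own illustration, not text from TR-29: the three-entry property table, the helper names collapse and breaks_before_last, and the wb3c_after_collapse flag are all hypothetical. It models only WB3c and the "ignore" rewrite, leaves out the CR/LF/Newline exception, and treats anything WB3c does not join as a break (WB999).

```python
# Toy property table covering only the three code points in the test line.
WORD_BREAK = {0x200D: "ZWJ", 0x0308: "Extend", 0x231A: "ExtPict"}
IGNORABLE = {"Extend", "Format", "ZWJ"}


def collapse(props):
    """WB4-style rewrite: drop ignorable items that follow another item.

    The first item is always kept, which stands in for "except after sot".
    """
    out = []
    for p in props:
        if out and p in IGNORABLE:
            continue  # absorbed into the preceding item
        out.append(p)
    return out


def breaks_before_last(cps, wb3c_after_collapse):
    """Return True if there is a break before the last code point."""
    props = [WORD_BREAK[cp] for cp in cps]
    if wb3c_after_collapse:
        # Reading A: WB3c is evaluated on the collapsed sequence.
        seq = collapse(props)
    else:
        # Reading B: WB3c sees only literally adjacent code points.
        seq = props
    joined = len(seq) >= 2 and seq[-2] == "ZWJ" and seq[-1] == "ExtPict"
    return not joined  # nothing else is modeled, so fall through to WB999


test_line = [0x200D, 0x0308, 0x231A]
print(breaks_before_last(test_line, wb3c_after_collapse=True))   # reading A
print(breaks_before_last(test_line, wb3c_after_collapse=False))  # reading B
```

Reading A (the one I took from the wording of WB4) gives no break before 231A. Reading B gives a break, which is what the test on line 1158 expects. If reading B is the intended one, it would help for the text of WB4 to say so explicitly.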


