Question regarding TR-29
prospero
prospero at cyber-wizard.com
Wed Dec 6 17:44:17 CST 2023
unicode.org/reports/tr29
The WB4 rule for word breaks:
> Ignore Format and Extend characters, except after sot, CR, LF, and Newline. (See Section 6.2, Replacing Ignore Rules [https://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules].)
> This also has the effect of: Any × (Format | Extend | ZWJ)
seems incomplete and ambiguous. First, the "except after" clause would also need to apply to WSegSpace; otherwise tests fail. Second, the handling of WB3c seems to be contradicted by the tests, e.g., the one on line 1158:
÷ 200D × 0308 ÷ 231A ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] WATCH (ExtPict) ÷ [0.3]
This test seems to contradict the rule: ignoring the 0308 (Extend_FE) should leave ZWJ_FE followed directly by ExtPict, which under WB3c should not break, yet the test requires a break there. If the tests are dispositive, could TR-29 be clarified to reflect them?
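
To make the reading concrete, here is a minimal Python sketch that parses the test line quoted above and sets the "ignore the Extend first" interpretation against what the test expects. parse_break_test_line is a hypothetical helper written for this message, not part of any Unicode tooling, and the filtering step reflects only my reading of WB4, not a claim about how the rules are meant to be applied.

```python
# Illustrative sketch only: parse_break_test_line is a hypothetical helper,
# not part of any Unicode tooling.
BREAK = "÷"      # the test expects a boundary at this position
NO_BREAK = "×"   # the test expects no boundary at this position


def parse_break_test_line(line):
    """Return (code_points, breaks) for one test line.

    breaks[i] is True when the test expects a boundary before code_points[i];
    the final entry is the boundary at end of text.
    """
    data = line.split("#", 1)[0].split()   # drop the trailing comment
    code_points = [int(tok, 16) for tok in data if tok not in (BREAK, NO_BREAK)]
    breaks = [tok == BREAK for tok in data if tok in (BREAK, NO_BREAK)]
    return code_points, breaks


# Data part of the test line; the comment after '#' is shortened here.
line = "÷ 200D × 0308 ÷ 231A ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) ..."
code_points, breaks = parse_break_test_line(line)

# Assumed reading of WB4: drop the Extend (0308) first, then see what is left.
remaining = [cp for cp in code_points if cp != 0x0308]
print(remaining == [0x200D, 0x231A])   # True: ZWJ directly followed by ExtPict
print(breaks[2])                       # True: the test expects a break before 231A
```

Under that reading the remaining pair is exactly ZWJ followed by ExtPict, yet the test marks a boundary before 231A, which is the discrepancy I am asking about.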