Question regarding TR-29
Manish Goregaokar
manishsmail at gmail.com
Thu Dec 7 17:56:58 CST 2023
Hi!
I think a crucial thing to note about interpreting these rules is that they
must be applied in order: WB4 can only be applied after all of the WB3 rules
(WB3 through WB3d), and so on. In general, the logical model is that each
rule is applied to the entire input string before moving on to the next
rule. In practice, implementations tend to come up with a way of doing this
in one or a handful of loops by retaining some careful state.
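The sequences `WSegSpace Format+ WSegSpace` and `ZWJ Extend Ext_Pict` won't
have do-not-breaks generated by WB3d/WB3c, because those rules apply before
the "ignore Extend/Format" rule (WB4). At that point the intervening Format
or Extend characters are still visible, so neither rule matches.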
Since no rules after WB4 mention Extended_Pictographic or WSegSpace, WB4
does not need to try to include them in the "except" clause.
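To make the ordering concrete, here is a minimal sketch of the "one rule per pass" model in Python. It is only an illustration, not how any real implementation does it: the function name `word_breaks` is made up, the input is a list of Word_Break property names rather than code points, and only WB3c, WB3d, the "Any × (Format | Extend | ZWJ)" effect of WB4, and WB999 are modelled. All rules between WB4 and WB999, and the sot/CR/LF/Newline exception, are left out.

```python
def word_breaks(props):
    """Return one decision per interior boundary: True = break, False = no break.

    `props` is a list of Word_Break property names (a simplification; a real
    implementation would look these up from code points).
    """
    n = len(props)
    decision = [None] * max(n - 1, 0)  # None = no rule has decided yet

    def apply(rule):
        # One full pass over the string for a single rule, so an earlier
        # rule always gets to decide a boundary before a later one.
        for i in range(n - 1):
            if decision[i] is None:
                result = rule(props[i], props[i + 1])
                if result is not None:
                    decision[i] = result

    # WB3c: ZWJ x ExtPict. Nothing is ignored yet, so this needs literal adjacency.
    apply(lambda a, b: False if (a, b) == ("ZWJ", "ExtPict") else None)
    # WB3d: WSegSpace x WSegSpace, again on literally adjacent characters.
    apply(lambda a, b: False if (a, b) == ("WSegSpace", "WSegSpace") else None)
    # WB4 (only its side effect is modelled): Any x (Format | Extend | ZWJ).
    apply(lambda a, b: False if b in ("Format", "Extend", "ZWJ") else None)
    # WB999: Any / Any. Every boundary still undecided becomes a break.
    apply(lambda a, b: True)
    return decision


print(word_breaks(["ZWJ", "Extend", "ExtPict"]))         # [False, True]
print(word_breaks(["ZWJ", "ExtPict"]))                   # [False]
print(word_breaks(["WSegSpace", "Format", "WSegSpace"])) # [False, True]
```

The first call corresponds to the quoted test line: WB3c never sees `ZWJ ExtPict` side by side, WB4 then glues the Extend onto the ZWJ, and the boundary before the ExtPict falls through to WB999, giving 200D × 0308 ÷ 231A.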
Hope this helps.
Thanks,
-Manish
On Wed, Dec 6, 2023, 4:17 PM prospero via Unicode <unicode at corp.unicode.org>
wrote:
>
> unicode.org/reports/tr29
>
> The WB4 rule for word breaks:
>
> > Ignore Format and Extend characters, except after sot, CR, LF, and
> > Newline. (See Section 6.2, Replacing Ignore Rules
> > [https://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules].)
> > This also has the effect of: Any × (Format | Extend | ZWJ)
>
> seems incomplete and ambiguous. First, the "except after" part needs to
> apply to WSegSpace also, otherwise tests fail. And the handling of WB3c
> seems contradicted by the tests, e.g., the one on line 1158:
>
> ÷ 200D × 0308 ÷ 231A ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] WATCH (ExtPict) ÷ [0.3]
>
> seems to contradict it, since ignoring the 0308 (Extend_FE) should yield a
> ZWJ_FE + ExtPict, which should not break, but the test requires a break. If
> the tests are dispositive, could TR-29 be better clarified to reflect them?
>
>