Possibly incorrect line break tests?
thin.crew1671 at railgunlabs.com
Wed Sep 3 11:41:40 CDT 2025
In LineBreakTest.txt, there are test cases that indicate there should *not* be a break after U+0308. However, the LB rule cited does not appear to apply, and it would appear that there *should* be a break. For example:
× 000A ÷ 0308 × 23E9 ÷ # × [0.3] <LINE FEED (LF)> (LF_NotEastAsian) ÷ [5.03] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [28.0] BLACK RIGHT-POINTING DOUBLE TRIANGLE (AL) ÷ [0.3]
LB28 states "Do not break between alphabetics (“at”)" with the following break rule:
(AL | HL) × (AL | HL)
However, in the aforementioned test case, neither U+000A nor U+0308 has break class AL or HL (they have break class LF and CM). Yet rule 28.0 is cited as the reason for not breaking between U+0308 and U+23E9. It would appear that there _should_ be a break here.
Likewise, for the test:
× 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷ [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN (PR_NotEastAsian) ÷ [0.3]
LB24 states "Do not break between numeric prefix/postfix and letters, or between letters and prefix/postfix" with the following break rules:
(PR | PO) × (AL | HL)
(AL | HL) × (PR | PO)
However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they have break class ZW and CM). Yet rule 24.03 is cited as the reason for not breaking between U+0308 and U+0024. It would appear that there _should_ be a break here.
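The literal reading applied to both examples above can be sketched in a few lines of Python. This is only an illustration of that reading, not an implementation of the line breaking algorithm: the helper names `lb28_matches` and `lb24_matches` are made up for this message, and the check compares the raw break classes exactly as the two rules are quoted above, without modelling any other rule that might change how a character is treated before these rules are reached.

```python
# Class sets taken from the two rules as quoted in this message.
LETTERS = {"AL", "HL"}
AFFIXES = {"PR", "PO"}

def lb28_matches(before, after):
    """Literal LB28: (AL | HL) × (AL | HL)."""
    return before in LETTERS and after in LETTERS

def lb24_matches(before, after):
    """Literal LB24: (PR | PO) × (AL | HL) and (AL | HL) × (PR | PO)."""
    return (before in AFFIXES and after in LETTERS) or (
        before in LETTERS and after in AFFIXES
    )

# Raw class pairs from the two test cases discussed above.
print(lb28_matches("CM", "AL"))  # U+0308 followed by U+23E9
print(lb24_matches("CM", "PR"))  # U+0308 followed by U+0024
```

Both calls print False, which is the mismatch described above: on the raw classes alone, neither cited rule matches the pair around the position marked ×.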
In total, I have collected ~80 test cases from LineBreakTest.txt that exhibit this same pattern.
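For anyone who wants to reproduce a similar list, here is a rough sketch of one way to scan for such cases. It is not necessarily how the count above was obtained. The function names, the `ALIASES` table, and the matching heuristic are all assumptions: a line is flagged when a character labelled CM sits directly after a ÷ and directly before a ×, and the class of each character is read from the last parenthesized label before the next marker in the comment, as in the two lines quoted above.

```python
import re

# Assumption: "CM1" in the comment labels stands for class CM.
ALIASES = {"CM1": "CM"}

def classes_and_markers(line):
    """Return (break classes, markers) for one LineBreakTest.txt data line."""
    data, _, comment = line.partition("#")
    markers = data.split()[0::2]  # "×" (no break) or "÷" (break)
    # Last parenthesized label before each marker, e.g. "(LF_NotEastAsian) ÷".
    labels = re.findall(r"\(([^()]*)\)\s*(?=[×÷])", comment)
    bases = [label.split("_")[0] for label in labels]
    return [ALIASES.get(b, b) for b in bases], markers

def isolated_cm_no_break(lines):
    """Collect lines where a CM follows a ÷ and is itself followed by a ×."""
    hits = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment-only lines
        classes, markers = classes_and_markers(line)
        if len(markers) != len(classes) + 1:
            continue  # comment did not parse as expected; skip the line
        for i, cls in enumerate(classes):
            # markers[i] precedes character i; markers[i + 1] follows it.
            if cls == "CM" and markers[i] == "÷" and markers[i + 1] == "×":
                hits.append(line)
                break
    return hits
```

Passing the open file object for LineBreakTest.txt (read as UTF-8) to `isolated_cm_no_break` should return the matching lines. The heuristic may flag more or fewer lines than the ~80 mentioned above.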
I'm wondering if these test cases were meant to include a hyphen character, because then they would fall under rule LB20a, which states "Do not break after a word-initial hyphen". This rule has the definition:
( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | [\u2010] ) × AL
So, for example, test case:
× 000A ÷ 0308 × 23E9 ÷ # LF ÷ CM × AL (incorrect?)
would become:
× 000A ÷ 0308 ÷ 002D × 23E9 ÷ # LF ÷ CM ÷ HY × AL (correct)