Possibly incorrect line break tests?
Robin Leroy
egg.robin.leroy at gmail.com
Wed Sep 3 14:23:15 CDT 2025
On Wed, 3 Sept 2025 at 20:10, Henry via Unicode <unicode at corp.unicode.org>
wrote:
> × 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷
> [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN
> (PR_NotEastAsian) ÷ [0.3]
>
> LB24 states "Do not break between alphabetics (“at”)" with the following
> break rule:
>
> (PR | PO) × (AL | HL)
> (AL | HL) × (PR | PO)
>
> However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they
> have break class ZW and CM).
>
You missed rule LB10.
LB9: Treat X (CM | ZWJ)* as if it were X, where X is any line break class
except BK, CR, LF, NL, SP, or ZW.
LB10: Treat any remaining CM or ZWJ as if it had the properties of U+0041 A
LATIN CAPITAL LETTER A, that is, Line_Break=AL, General_Category=Lu,
East_Asian_Width=Na, Extended_Pictographic=N.
U+0308 is CM.
U+200B is ZW, so LB9 does not apply to the CM that follows it. Therefore,
LB10 applies, and that CM is treated as AL by the subsequent rules.
LB24 therefore applies in its (AL | HL) × (PR | PO) form, so there is no
break before U+0024.
The same holds for the other example you cite: a CM becomes AL.
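As a rough illustration of the reasoning above, here is a minimal Python sketch of how LB9 and LB10 resolve the classes in this test case before LB24 is consulted. The table LINE_BREAK_CLASS and the functions resolve_combining_marks and lb24_no_break are hypothetical names made up for this message; the table is hand-written for the three characters involved rather than read from LineBreak.txt. Giving each absorbed CM the class of its base is a simplification of "treat X (CM | ZWJ)* as if it were X", and no other rules are modelled.

```python
# Hypothetical, hand-written table covering only the characters in this thread.
LINE_BREAK_CLASS = {
    0x200B: "ZW",  # ZERO WIDTH SPACE
    0x0308: "CM",  # COMBINING DIAERESIS
    0x0024: "PR",  # DOLLAR SIGN
}

# Classes after which LB9 does not absorb a following CM or ZWJ.
LB9_EXCLUDED = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def resolve_combining_marks(classes):
    """Return the effective class at each position after LB9 and LB10."""
    resolved = []
    base = None  # class of the current X in "X (CM | ZWJ)*", if any
    for cls in classes:
        if cls in ("CM", "ZWJ"):
            if base is not None and base not in LB9_EXCLUDED:
                # LB9: the mark behaves like the base it attaches to.
                resolved.append(base)
            else:
                # LB10: a remaining CM or ZWJ is treated as AL, and it
                # becomes the base for any marks that follow it.
                base = "AL"
                resolved.append("AL")
        else:
            base = cls
            resolved.append(cls)
    return resolved


def lb24_no_break(before, after):
    """True if LB24 forbids a break between the two effective classes."""
    letters = {"AL", "HL"}
    affixes = {"PR", "PO"}
    return (before in affixes and after in letters) or (
        before in letters and after in affixes
    )


classes = [LINE_BREAK_CLASS[cp] for cp in (0x200B, 0x0308, 0x0024)]
resolved = resolve_combining_marks(classes)
print(resolved)                                 # ['ZW', 'AL', 'PR']
print(lb24_no_break(resolved[1], resolved[2]))  # True
```

With the ZW in front, the CM falls through to LB10 and is treated as AL, which is why LB24 yields × between 0308 and 0024 in the test.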
Best regards,
Robin Leroy