Possibly incorrect line break tests?
Kent Karlsson
kent.b.karlsson at outlook.com
Wed Sep 3 14:55:33 CDT 2025
Speaking of line breaking... I have not done an analysis
of the rules, but something is wrong when it comes to quote
marks (QU) and line breaking, at least in some common
applications. And it is wrong in a very annoying way,
and it happens very often (especially on 'smartphones'
where the line length is relatively short).
1) It is common that an automatic line break is inserted
between an open quote mark (which varies by language) and
the quoted text (no space after the (open) quote mark).
2) It is not uncommon to see an automatic line break between
the quoted text and an end quote mark (varies by language).
3) This never happened in the "naive old days" when (almost)
only spaces guided where automatic line breaks were inserted.
I know that for French it is common to have a space after
a begin quote mark, and before an end quote mark. Maybe those
should be NARROW NO-BREAK SPACEs (U+202F)... And yes, in some
scripts one does not use space inside phrases/sentences at all.
It is still quite annoying to see inappropriate automatic
line breaks between a (begin) quote mark and a letter/symbol
or between a letter/symbol and an (end) quote mark.
I can't at this time point to a specific rule to change/fix...
(or if it is just some implementations that are at fault).
Kent Karlsson
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Henry via Unicode
Sent: Wednesday, September 3, 2025 6:42 PM
To: unicode at corp.unicode.org
Subject: Possibly incorrect line break tests?
In LineBreakTest.txt, there are test cases that indicate there should *not* be a break after U+0308; however, the LB rule cited does not appear to apply, and it would appear that there *should* be a break. For example:
× 000A ÷ 0308 × 23E9 ÷ # × [0.3] <LINE FEED (LF)> (LF_NotEastAsian) ÷ [5.03] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [28.0] BLACK RIGHT-POINTING DOUBLE TRIANGLE (AL) ÷ [0.3]
LB28 states "Do not break between alphabetics (“at”)" with the following break rule:
(AL | HL) × (AL | HL)
However, in the aforementioned test case, neither U+000A nor U+0308 has break class AL or HL (they have break class LF and CM). Yet rule 28.0 is cited as the reason for not breaking between U+0308 and U+23E9. It would appear that there _should_ be a break here.
Likewise, for the test:
× 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷ [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN (PR_NotEastAsian) ÷ [0.3]
LB24 states "Do not break between numeric prefix/postfix and letters, or between letters and prefix/postfix" with the following break rules:
(PR | PO) × (AL | HL)
(AL | HL) × (PR | PO)
However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they have break class ZW and CM). Yet rule 24.03 is cited as the reason for not breaking between U+0308 and U+0024. It would appear that there _should_ be a break here.
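To make the comparison in the two examples above concrete, here is a minimal sketch in Python. It parses the data part of a LineBreakTest.txt line and checks whether the pair around the disputed position matches the literal LB28 or LB24 pattern. The names (RAW_CLASS, parse_test_line, lb28_literal, lb24_literal) are made up for illustration, and the classes are simply copied from the comments on the two test lines rather than looked up in LineBreak.txt. It looks only at the raw classes of the two adjacent code points and deliberately ignores every other rule, including LB9 and LB10, which deal with combining marks, so it is not a line breaking implementation.

```python
# Raw break classes, copied from the comments of the two test lines above
# (an assumption for this sketch; not looked up in LineBreak.txt).
RAW_CLASS = {0x000A: "LF", 0x200B: "ZW", 0x0308: "CM", 0x23E9: "AL", 0x0024: "PR"}

LETTER = {"AL", "HL"}
PREPOST = {"PR", "PO"}


def parse_test_line(line):
    """Split the data part of a test line into code points and break marks.

    marks[i] is the mark in front of cps[i]; the last mark follows the
    final code point.
    """
    tokens = line.split("#", 1)[0].split()
    marks = tokens[0::2]                       # '×' = no break, '÷' = break allowed
    cps = [int(tok, 16) for tok in tokens[1::2]]
    return cps, marks


def lb28_literal(left, right):
    """(AL | HL) × (AL | HL), applied to raw classes only."""
    return left in LETTER and right in LETTER


def lb24_literal(left, right):
    """(PR | PO) × (AL | HL) and (AL | HL) × (PR | PO), raw classes only."""
    return (left in PREPOST and right in LETTER) or (
        left in LETTER and right in PREPOST
    )


CASES = [
    ("× 000A ÷ 0308 × 23E9 ÷", lb28_literal),
    ("× 200B ÷ 0308 × 0024 ÷", lb24_literal),
]

for line, rule in CASES:
    cps, marks = parse_test_line(line)
    left, right = RAW_CLASS[cps[1]], RAW_CLASS[cps[2]]
    # marks[2] is the expected result between U+0308 and the last code point.
    print(f"{left} {marks[2]} {right}: literal pattern matches = {rule(left, right)}")
```

For both lines the test file expects × at that position, while the literal pattern check on the raw classes reports no match, which is the mismatch described above.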
In total, I have collected ~80 test cases from LineBreakTest.txt that exhibit this same pattern.
I'm wondering if these test cases were meant to have a hyphen character because then they'd respect rule LB20a which states "Do not break after a word-initial hyphen". This rule has the definition:
( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | [\u2010] ) × AL
So, for example, test case:
× 000A ÷ 0308 × 23E9 ÷ # LF ÷ CM × AL (incorrect?)
would become:
× 000A ÷ 0308 ÷ 002D × 23E9 ÷ # LF ÷ CM ÷ HY × AL (correct)