Line-breaking algorithm: Unexpected break in multiple consecutive numeric prefixes

Andy Heninger andy.heninger at gmail.com
Mon Apr 11 16:31:55 CDT 2022


I tried the sequences you identified against ICU line breaking:

− 2212 MINUS SIGN  (line-breaking class PR)
‎$ 0024 DOLLAR SIGN (line-breaking class PR)
‎4 0034 DIGIT FOUR  (line-breaking class NU)
‎5 0035 DIGIT FIVE  (line-breaking class NU)

and

+ 002B PLUS SIGN  (line-breaking class PR)
‎$ 0024 DOLLAR SIGN (line-breaking class PR)
‎4 0034 DIGIT FOUR  (line-breaking class NU)
‎5 0035 DIGIT FIVE  (line-breaking class NU)

In both cases there was a boundary after the first character (− or +),
which is consistent with the UAX-14 rules. Whether this is desirable or not
is a separate question.

Perhaps Safari has done some additional tailoring of the rules in question.

For what it's worth, for Numbers, ICU uses the full regular expression

  ( PR | PO ) ? ( OP | HY ) ? NU ( NU | SY | IS ) * ( CL | CP ) ? ( PR | PO ) ?

(class names as defined in UAX #14, http://unicode.org/reports/tr14/)
instead of the short fragments of rules from LB24 and LB25. The main
difference is that a "number" sequence must contain at least one NU
character.
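
To make the effect on the two test strings concrete, here is a minimal
Python sketch of that expression applied at the class level. It is not
ICU's implementation: the class table is hand-written and covers only the
characters in this thread, and the names LB_CLASS, NUMBER and number_spans
are made up for illustration.

```python
import re

# Hypothetical, hand-written class table: only the characters discussed here.
LB_CLASS = {
    "\u2212": "PR",  # MINUS SIGN
    "+": "PR",       # PLUS SIGN
    "$": "PR",       # DOLLAR SIGN
    "4": "NU",
    "5": "NU",
}

# The number expression, written over two-letter class names.
# Each class token is followed by one space, so every token is 3 chars wide.
NUMBER = re.compile(
    r"(?:(?:PR|PO) )?"      # optional prefix or postfix symbol
    r"(?:(?:OP|HY) )?"      # optional opening punctuation or hyphen
    r"NU "                  # at least one NU is required
    r"(?:(?:NU|SY|IS) )*"   # further digits and numeric separators
    r"(?:(?:CL|CP) )?"      # optional closing punctuation
    r"(?:(?:PR|PO) )?"      # optional trailing prefix or postfix symbol
)

def number_spans(text):
    """Return (start, end) character index pairs matched as one number."""
    # Characters missing from the table get the placeholder class "XX".
    encoded = "".join(LB_CLASS.get(ch, "XX") + " " for ch in text)
    return [(m.start() // 3, m.end() // 3) for m in NUMBER.finditer(encoded)]

for s in ("\u2212$45", "+$45", "$45"):
    print(repr(s), number_spans(s))
```

In this sketch both four-character strings yield a single span covering
only "$45": the expression allows one leading PR or PO, so the first sign
falls outside the number, and nothing in the number rule prevents a
boundary after it.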

  -- Andy


On Fri, Apr 1, 2022 at 8:38 AM Ophir Lifshitz via Unicode <
unicode at corp.unicode.org> wrote:

> Hello again,
>
> I hope it's not an issue to re-ask this question I had from a while back.
>
> Thanks!
>
> On Sun, Sep 19, 2021 at 5:13 AM Ophir Lifshitz <me at ophir.li> wrote:
>
>> I have a question about the line-breaking algorithm. Apologies if it
>> is uninformed or if this is the wrong venue.
>>
>> I recently experienced an unexpected line break[1] after the first
>> character in the following sequence[2]:
>>
>> ‎− 2212 MINUS SIGN  (line-breaking class PR)
>> ‎$ 0024 DOLLAR SIGN (line-breaking class PR)
>> ‎4 0034 DIGIT FOUR  (line-breaking class NU)
>> ‎5 0035 DIGIT FIVE  (line-breaking class NU)
>>
>> (However, if the first character is replaced by 002B PLUS SIGN (also
>> class PR), a line break does not occur.)
>>
>> I also noticed that there is no "PR × PR" rule in (e.g.) LB25.
>>
>> Is this intended, perhaps an oversight, or is it up to implementation
>> discretion i.e. "tailored"?
>>
>> If it is an oversight, what is the process for correcting it or filing
>> a bug? It is hard to find that information on the Unicode website.
>>
>> Thank you.
>>
>>
>> [1] The line break appeared in Chrome 93 and Safari 13.1 on Mac 10.13,
>> but not in Firefox 85.
>> I tested by navigating in my browser to the following data URIs:
>>
>> data:text/html;charset=utf-8,<p%20style="width:1px;">%E2%88%92$45</p>
>> data:text/html;charset=utf-8,<p%20style="width:1px;">%2B$45</p>
>>
>> [2] This sequence is intended to behave as a single unit (word), and
>> refers to a price discount in the original text.
>>
>