9.0.0 segmentation and line breaks on the empty string

Andy Heninger andy.heninger at gmail.com
Mon Jun 20 17:32:12 CDT 2016

> I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on
> the empty string while UAX14 still does

This is an interesting edge case.

My reading of UAX 14 is that an empty string would not produce a break.
Both "sot" and "eot" would be true, so LB2,
    sot ×
would match and apply, and that would be the end of the story. LB3 would
never be applied because LB2 would match first.
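
To make that reading concrete, here is a minimal sketch that models only LB2 and LB3 and assumes the rules are tried in order with the first match deciding the outcome. The function name first_matching_rule is hypothetical, and this illustrates my interpretation rather than what any particular implementation does.

```python
def first_matching_rule(text, pos):
    """Return (rule, action) for the first of LB2/LB3 that matches at pos.

    Only LB2 (sot ×) and LB3 (! eot) are modeled; every other UAX 14 rule
    is ignored. Returns (None, None) when neither of the two applies.
    """
    at_sot = pos == 0          # position is at start of text
    at_eot = pos == len(text)  # position is at end of text

    # Rules in specification order; the first one that matches wins.
    rules = [
        ("LB2", at_sot, "×"),  # never break at the start of text
        ("LB3", at_eot, "!"),  # always break at the end of text
    ]
    for name, matches, action in rules:
        if matches:
            return name, action
    return None, None


# The empty string has a single position, 0, which is both sot and eot.
print(first_matching_rule("", 0))
# A non-empty string reaches LB3 at its final position.
print(first_matching_rule("a", 1))
```

For the empty string, LB2 matches before LB3 is ever considered, so no break is reported. For "a", position 1 is not sot, so LB3 applies.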

As to mandating a hard break at the end of text (LB3), I'm not at all sure
this was a good idea. It seems like the breaking behavior would depend on
the external context of the text, about which the LB algorithm knows
nothing. It's different from having text that ends with a LF or other
hard-break character. But I'm also disinclined to suggest changes in this
area; the possibility of breaking applications that have come to expect the
existing behavior seems real, and it's all edge cases.

  -- Andy

On Sun, Jun 19, 2016 at 9:34 AM, Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:

> On Sunday, 19 June 2016 at 16:57, Karl Williamson wrote:
> > Yes. Use http://www.unicode.org/reporting.html to make an error report.
> Thanks, did that.
> Best,
> Daniel