Question about the Sentence_Break property

Philippe Verdy verdy_p at wanadoo.fr
Fri Feb 20 17:56:14 CST 2015


2015-02-20 6:14 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
> One thing that is missing is mention of the convention that a single
> newline character (or CRLF pair) is a line break whereas a doubled
> newline character denotes a paragraph break.
>

In that case, CR and LF characters alone are not "paragraph separators" by
themselves unless they are grouped together. Like NEL, a single one should
just be considered a line separator, and the terminology used in UAX 29
rule SB4 is effectively incorrect if what matters here is just the
linebreak property. Also, in that case, rule SB4 should effectively
include NEL (from the C1 subset).

But as SB4 is only related to sentence breaking, treating every single
linebreak as a sentence boundary would be a problem, because simple
linebreaks are used extremely frequently in the middle of sentences.

What the sentence break algorithm should say is that there should first be
a preprocessing step separating line breaks from paragraph breaks, creating
custom entities (similar to collation elements, but encoded internally with
a code point outside the standard space) that rule SB4 would use instead of
"Sep | CR | LF". That custom entity should be "Sep", but without a rule
defining it, as there are various ways to represent paragraph breaks.
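
A minimal sketch of such a preprocessing step, in Python, might look like
the following. It is only an illustration under stated assumptions, not
anything UAX 29 specifies: the names Brk and preprocess are hypothetical,
an enum member stands in for the internal out-of-range code point (a Python
string cannot hold one), a run of two or more newline units is taken as a
paragraph break, and U+2028/U+2029 are assumed to be explicit line and
paragraph breaks.

```python
import re
from enum import Enum


class Brk(Enum):
    """Hypothetical custom entities produced by the preprocessing step."""
    LINE = "line"   # soft line break: must not end a sentence by itself
    PARA = "para"   # paragraph break: what rule SB4 would match instead


# One newline unit: a CRLF pair, a lone CR, a lone LF, or NEL (U+0085).
_NEWLINE = r"(?:\r\n|\r|\n|\x85)"
# A run of newline units, or an explicit LS (U+2028) / PS (U+2029).
_TOKEN = re.compile(rf"({_NEWLINE}+|\u2028|\u2029)")


def preprocess(text):
    """Split text into string chunks and Brk entities.

    Assumption: a single newline unit is a line break, and two or more
    consecutive units form a paragraph break.
    """
    out = []
    pos = 0
    for m in _TOKEN.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])
        run = m.group(0)
        if run == "\u2029":
            out.append(Brk.PARA)
        elif run == "\u2028":
            out.append(Brk.LINE)
        else:
            units = len(re.findall(_NEWLINE, run))
            out.append(Brk.PARA if units >= 2 else Brk.LINE)
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])
    return out


if __name__ == "__main__":
    sample = "One line\nsame sentence.\r\n\r\nNew paragraph."
    for item in preprocess(sample):
        print(repr(item))
```

A sentence breaker built on top of this would apply SB4 only to Brk.PARA
and treat Brk.LINE like ordinary whitespace, so a hard-wrapped sentence is
not cut at each line end.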