Question about the Sentence_Break property

Konstantin Ritt ritt.ks at gmail.com
Fri Feb 20 20:50:53 CST 2015


When UAX9 mentions a paragraph level, it says:

> Paragraphs are divided by the Paragraph Separator or appropriate Newline
> Function (for guidelines on the handling of CR, LF, and CRLF, see *Section
> 4.4, Directionality*, and *Section 5.8, Newline Guidelines* of [Unicode
> <http://www.unicode.org/reports/tr41/tr41-15.html#Unicode>]). Paragraphs
> may also be determined by higher-level protocols: for example, the text in
> two different cells of a table will be in different paragraphs.
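
As a rough illustration of that definition (a sketch only, not text from
UAX #9): the function name split_bidi_paragraphs is hypothetical, and I am
assuming that "Paragraph Separator or appropriate Newline Function" can be
approximated by the characters whose Bidi_Class is B, with CRLF treated as a
single separator. Each separator is kept with the paragraph it ends.

```python
import unicodedata


def split_bidi_paragraphs(text):
    """Split text at characters of Bidi_Class B (hypothetical helper).

    Assumption: Bidi_Class B stands in for "Paragraph Separator or
    appropriate Newline Function". A CR immediately followed by LF is
    treated as one separator, and each separator stays attached to the
    paragraph that precedes it.
    """
    paragraphs = []
    start = 0
    i = 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            end = i + 1
            # Treat CRLF as a single paragraph separator.
            if text[i] == "\r" and end < len(text) and text[end] == "\n":
                end += 1
            paragraphs.append(text[start:end])
            start = end
            i = end
        else:
            i += 1
    # Trailing text with no final separator is still a paragraph.
    if start < len(text):
        paragraphs.append(text[start:])
    return paragraphs


if __name__ == "__main__":
    sample = "first\r\nsecond\u2029third\u2028still third"
    for paragraph in split_bidi_paragraphs(sample):
        print(repr(paragraph))
```

Note that under this assumption a single LF already ends a paragraph, while
U+2028 LINE SEPARATOR (Bidi_Class WS) does not. Higher-level protocols, such
as table cells, are outside the scope of the sketch.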

Regards,
Konstantin

2015-02-21 3:56 GMT+04:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> 2015-02-20 6:14 GMT+01:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
>
>> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
>> One thing that is missing is mention of the convention that a single
>> newline character (or CRLF pair) is a line break whereas a doubled
>> newline character denotes a paragraph break.
>>
>
> In that case CR or LF characters alone are not "paragraph separators" by
> themselves unless they are grouped together. Like NEL, they should just be
> considered as line separators and the terminology used in UAX 29 rule SB4
> is effectively incorrect if what matters here is just the linebreak
> property. And also in that case, the SB4 rule should effectively include
> NEL (from the C1 subset).
>
> But as SB4 is only related to sentence breaking, it would be a problem
> because simple linebreaks are used extremely frequently in the middle of
> sentences.
>
> What the Sentence break algorithm should say is that there should first be
> a preprocessing step separating line breaks and paragraph breaks (creating
> custom entities, similar to collation elements, but encoded internally with
> a code point outside the standard space) that the rule SB4 would use
> instead of "Sep | CR | LF". That custom entity should be "Sep" but without
> the rule defining it, as there are various ways to represent paragraph
> breaks.