Question about the Sentence_Break property
ritt.ks at gmail.com
Fri Feb 20 20:50:53 CST 2015
When UAX9 mentions a paragraph level, it says:
> Paragraphs are divided by the Paragraph Separator or appropriate Newline
> Function (for guidelines on the handling of CR, LF, and CRLF, see *Section
> 4.4, Directionality*, and *Section 5.8, Newline Guidelines* of [Unicode]).
> Paragraphs may also be determined by higher-level protocols: for example,
> the text in two different cells of a table will be in different paragraphs.
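
To make the UAX9 reading concrete, here is a rough sketch (not taken from the spec) that splits text wherever a character has Bidi_Class B, which is how the bidi algorithm identifies paragraph separators. The function name split_bidi_paragraphs is my own, and for simplicity the separators are dropped rather than kept with the preceding paragraph; CRLF is counted as one separator.

```python
import unicodedata


def split_bidi_paragraphs(text):
    """Split text at characters whose Bidi_Class is B (paragraph separator).

    Hypothetical helper for illustration only. A CRLF pair is treated as a
    single separator, and separators are not included in the output.
    """
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            # Swallow the LF of a CRLF pair so it does not open an empty paragraph.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    paragraphs.append("".join(current))
    return paragraphs
```

Under this reading CR, LF, NEL (U+0085) and U+2029 each end a paragraph on their own, while U+2028 (Line Separator, Bidi_Class WS) does not.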
2015-02-21 3:56 GMT+04:00 Philippe Verdy <verdy_p at wanadoo.fr>:
> 2015-02-20 6:14 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
>> One thing that is missing is mention of the convention that a single
>> newline character (or CRLF pair) is a line break whereas a doubled
>> newline character denotes a paragraph break.
> In that case CR or LF characters alone are not "paragraph separators" by
> themselves unless they are grouped together. Like NEL, they should just be
> considered as line separators and the terminology used in UAX 29 rule SB4
> is effectively incorrect if what matters here is just the linebreak
> property. And also in that case, the SB4 rule should effectively include
> NEL (from the C1 subset).
> But as SB4 is only related to sentence breaking, it would be a problem
> because simple linebreaks are used extremely frequently in the middle of
> What the Sentence break algorithm should say is that there should first be
> a preprocessing step separating line breaks and paragraph breaks (creating
> custom entities, similar to collation elements, but encoded internally with
> a code point outside the standard space), that the rule SB4 would use
> instead of "Sep | CR | LF". That custom entity should be "Sep" but without
> the rule defining it, as there are various ways to represent paragraph
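
For what it is worth, the preprocessing step proposed above could look roughly like the sketch below. This is only an illustration of the idea, not anything UAX #29 specifies: the names Break and tokenize_breaks are hypothetical, enum members stand in for the "code point outside the standard space", and the rule "one newline function is a line break, two or more in a row are a paragraph break" is the convention mentioned earlier in the thread, taken here as an assumption. NEL is counted as a newline function, as suggested above.

```python
import re
from enum import Enum


class Break(Enum):
    """Hypothetical internal entities that a modified SB4 could consume."""
    LINE = "line"
    PARAGRAPH = "paragraph"


# One newline function: CRLF is listed first so the pair counts once.
_NEWLINE = r"(?:\r\n|\r|\n|\x85)"
# An explicit U+2029 or U+2028, or a run of newline functions.
_BREAK_RUN = re.compile(rf"\u2029|\u2028|{_NEWLINE}+")
_ONE_NEWLINE = re.compile(_NEWLINE)


def tokenize_breaks(text):
    """Return a list mixing text chunks (str) and Break entities."""
    tokens = []
    pos = 0
    for match in _BREAK_RUN.finditer(text):
        if match.start() > pos:
            tokens.append(text[pos:match.start()])
        run = match.group()
        if run == "\u2029":
            tokens.append(Break.PARAGRAPH)
        elif run == "\u2028":
            tokens.append(Break.LINE)
        else:
            # Assumed convention: a doubled newline denotes a paragraph break.
            count = len(_ONE_NEWLINE.findall(run))
            tokens.append(Break.PARAGRAPH if count >= 2 else Break.LINE)
        pos = match.end()
    if pos < len(text):
        tokens.append(text[pos:])
    return tokens
```

With such a token stream, a sentence breaker following this proposal would force a break only after Break.PARAGRAPH and treat Break.LINE like ordinary whitespace, so hard-wrapped text would not be split in the middle of a sentence.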