Question about the Sentence_Break property
public at khwilliamson.com
Sat Feb 21 13:10:14 CST 2015
On 02/20/2015 04:56 PM, Philippe Verdy wrote:
> 2015-02-20 6:14 GMT+01:00 Richard Wordingham
> <richard.wordingham at ntlworld.com>:
> TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8.
> One thing that is missing is mention of the convention that a single
> newline character (or CRLF pair) is a line break whereas a doubled
> newline character denotes a paragraph break.
> In that case CR or LF characters alone are not "paragraph separators" by
> themselves unless they are grouped together. Like NEL, they should just
> be considered as line separators and the terminology used in UAX 29 rule
> SB4 is effectively incorrect if what matters here is just the linebreak
> property. And also in that case, the SB4 rule should effectively include
> NEL (from the C1 subset).
> But as SB4 is only related to sentence breaking, it would be a problem
> because simple linebreaks are used extremely frequently in the middle of
> sentences.
> What the sentence break algorithm should say is that there should first
> be a preprocessing step separating line breaks and paragraph breaks
> (creating custom entities, similar to collation elements but encoded
> internally with a code point outside the standard space), which rule
> SB4 would use instead of "Sep | CR | LF". That custom entity should be
> "Sep", but without the rule defining it, as there are various ways to
> represent paragraph breaks.
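The preprocessing step proposed above could look roughly like the following Python sketch. It is only an illustration, not anything specified by UAX #29: the function name mark_breaks and the two marker constants are hypothetical, and private-use code points stand in for the "custom entities" (the proposal speaks of code points outside the standard space). It assumes the convention that a doubled newline is a paragraph break and a single one is a line break.

```python
import re

# One newline function (NLF). CRLF is a single unit; the lookahead keeps a
# CR that is followed by LF from being counted as an NLF of its own.
NLF = r"(?:\r\n|\r(?!\n)|\n|\x85)"

# Hypothetical internal markers (private-use code points, chosen arbitrarily).
LINE_BREAK = "\uE000"
PARA_BREAK = "\uE001"

def mark_breaks(text):
    """Replace newline sequences with line-break or paragraph-break markers."""
    # Two or more NLFs in a row, or an explicit PS (U+2029): paragraph break.
    text = re.sub(rf"{NLF}{{2,}}|\u2029", PARA_BREAK, text)
    # Any remaining single NLF, or an explicit LS (U+2028): line break.
    return re.sub(rf"{NLF}|\u2028", LINE_BREAK, text)
```

A sentence breaker run on the result would then treat only PARA_BREAK as a mandatory boundary, for example mark_breaks("a\r\nb\n\nc") marks one line break after "a" and one paragraph break after "b".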
But isn't SB4 contradictory to this from TUS Section 5.8?
R2c In parsing, choose the safest interpretation.
For example, in recommendation R2c an implementer dealing with sentence
break heuristics would reason in the following way that it is safer to
interpret any NLF as LS:
• Suppose an NLF were interpreted as LS, when it was meant to be PS. Because
most paragraphs are terminated with punctuation anyway, this would cause
misidentification of sentence boundaries in only a few cases.
• Suppose an NLF were interpreted as PS, when it was meant to be LS. In this
case, line breaks would cause sentence breaks, which would result in significant
problems with the sentence break heuristics.
It seems to me SB4 is choosing the non-safer way. What am I missing?
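To make the two cases from the TUS passage concrete, here is a small Python sketch. It is a toy splitter, not an implementation of UAX #29: the function naive_sentences and its flag are made up for illustration, and "sentence end" is simplified to ". ! ?" followed by white space.

```python
import re

def naive_sentences(text, newline_is_paragraph_break):
    """Toy sentence splitter showing the effect of the NLF interpretation."""
    if newline_is_paragraph_break:
        # NLF read as PS: every newline forces a sentence boundary.
        pattern = r"(?<=[.!?])\s+|\n"
    else:
        # NLF read as LS: a newline is just white space inside the text.
        text = text.replace("\n", " ")
        pattern = r"(?<=[.!?])\s+"
    return [s for s in re.split(pattern, text) if s]

wrapped = "The quick brown fox\njumps over the dog. It ran\naway."
as_ls = naive_sentences(wrapped, False)  # 2 sentences
as_ps = naive_sentences(wrapped, True)   # 4 fragments
```

On hard-wrapped text the PS reading cuts both sentences in half, while the LS reading only goes wrong when a paragraph ends without punctuation, e.g. naive_sentences("Heading\nFirst sentence.", False) returns a single sentence.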