Text Segmentation Question

Cameron Dutro cdutro at twitter.com
Wed Jan 29 17:15:18 CST 2014


Hey CLDR users,

I've been working on text segmentation using the SentenceBreak rules
defined in segments/root.xml
<http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml>.
I've run up against some unusual behavior that could be a bug in this file
(although it's much more likely to be my fault). I'd be much obliged if
someone could tell me what's going on.

Take a look at the rule with ID 9. The comment above this rule reads:

<!-- Break after sentence terminators, but include closing punctuation,
trailing spaces, and (optionally) a paragraph separator. -->

The operative word here is "*break*". You would think that the rule then
would contain the break boundary symbol (÷), but instead it contains the
non-break boundary symbol (×):

<rule id="9"> ( $STerm | $ATerm ) $Close* × ( $Close | $Sp | $Sep | $CR | $LF ) </rule>

Why is this? Shouldn't a break go here? According to my tests, a break
*should* be placed here. Consider this example:

"The. Quick. Brown. Fox"

Rule 9 matches at position 3 in the string: period ($ATerm), *non-break*,
space ($Sp). This would be correct if the boundary symbol were a break, but
it's not. Because this rule matches, my algorithm moves on to the next
character in the string instead of breaking, which gives the wrong results.
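
To make the example concrete, here is a minimal Python sketch of where rule 9
matches in that string. The character classes are simplified ASCII stand-ins
that I'm assuming for $STerm, $ATerm, $Close, $Sp and $Sep/$CR/$LF (not the
real Unicode property sets), and rule9_positions is a hypothetical helper, not
anything from CLDR:

```python
import re

# Simplified stand-ins for the CLDR variables (assumptions, not the real sets).
STERM = r"[!?]"
ATERM = r"\."
CLOSE = r'[")\]]'
SP = r" "
SEP_CR_LF = r"[\r\n\u0085\u2028\u2029]"

# Rule 9 split around its boundary symbol: the left side must end exactly at
# the boundary, the right side must start exactly at the boundary.
RULE9_LEFT = re.compile(rf"(?:{STERM}|{ATERM}){CLOSE}*\Z")
RULE9_RIGHT = re.compile(rf"(?:{CLOSE}|{SP}|{SEP_CR_LF})")


def rule9_positions(text):
    """Return the boundary offsets at which rule 9 (a no-break rule) matches."""
    return [
        i
        for i in range(1, len(text))
        if RULE9_LEFT.search(text[:i]) and RULE9_RIGHT.match(text[i:])
    ]


print(rule9_positions("The. Quick. Brown. Fox"))  # [4, 11, 18]
```

Each reported offset is the boundary between a period and the space that
follows it; offset 4 is the boundary right after the period at position 3.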

Any and all help is greatly appreciated. What am I missing?

-Cameron