From rick at unicode.org Thu Jan 2 11:28:56 2014
From: rick at unicode.org (Rick McGowan)
Date: Thu, 02 Jan 2014 09:28:56 -0800
Subject: Mail list changes for 2014
In-Reply-To: <52C2F5A7.1060802@unicode.org>
References: <529E6190.3030207@unicode.org> <52C2F5A7.1060802@unicode.org>
Message-ID: <52C5A1D8.8070600@unicode.org>

Hello everyone, the cldr-users mail list has now been re-activated.

Regards,
Rick

On 12/31/2013 8:49 AM, Rick McGowan wrote:
> As mentioned in early December, this mail list will be taken off-line
> shortly, and be back after the new year.
> Regards,
> Rick
>
> On 12/3/2013 2:56 PM, Rick McGowan wrote:
>> At the end of the year, we will be changing the mail list server for
>> the public-access mail lists, including this one. The new system will
>> be Gnu "Mailman", an interface familiar to many. This should make it
>> easier for users to handle their subscriptions and options in one
>> place, via the web interface.
>>
>> We will thus be shutting down the public mail lists over the "holiday
>> break" in the final days of 2013, and re-open with the new system in
>> January 2014.
>>
>> Affected mail lists are those listed on the Mail Lists page here:
>> http://www.unicode.org/consortium/distlist.html
>> including Unicode, CLDR-Users, ULI-Users, and Indic.
>>
>> The new mail list system is documented here:
>> http://www.gnu.org/software/mailman/
>>
>

From cdutro at twitter.com Wed Jan 29 17:15:18 2014
From: cdutro at twitter.com (Cameron Dutro)
Date: Wed, 29 Jan 2014 15:15:18 -0800
Subject: Text Segmentation Question
Message-ID:

Hey CLDR users,

I've been working on doing some text segmentation using the rules for SentenceBreak defined in segments/root.xml. I've run up against some unusual behavior that could be a bug in this file (although it's much more likely to be my fault). I'd be much obliged if someone could tell me what's going on.

Take a look at the rule with ID 9. The comment above this rule reads: The operative word here is "*break*".
You would think that the rule then would contain the break boundary symbol (÷), but instead it contains the non-break boundary symbol (×):

( $STerm | $ATerm ) $Close* × ( $Close | $Sp | $Sep | $CR | $LF )

Why is this? Shouldn't a break go here? According to my tests, a break *should* be placed here. Consider this example:

"The. Quick. Brown. Fox"

Rule 9 matches at position 3 in the string: Period ($ATerm) *non-break* space ($Sp). This would be correct if the boundary symbol was a break, but it's not. Because this rule matches, my algorithm continues to look at the next character in the string instead of breaking, giving the wrong results.

Any and all help is greatly appreciated. What am I missing?

-Cameron
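
For reference, here is a minimal sketch of how the ordering of the sentence-break rules plays out on the example above. It assumes the usual UAX #29 reading: the rules are tried in order at every candidate boundary, and a × rule only forbids a break at that one boundary, after which the scan moves on. Only rules 9, 10 and 11 are modelled (rule 11 being the one that places the ÷ after the terminator and any trailing spaces); rules 1 through 8a are left out, and the character classes are rough ASCII stand-ins rather than the real CLDR variable definitions. The names `sentence_breaks` and `segments` are hypothetical.

```python
import re

# Simplified stand-ins for the property classes (assumptions, not the CLDR definitions).
SATERM = r"[.!?]"      # $STerm | $ATerm
CLOSE = r"[)\]'\"]"    # $Close
SP = r"[ \t]"          # $Sp
SEP = r"[\r\n]"        # $Sep | $CR | $LF

# Left-hand contexts are anchored with \Z so they must end exactly at the boundary.
RULE_9_LEFT = re.compile(rf"{SATERM}{CLOSE}*\Z")
RULE_9_RIGHT = re.compile(rf"{CLOSE}|{SP}|{SEP}")
RULE_10_LEFT = re.compile(rf"{SATERM}{CLOSE}*{SP}*\Z")
RULE_10_RIGHT = re.compile(rf"{SP}|{SEP}")
RULE_11_LEFT = re.compile(rf"{SATERM}{CLOSE}*{SP}*{SEP}?\Z")


def sentence_breaks(text):
    """Return the offsets at which this reduced rule set places a break."""
    breaks = []
    for i in range(1, len(text)):
        left, right = text[:i], text[i]
        # Rule 9 (×): no break between the terminator and a following close/space/separator.
        if RULE_9_LEFT.search(left) and RULE_9_RIGHT.match(right):
            continue
        # Rule 10 (×): no break inside the run of trailing spaces.
        if RULE_10_LEFT.search(left) and RULE_10_RIGHT.match(right):
            continue
        # Rule 11 (÷): break once the terminator and its trailing material are consumed.
        if RULE_11_LEFT.search(left):
            breaks.append(i)
        # Rule 12: otherwise, no break.
    return breaks


def segments(text):
    """Split text at the offsets reported by sentence_breaks."""
    cuts = [0] + sentence_breaks(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


print(segments("The. Quick. Brown. Fox"))
```

Under these assumptions, rule 9 suppresses the break between the period and the space, and rule 11 then fires one position later, so each trailing space stays attached to the sentence it follows. The full rule set would behave differently in cases this sketch ignores, for example when the word after the period starts with a lowercase letter.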