String Ranges in Unicode Sets

Richard Wordingham richard.wordingham at ntlworld.com
Mon Sep 7 01:23:21 CDT 2015


On Thu, 03 Sep 2015 09:32:42 -0700
Rick McGowan <rick at unicode.org> wrote:

> A proposed update to the LDML specification (UTS #35) will be
> available for review as of Monday, September 7 at 06:00 GMT. The open
> review period closes on Monday, September 14 at 06:00 GMT. (This is a
> short review period, because CLDR 28 is scheduled for release in the
> week of September 16.)
> 
> The proposed update will be at
> http://unicode.org/reports/tr35/proposed.html
> 
> To report bugs in the specification, please use 
> http://unicode.org/cldr/trac/newticket
> 

Have the implications of adding string ranges to Unicode sets been
considered?  I'm mentioning them on the list because their impact goes
beyond locales, and I haven't worked out their implications myself.

By my reading, adding string ranges will initially make regular
expression engines that don't use ICU non-compliant with Level 1 of
UTS #18 Unicode Regular Expressions, in particular RL1.3 'Subtraction
and Intersection'.  I don't imagine the extra work of implementing set
operations on Unicode sets containing string ranges will be popular.
The burden may be worst for the minority of regular expression engines
that actually exploit the regularity of regular expressions.
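
As a rough illustration of the extra work, here is a minimal Python
sketch.  It assumes, purely for the sake of argument, that a string
range with equal-length endpoints stands for the cross product of the
per-position code point ranges; the proposed text should be consulted
for the actual definition.  The helper name expand_string_range is
hypothetical and is not part of ICU or any other library.

```python
from itertools import product


def expand_string_range(start, end):
    """Expand a string range into an explicit set of strings.

    ASSUMPTION (illustration only): each position ranges independently
    from start[i] to end[i], and the results are crossed.
    """
    if len(start) != len(end):
        raise ValueError("this sketch only handles equal-length endpoints")
    # One list of candidate characters per position in the string.
    columns = [
        [chr(cp) for cp in range(ord(lo), ord(hi) + 1)]
        for lo, hi in zip(start, end)
    ]
    return {"".join(chars) for chars in product(*columns)}


# Under the assumption above, {ab}-{cd} covers a..c crossed with b..d.
range_set = expand_string_range("ab", "cd")

# Subtracting a single string leaves a set that is no longer one range.
difference = range_set - {"bc"}
```

Once a single string is subtracted, the result can no longer be written
as one string range, so an engine has to either enumerate the strings or
carry around a more elaborate representation.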

I note that the safety feature of requiring the start and end points
to have the same length has been removed from their design.  String
ranges seem particularly vulnerable to the ill-effects of unpredictable
normalisation.
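
To make the normalisation concern concrete, here is a small Python
sketch using the standard unicodedata module.  The two endpoint strings
are examples of my own choosing, not taken from the proposal.  Both are
two code points long as written, but only one of them has a precomposed
form, so they no longer have the same length after conversion to NFC.

```python
import unicodedata

# Example endpoints (chosen for illustration, not from the proposal).
start = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: two code points
end = "q\u0301"    # 'q' + COMBINING ACUTE ACCENT: two code points

# NFC composes 'e' + acute into U+00E9, but 'q' + acute has no
# precomposed form and is left as two code points.
nfc_start = unicodedata.normalize("NFC", start)
nfc_end = unicodedata.normalize("NFC", end)

print(len(start), len(end))          # lengths before normalisation
print(len(nfc_start), len(nfc_end))  # lengths after normalisation
```

A same-length check applied before normalisation would accept this pair,
while the same check applied afterwards would reject it.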

Richard.

