String Ranges in Unicode Sets

Richard Wordingham richard.wordingham at
Mon Sep 7 14:46:06 CDT 2015

On Mon, 7 Sep 2015 16:54:16 +0200
Mark Davis ☕️ <mark at> wrote:

> On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
> richard.wordingham at> wrote:

>> By my reading, adding string ranges will initially make regular
>> expression engines that don't use ICU non-compliant with Level 1 of
>> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
>> and

> I don't see where you are getting that. UTS 35 isn't referenced by
> UTS 18 except for some examples of possible extensions in 1.2.3 Other
> Properties, and locale id syntax in level 3. I may be missing
> something, however. Can you tell me where #18 is referencing
> UnicodeSet?

In ,
you stated that the Unicode sets referred to in UTS#18 RL1.3 are the
Unicode sets defined in UTS #35.  We are now waiting for you to add the
reference under Action 141-A76 - 'Make changes in UTS #18 based on
general feedback in
L2/14-277' (
I presume no change has been made yet because there are no *urgent*
changes for UTS #18.

> String ranges need not be implemented internally (and I don't think
> the CLDR committee would expect them to be, in general). They are
> simply a way of expressing the *string format* of a UnicodeSet in a
> more compact fashion. (And UnicodeSets themselves can have a variety
> of different implementations, in any event).

[\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a
very compact way of expressing a lot of strings.  You wouldn't
decompose that into a list of strings.
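
To give a feel for the scale, here is a minimal sketch that counts the strings in such a range without enumerating them. It assumes, as the string-range proposal appears to intend, that both bounds have the same number of code points and that each position ranges independently between its own lower and upper bound. The function name string_range_size is hypothetical and is not part of ICU or CLDR.

```python
def string_range_size(lower, upper):
    """Count the strings in a string range given as lists of code points.

    Assumption: the range is the cross product of the per-position
    code point ranges, and both bounds have the same length.
    """
    if len(lower) != len(upper):
        raise ValueError("bounds must have the same number of code points")
    size = 1
    for lo, hi in zip(lower, upper):
        if lo > hi:
            raise ValueError("lower bound exceeds upper bound at some position")
        size *= hi - lo + 1
    return size


# The bounds from the example above: four code points each,
# U+0000 at every position up to U+DFFFF at every position.
# Surrogate code points are not excluded in this count.
lower = [0x0000] * 4
upper = [0xDFFFF] * 4
size = string_range_size(lower, upper)
print(size)
```

The count is (0xDFFFF + 1) to the fourth power, which is why such a set has to be kept as a range rather than expanded.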

>> String ranges seem particularly vulnerable to the ill-effects of
>> unpredictable

> UnicodeSets are low level constructs, as are their string
> representations. Like all strings, the string format of a UnicodeSet
> may change if it is normalized. That is nothing new.

>    - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A
> through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390
> code points.
>    - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL
> LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA) and would
> contain 841 code points.
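
The two counts quoted above can be checked with Python's standard unicodedata module. This is only an illustrative sketch; range_size is a hypothetical helper, not a UnicodeSet API.

```python
import unicodedata


def range_size(first, last):
    """Number of code points in the inclusive range first..last."""
    return ord(last) - ord(first) + 1


ohm = "\u2126"  # OHM SIGN
nfc_ohm = unicodedata.normalize("NFC", ohm)  # canonical singleton mapping

before = range_size("a", ohm)      # range as written
after = range_size("a", nfc_ohm)   # range after NFC of the upper bound

print(f"U+{ord(nfc_ohm):04X}", before, after)  # prints: U+03A9 8390 841
```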

At least this gives the same range whether normalised to NFC or to
NFD.  Using NFD, the preferred normalisation for regular
expressions semi-respecting canonical equivalence, [{x̀}-{ẍ}] would
not include the 2-character string "xa", as both bounds would decompose
to two characters.  Using NFC, the preferred normalisation for LDML
(and for XML, I think), this would be a contraction for [{x̀}-{xẍ}],
and would include the 2-character string "xa".  If the two strings had
to have the same length, [{x̀}-{ẍ}] would be flagged as erroneous if
interpreted in NFC, and with any luck, similar errors that were not
detected would then also be corrected.  It's not perfect, but il meglio
è l’inimico del bene (the best is the enemy of the good).
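
The length mismatch can be seen with a small sketch using Python's unicodedata module. It assumes the two bounds are x followed by U+0300 COMBINING GRAVE ACCENT and x followed by U+0308 COMBINING DIAERESIS (the latter composing to U+1E8D under NFC). The helper bound_lengths is hypothetical.

```python
import unicodedata


def bound_lengths(lower, upper, form):
    """Lengths in code points of both bounds after normalising to `form`."""
    return tuple(len(unicodedata.normalize(form, s)) for s in (lower, upper))


lower = "x\u0300"  # x + combining grave: no precomposed form exists
upper = "x\u0308"  # x + combining diaeresis: composes to U+1E8D under NFC

nfd = bound_lengths(lower, upper, "NFD")
nfc = bound_lengths(lower, upper, "NFC")

print(nfd)  # prints: (2, 2)
print(nfc)  # prints: (2, 1)

# A same-length rule would accept the NFD form and reject the NFC form.
same_length_in_nfd = nfd[0] == nfd[1]
same_length_in_nfc = nfc[0] == nfc[1]
```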

> You really don't want to normalize the string format of UnicodeSets.
> Or if you suspect that those string formats might be normalized, then
> just use escaped format \x{...} for anything that might change under
> normalization.

It would probably be sensible to issue a warning if the specification
of a string bound had more than one canonical equivalent. 
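
A rough sketch of such a warning follows. It is a partial check under stated assumptions: it flags a bound only when the string differs from its own NFC form, or when its NFC and NFD forms differ. It will miss some cases, for example U+03A9, whose only other canonical equivalent is the singleton U+2126, and sequences whose combining marks can merely be reordered. The function name is hypothetical.

```python
import unicodedata
import warnings


def surely_has_other_equivalents(s):
    """True if s certainly has another canonically equivalent spelling.

    Sufficient but not necessary: a False result does not prove that
    the string is the only member of its canonical equivalence class.
    """
    nfc = unicodedata.normalize("NFC", s)
    nfd = unicodedata.normalize("NFD", s)
    return s != nfc or nfc != nfd


def check_bound(s):
    """Warn about a string bound that normalisation could change."""
    if surely_has_other_equivalents(s):
        warnings.warn(f"string bound {s!r} has more than one canonical equivalent")
        return False
    return True


flags = {s: surely_has_other_equivalents(s)
         for s in ("\u1e8d", "x\u0308", "\u2126", "xa", "\u03a9")}
```

Escaping such bounds with \x{...}, as suggested above, would silence the warning only if the check runs on the unescaped text, so a real implementation would need to decide at which stage to apply it.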

I'm thinking of accidents.  While an XML processor must not be Unicode
compliant, I thought most regular expression engine environments were
allowed to be Unicode compliant.

TUS 8.0 Chapter 3 C6: "A process shall not assume that the
interpretations of two canonical-equivalent character sequences are
distinct."

> Note that while it is fine to bring up topics for discussion here (or,
> better yet, on the "cldr-users at" <cldr-users at>
> list),

As this impacts regular expressions in general, I think this is the
better list for the impact on Unicode sets outside CLDR.

> anything that requires a change will have to be filed as a
> CLDR ticket. Richard, I'm sure you know this, and also raised this
> topic here because of the relation to UTS18, so this is a reminder
> for others.


