String Ranges in Unicode Sets
Richard Wordingham
richard.wordingham at ntlworld.com
Mon Sep 7 14:46:06 CDT 2015
On Mon, 7 Sep 2015 16:54:16 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:
> On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>> By my reading, adding string ranges will initially make regular
>> expression engines that don't use ICU non-compliant with Level 1 of
>> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
>> and
> I don't see where you are getting that. UTS 35 isn't referenced by
> UTS 18 except for some examples of possible extensions in 1.2.3 Other
> Properties, and locale id syntax in level 3. I may be missing
> something, however. Can you tell me where #18 is referencing
> UnicodeSet?
In http://unicode.org/mail-arch/unicode-ml/y2014-m05/0052.html ,
you stated that the Unicode sets referred to in UTS#18 RL1.3 are the
Unicode sets defined in UTS #35. We are now waiting for you to add the
reference under Action 141-A76 - 'Make changes in UTS #18 based on
general feedback in
L2/14-277' (http://www.unicode.org/L2/L2014/14277-pubrev-ovrflw.html).
I presume no change has been made yet because there are no *urgent*
changes for UTS #18.
> String ranges need not be implemented internally (and I don't think
> the CLDR committee would expect them to be, in general). They are
> simply a way of expressing the *string format* of a UnicodeSet in a
> more compact fashion. (And UnicodeSets themselves can have a variety
> of different implementations, in any event).
[\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a
very compact way of expressing a lot of strings. You wouldn't
decompose that into a list of strings.
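As a rough sketch of why enumeration is out of the question, the size of such a range can be counted without building any of its strings. This assumes that, for two bounds of equal length, each code point position varies independently between the corresponding code points of the bounds; that is an assumption about the semantics, and the function name is invented for illustration.

```python
from math import prod

def string_range_size(lower, upper):
    """Count the strings between two equal-length code point sequences.

    Assumption: each position varies independently between the
    corresponding code points of the two bounds.
    """
    if len(lower) != len(upper):
        raise ValueError("bounds must have the same length")
    if any(lo > hi for lo, hi in zip(lower, upper)):
        raise ValueError("each lower code point must not exceed the upper one")
    # Multiply the per-position range sizes instead of listing strings.
    return prod(hi - lo + 1 for lo, hi in zip(lower, upper))

# The four-code-point range quoted above.
count = string_range_size([0x0000] * 4, [0xDFFFF] * 4)
print(count)
```

Under that assumption the count is the fourth power of 0xE0000, far too many strings to hold as a list.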
>> String
>> ranges seem particularly vulnerable to the ill-effects of
>> unpredictable
> UnicodeSets are low level constructs, as are their string
> representations. Like all strings, the string format of a UnicodeSet
> may change if it is normalized. That is nothing new.
> - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A
> through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390
> code points.
> - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN
> SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and
> contain 841 code points.
At least this gives the same range whether normalised to NFC or to
NFD. Using NFD, the preferred normalisation for regular
expressions semi-respecting canonical equivalence, [{x̀}-{ẍ}] would
not include the 2-character string "xa", as both bounds would decompose
to two characters. Using NFC, the preferred normalisation for LDML
(and for XML, I think), this would be a contraction for [{x̀}-{xẍ}],
and would include the 2-character string "xa". If the two strings had
to have the same length, [{x̀}-{ẍ}] would be flagged as erroneous if
interpreted in NFC, and with any luck, similar errors that were not
detected would then also be corrected. It's not perfect, but il meglio
è l’inimico del bene (the best is the enemy of the good).
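To make the two cases above concrete, here is a minimal Python sketch using the standard unicodedata module. It assumes the string bounds are spelt <U+0078, U+0300> and U+1E8D; the helper name bound_lengths is invented for illustration.

```python
import unicodedata

# U+2126 OHM SIGN has a singleton decomposition to U+03A9, so the upper
# bound of the code point range is the same under NFC and NFD.
ohm_forms = {form: unicodedata.normalize(form, "\u2126") for form in ("NFC", "NFD")}
size_raw = 0x2126 - ord("a") + 1                        # 8390 code points
size_normalized = ord(ohm_forms["NFC"]) - ord("a") + 1  # 841 code points

def bound_lengths(form, lower="x\u0300", upper="\u1E8D"):
    """Lengths in code points of the two string bounds after normalization."""
    return tuple(len(unicodedata.normalize(form, s)) for s in (lower, upper))

print(ohm_forms["NFC"] == ohm_forms["NFD"] == "\u03A9")  # True
print(size_raw, size_normalized)                         # 8390 841
print(bound_lengths("NFD"))                              # (2, 2)
print(bound_lengths("NFC"))                              # (2, 1)
```

A rule requiring both bounds to have the same length would reject the NFC spelling here, because the lengths differ.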
> You really don't want to normalize the string format of UnicodeSets.
> Or if you suspect that those string formats might be normalized, then
> just use escaped format \x{...} for anything that might change under
> normalization.
It would probably be sensible to issue a warning if the specification
of a string bound had more than one canonical equivalent.
I'm thinking of accidents. While an XML processor must not be Unicode
compliant, I thought most regular expression engine environments were
allowed to be Unicode compliant.
TUS 8.0 Chapter 3 C6: "A process shall not assume that the
interpretations of two canonical-equivalent character sequences are
distinct."
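The warning suggested above for string bounds could be approximated along these lines. This is only a heuristic sketch: it flags a bound whose spelling would change under NFC or NFD, which is the accident I have in mind, but it does not detect every string with more than one canonical equivalent (for example, reorderable combining marks with no precomposed form). The function name check_string_bound is hypothetical.

```python
import unicodedata
import warnings

def check_string_bound(bound):
    """Warn if `bound` would be respelt by NFC or NFD normalization.

    Returns True when a warning was issued, False otherwise.
    Heuristic only: it does not catch every canonical equivalent.
    """
    changed = [
        form for form in ("NFC", "NFD")
        if unicodedata.normalize(form, bound) != bound
    ]
    if not changed:
        return False
    # Report the bound in escaped form so the message itself is stable.
    escaped = " ".join(f"{ord(ch):04X}" for ch in bound)
    forms = ", ".join(changed)
    warnings.warn(f"string bound \\x{{{escaped}}} changes under {forms}")
    return True

check_string_bound("\u1E8D")   # warns: NFD respells it as two code points
check_string_bound("x\u0300")  # silent: already stable under NFC and NFD
```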
> Note that while it is fine to bring up topics for discussion here (or,
> better yet, on the "cldr-users at unicode.org" <cldr-users at unicode.org>
> list),
As this impacts regular expressions in general, I think this is the
better list for the impact on Unicode sets outside CLDR.
> anything that requires a change will have to be filed as a
> CLDR ticket. Richard, I'm sure you know this, and also raised this
> topic here because of the relation to UTS18, so this is a reminder
> for others.
Exactly.
Richard.