String Ranges in Unicode Sets

Mark Davis ☕️ mark at macchiato.com
Tue Sep 8 02:14:44 CDT 2015


Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Mon, Sep 7, 2015 at 9:46 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Mon, 7 Sep 2015 16:54:16 +0200
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> > On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <
> > richard.wordingham at ntlworld.com> wrote:
>
> >> By my reading, adding string ranges will initially make regular
> >> expression engines that don't use ICU non-compliant with Level 1 of
> >> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
> >> and
>
> > I don't see where you are getting that. UTS 35 isn't referenced by
> > UTS 18 except for some examples of possible extensions in 1.2.3 Other
> > Properties, and locale id syntax in level 3. I may be missing
> > something, however. Can you tell me where #18 is referencing
> > UnicodeSet?
>
> In http://unicode.org/mail-arch/unicode-ml/y2014-m05/0052.html ,
> you stated that the Unicode sets referred to in UTS#18 RL1.3 are the
> Unicode sets defined in UTS #35.  We are now waiting for you to add the
> reference under Action 141-A76 - 'Make changes in UTS #18 based on
> general feedback in
> L2/14-277' (http://www.unicode.org/L2/L2014/14277-pubrev-ovrflw.html).
>

Good point. I tend to think that any new syntax would need to be
approached carefully, and might only be mentioned as optional at first. But
you'll get a chance for public review once you see them.


> I presume no change has been made yet because there are no *urgent*
> changes for UTS #18.
>

Right, it was backed up behind Unicode 8.0.


> > String ranges need not be implemented internally (and I don't think
> > the CLDR committee would expect them to be, in general). They are
> > simply a way of expressing the *string format* of a UnicodeSet in a
> > more compact fashion. (And UnicodeSets themselves can have a variety
> > of different implementations, in any event).
>
> [\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a
> very compact way of expressing a lot of strings.  You wouldn't
> decompose that into a list of strings.
>

Clearly there will be various memory/performance issues that would need to
be taken into account. Not every implementation will be designed to handle
extreme cases, and some may simply not allow the creation of such a set. Not
every string can be parsed by a BigDecimal system, etc. Not every regex
expression can be used (without DoS) on common implementations, and so on.
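
To make the size concern concrete, here is a rough Python sketch that counts
the strings in the range quoted above without enumerating them. It assumes
each code point position varies independently between its bounds, which is an
assumption about string range semantics and not something stated in this
thread. The names string_range_size, check_expandable and MAX_STRINGS are
hypothetical, chosen only for illustration.

```python
from math import prod

# Hypothetical cap on how many strings an implementation is willing to expand.
MAX_STRINGS = 1_000_000


def string_range_size(low, high):
    """Count the strings in a range given as two equal-length code point lists.

    Assumption: position i ranges independently from low[i] to high[i].
    """
    if len(low) != len(high):
        raise ValueError("bounds must have the same number of code points")
    if any(lo > hi for lo, hi in zip(low, high)):
        raise ValueError("each lower bound must not exceed its upper bound")
    return prod(hi - lo + 1 for lo, hi in zip(low, high))


def check_expandable(low, high, limit=MAX_STRINGS):
    """Return the size of the range, or refuse if it exceeds the limit."""
    size = string_range_size(low, high)
    if size > limit:
        raise OverflowError(f"string range has {size} members, limit is {limit}")
    return size


# The range from the quoted example: four positions, each 0x0000..0xDFFFF.
size = string_range_size([0x0000] * 4, [0xDFFFF] * 4)
print(size)
```

Under that assumption the count is 917504 to the fourth power, so an
implementation could reject the set up front, as check_expandable does,
instead of attempting to build it.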


> >> String ranges seem particularly vulnerable to the ill-effects of
> >> unpredictable
>
> > UnicodeSets are low level constructs, as are their string
> > representations. Like all strings, the string format of a UnicodeSet
> > may change if it is normalized. That is nothing new.
>
> >    - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A
> > through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390
> > code points.
> >    - Under NFC it  would change to "[a-Ω]" (that is,  U+0061 LATIN
> > SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and
> > contain 841 code points.
>
> At least this gives the same range whether normalised to NFC or to
> NFD.  Using NFD, the preferred normalisation for regular
> expressions semi-respecting canonical equivalence, [{x̀}-{ẍ}] would
> not include the 2-character string "xa", as both bounds would decompose
> to two characters.  Using NFC, the preferred normalisation for LDML
> (and for XML, I think), this would be a contraction for [{x̀}-{xẍ}],
> and would include the 2-character string "xa".



> If the two strings had
> to have the same length, [{x̀}-{ẍ}] would be flagged as erroneous if
> interpreted in NFC,


If you look at the text in
http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Lists_of_Code_Points,
there was already a restriction on the lengths.


> and with any luck, similar errors that were not
> detected would then also be corrected.  It's not perfect, but


I think that would just give people a false sense of security. Normalizing
the string format of a UnicodeSet (or regex) can change what the set
matches, pretty dramatically, and is to be avoided (or, as I said, one
should use escaped strings where it can't be avoided).
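
The "[a-Ω]" case quoted earlier in this message can be checked with a few
lines of Python. This is only a sketch using the standard unicodedata module;
range_size is a hypothetical helper that handles nothing beyond a plain "X-Y"
range of single code points.

```python
import unicodedata


def range_size(range_text):
    """Return the number of code points in a simple 'X-Y' range string."""
    first, last = range_text.split("-")
    return ord(last) - ord(first) + 1


raw = "a-\u2126"                          # U+0061 through U+2126 OHM SIGN
nfc = unicodedata.normalize("NFC", raw)   # OHM SIGN becomes U+03A9

print(range_size(raw), range_size(nfc))

# An escaped bound is plain ASCII, so normalization leaves it untouched.
escaped = r"a-\x{2126}"
print(unicodedata.normalize("NFC", escaped) == escaped)
```

The two sizes differ because NFC replaces the upper bound with a different
code point, while the escaped form passes through normalization unchanged.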


> il meglio
> è l’inimico del bene.
>

​LOL​



> > You really don't want to normalize the string format of UnicodeSets.
> > Or if you suspect that those string formats might be normalized, then
> > just use escaped format \x{...} for anything that might change under
> > normalization.
>
> It would probably be sensible to issue a warning if the specification
> of a string bound had more than one canonical equivalent.
>

Issuing a warning works in a UI. Not necessarily so well in production
code...
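
For code that cannot show a warning, one alternative is to test each string
bound and escape it when it would change under normalization. The Python
sketch below is a narrower check than the one Richard describes: it only
asks whether the text is stable under NFC and NFD, not whether other
canonical equivalents exist. The function names and the space-separated
\x{...} output are assumptions modelled on the notation used in this thread.

```python
import unicodedata


def is_normalization_stable(text):
    """True if neither NFC nor NFD normalization would alter the text."""
    return all(unicodedata.normalize(form, text) == text
               for form in ("NFC", "NFD"))


def escape_bound(text):
    r"""Write a bound as \x{...} with space-separated hex code points."""
    return "\\x{" + " ".join(f"{ord(ch):04X}" for ch in text) + "}"


def safe_bound(text):
    """Escape the bound only when normalization could change it."""
    return text if is_normalization_stable(text) else escape_bound(text)


for bound in ["a", "\u00e1", "\u2126", "x\u0300"]:
    print(repr(bound), "->", safe_bound(bound))
```

Because the escaped output is ASCII only, a later normalization pass over
the pattern text cannot change which code points the bound denotes.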
​


>
> I'm thinking of accidents.  While an XML processor must not be Unicode
> compliant, I thought most regular expression engine environments were
> allowed to be Unicode compliant.
>
> TUS 8.0 Chapter 3 C6: "A process shall not assume that the
> interpretations of two canonical-equivalent character sequences are
> distinct."
>

​A compiler will take source code containing String x="á"; and compile it
to a certain binary. If that same source code is NFD'd, the compiler will
produce a different result.

Do you really think that such a compiler is not compliant with Unicode? If
so, then we should add some more clarifications around C6.
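
The compiler point can be illustrated in Python. This sketch stands in for a
compiler by simply extracting the string literal and encoding it as UTF-8;
literal_bytes is a hypothetical helper, not a model of any real compiler.

```python
import unicodedata

# The same source line, once in NFC and once in NFD.
source_nfc = unicodedata.normalize("NFC", 'String x="\u00e1";')
source_nfd = unicodedata.normalize("NFD", source_nfc)


def literal_bytes(source):
    """Return the UTF-8 bytes of the text between the first pair of quotes."""
    start = source.index('"') + 1
    end = source.index('"', start)
    return source[start:end].encode("utf-8")


print(literal_bytes(source_nfc))  # one code point, U+00E1
print(literal_bytes(source_nfd))  # two code points, U+0061 U+0301
```

The two byte sequences differ even though the two source lines are
canonically equivalent, which is the behaviour described above.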


> > Note that while it is fine to bring up topics for discussion here (or,
> > better yet, on the "cldr-users at unicode.org" <cldr-users at unicode.org>
> > list),
>
> As this impacts regular expressions in general, I think this is the
> better list for the impact on Unicode sets outside CLDR.
>

​​


> > anything that requires a change will have to be filed as a
> > CLDR ticket. Richard, I'm sure you know this, and also raised this
> > topic here because of the relation to UTS18, so this is a reminder
> > for others.
>
> Exactly.
>
> Richard.
>
>