String Ranges in Unicode Sets

Richard Wordingham richard.wordingham at
Tue Sep 8 17:01:35 CDT 2015

On Tue, 8 Sep 2015 13:46:48 +0200
Mark Davis ☕️ <mark at> wrote:

> On Tue, Sep 8, 2015 at 9:53 AM, Asmus Freytag (t)
> <asmus-inc at> wrote:


> > What about set operations on sets with string ranges?

> Again, the range notation is just a formatting issue. Anything you
> can do with [{ax}-{bz}] you can also do with
> [{ax}{ay}{az}{bx}{by}{bz}], and vice versa, since the former is
> defined to be equivalent to the latter. These are just string
> representations of the same *logical* underlying implementation.

> > Can they be expressed (other than working them out and writing down
> > the full enumeration of the resulting set)?

> I'm not quite sure what you mean. That's like asking, "Can [a-z] be
> expressed, other than by writing out the full enumeration [a b c d
> e ... z]?". Well, yes. You could represent [a-z] in many ways:
> [\p{ASCII}&\p{lu}], for example. Or [\u0061 \u0062 ...]. Or....

> But I'm probably misunderstanding what you are trying to say.

I think Asmus is asking if there is a more compact representation of
the result of a set operation than just listing all the string
elements.  The answer would then be yes.  Just as [a-z]~~[e-s] can be
written (and represented internally) as [a-dt-z], so
[{aa}-{zz}]-[{ee}-{ss}] can be written (and represented internally) as
the union of four non-overlapping string ranges [{aa}-{dz} {ea}-{sd}
{et}-{sz} {ta}-{zz}].  Fortunately, unions of string ranges of the same
length commute, which is not necessarily the case for Unicode sets.
(It is possible that [[a][{ab}]] might preferentially match "a" while
[[{ab}][a]] preferentially matches "ab".)
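
A decomposition of this kind can be checked mechanically.  What follows
is only a sketch in Python, not ICU's UnicodeSet: expand() is a
hypothetical helper, and it assumes the position-by-position reading of
string ranges quoted above, under which [{ax}-{bz}] means
[{ax}{ay}{az}{bx}{by}{bz}].

```python
from itertools import product


def expand(first, last):
    """Expand the string range {first}-{last} position by position.

    Assumption: a string range is the cross product of the per-position
    character ranges, so {ax}-{bz} yields ax, ay, az, bx, by, bz.
    """
    if len(first) != len(last):
        raise ValueError("range endpoints must have the same length")
    columns = [
        [chr(c) for c in range(ord(lo), ord(hi) + 1)]
        for lo, hi in zip(first, last)
    ]
    return {"".join(chars) for chars in product(*columns)}


# Left-hand side: the set difference, fully enumerated.
difference = expand("aa", "zz") - expand("ee", "ss")

# Right-hand side: four string ranges.
pieces = [
    ("aa", "dz"),  # first letter a-d, any second letter
    ("ea", "sd"),  # first letter e-s, second letter a-d
    ("et", "sz"),  # first letter e-s, second letter t-z
    ("ta", "zz"),  # first letter t-z, any second letter
]
expanded = [expand(lo, hi) for lo, hi in pieces]
union = set().union(*expanded)

# The four pieces cover the difference exactly and do not overlap:
# 676 two-letter strings minus the 225 in {ee}-{ss} leaves 451.
assert union == difference
assert sum(len(p) for p in expanded) == len(difference) == 451
```

The enumeration is used here only to verify the result; a real
implementation would presumably store the four ranges themselves rather
than the 451 strings.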
