Name Property in Regular Expressions
Karl Williamson
public at khwilliamson.com
Thu Jun 13 20:55:51 CDT 2024
On 5/10/24 02:11, Martin J. Dürst via Unicode wrote:
> Dear Unicoders,
>
> I hope this is more on-topic than the most recent discussions.
>
> I have some questions regarding name properties in regular expressions,
> i.e. about
> https://www.unicode.org/reports/tr18/#Name_Properties
>
> 1) When matching (see also
> https://www.unicode.org/reports/tr44/#Matching_Rules), it's clear that
> "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
> "zerowidthspace", but should something like
> "Ze-rowi-dThsp ace" (hyphens or spaces in the wrong places) also be
> equivalent?
Perl, since version 5.16, released May 2012, implements this, though it
requires explicit enabling. You may find its documentation illuminating
(its references to material outside the excerpt are not relevant here):
LOOSE MATCHES
By specifying ":loose", Unicode's loose character name matching
<http://www.unicode.org/reports/tr44#Matching_Rules> rules are selected
instead of the strict exact match used otherwise. That means that
*CHARNAME* doesn't have to be so precisely specified. Upper/lower case
doesn't matter (except with scripts as mentioned above), nor do any
underscores, and the only hyphens that matter are those at the beginning
or end of a word in the name (with one exception: the hyphen in U+1180
"HANGUL JUNGSEONG O-E" does matter). Also, blanks not adjacent to
hyphens don't matter. The official Unicode names are quite variable as
to where they use hyphens versus spaces to separate word-like units, and
this option allows you to not have to care as much. The reason
non-medial hyphens matter is because of cases like U+0F60 "TIBETAN
LETTER -A" versus U+0F68 "TIBETAN LETTER A". The hyphen here is
significant, as is the space before it, and so both must be included.
":loose" slows down look-ups by a factor of 2 to 3 versus ":full", but
the trade-off may be worth it to you. Each individual look-up takes very
little time, and the results are cached, so the speed difference would
become a factor only in programs that do look-ups of many different
spellings, and probably only when those look-ups are through "vianame()"
and "string_vianame()", since "\N{...}" look-ups are done at compile
time.
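To make the rule concrete, here is a minimal sketch of loose name matching in Python. It is not Perl's implementation; it is only one reading of the UAX #44 rule described above (ignore case, whitespace, underscores, and medial hyphens, except for U+1180). The helper names loose_key, build_table and lookup are made up for this illustration, and the table is built from whatever Unicode version the local Python's unicodedata module ships.

```python
import functools
import re
import unicodedata


def loose_key(name):
    """Reduce a character name to a key for loose comparison (a sketch)."""
    upper = name.upper()
    # The input with only whitespace and underscores removed; used below to
    # detect the single exception, U+1180 HANGUL JUNGSEONG O-E.
    compact = re.sub(r"[\s_]+", "", upper)
    # A medial hyphen has a letter or digit directly on both sides. This
    # must run before whitespace is removed, so that the hyphen in
    # "LETTER -A" (preceded by a space) is kept.
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    key = re.sub(r"[\s_]+", "", no_medial)
    if key == "HANGULJUNGSEONGOE" and compact.endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


@functools.lru_cache(maxsize=None)
def build_table():
    """Map the loose key of every named code point to its character."""
    table = {}
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), "")
        if name:
            table[loose_key(name)] = chr(cp)
    return table


def lookup(name):
    """Return the character whose name loosely matches, or None."""
    return build_table().get(loose_key(name))


if __name__ == "__main__":
    for spelling in ("zero-width space", "zerowidthspace", "Ze-rowi-dThsp ace"):
        ch = lookup(spelling)
        print(spelling, "->", None if ch is None else f"U+{ord(ch):04X}")
```

Under this reading, "Ze-rowi-dThsp ace" reduces to the same key as "ZERO WIDTH SPACE", while "TIBETAN LETTER -A" and "TIBETAN LETTER A" stay distinct because the hyphen after the space is not medial.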
>
> 2) TR 18 suggests wildcards such as \p{name=/ALIEN/}. This looks very
> convenient, but I have doubts that implementation was really considered
> when writing this down. In essence, this would have to run a regular
> expression over close to one megabyte of name data (+some additional
> processing for the algorithmically defined names), just to compile the
> regular expression. (It's possible to speed that up with some clever
> indexing, but this would only add additional complexity and space.)
> So my question is whether anybody actually knows about some
> implementation of this name wildcard feature.
>
Perl, since version 5.32, released June 2020, implements this, though it
is still marked as experimental. I ran
    time perl -l -e 'qr(\p{name=/ALIEN/})'
The output was
    The Unicode property wildcards feature is experimental at -e line 1.
    real 0m00.01s
    user 0m00.01s
    sys 0m00.00s
on a 2-year-old Linux box. Turning on the display of what got compiled gave
    ANYOFRb[1F47D-1F47E] (First UTF-8 byte=F0)
That is basically an assembly language level statement for the perl
pattern matcher. But you can see that there are only two code points
that match in all of Unicode, and that any UTF-8-encoded string that
matches either of these must contain the byte 0xF0. This knowledge
allows the matcher to use a fast hardware instruction to skip over what
are likely to be long stretches of the string being matched. (Of course,
further optimizations would be possible.)
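For comparison, the brute-force approach the question worries about can be sketched in a few lines of Python. This is not how Perl does it; it simply runs a regular expression over every character name known to the local Python's unicodedata module (which also generates the algorithmic names) and collects the matching code points, then derives the set of UTF-8 lead bytes a matcher could use as a prefilter. The function names name_wildcard and utf8_lead_bytes are hypothetical.

```python
import re
import unicodedata


def name_wildcard(pattern):
    """Return the code points whose character name matches `pattern`."""
    rx = re.compile(pattern)
    matches = []
    for cp in range(0x110000):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if name and rx.search(name):
            matches.append(cp)
    return matches


def utf8_lead_bytes(code_points):
    """Return the set of first bytes of the UTF-8 encodings."""
    return {chr(cp).encode("utf-8")[0] for cp in code_points}


if __name__ == "__main__":
    aliens = name_wildcard(r"ALIEN")
    print([f"U+{cp:04X}" for cp in aliens])
    print(sorted(f"{b:02X}" for b in utf8_lead_bytes(aliens)))
```

The resulting list depends on the Unicode version bundled with Python; it should include U+1F47D and U+1F47E, both of which encode in UTF-8 with the lead byte F0.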