Name Property in Regular Expressions

Thu Jun 13 20:55:51 CDT 2024

On 5/10/24 02:11, Martin J. Dürst via Unicode wrote:
> Dear Unicoders,
> 
> I hope this more on-topic than the most recent discussions.
> 
> I have some questions regarding name properties in regular expressions, 
> i.e. about
> https://www.unicode.org/reports/tr18/#Name_Properties
> 
> 1) When matching (see also 
> https://www.unicode.org/reports/tr44/#Matching_Rules), it's clear that 
> "zero-width space" is equivalent to "ZERO WIDTH SPACE" or 
> "zerowidthspace", but should something like
> "Ze-rowi-dThsp ace" (hyphens or spaces in the wrong places) also be 
> equivalent?

Perl has implemented this since version 5.16 (released May 2012), though
it must be enabled explicitly.  You may find its documentation
illuminating (its references to text outside the quoted passage are not
relevant here):

LOOSE MATCHES
      By specifying ":loose", Unicode's loose character name matching
      <http://www.unicode.org/reports/tr44#Matching_Rules> rules are selected
      instead of the strict exact match used otherwise. That means that
      *CHARNAME* doesn't have to be so precisely specified. Upper/lower case
      doesn't matter (except with scripts as mentioned above), nor do any
      underscores, and the only hyphens that matter are those at the beginning
      or end of a word in the name (with one exception: the hyphen in U+1180
      "HANGUL JUNGSEONG O-E" does matter). Also, blanks not adjacent to
      hyphens don't matter. The official Unicode names are quite variable as
      to where they use hyphens versus spaces to separate word-like units, and
      this option allows you to not have to care as much. The reason
      non-medial hyphens matter is because of cases like U+0F60 "TIBETAN
      LETTER -A" versus U+0F68 "TIBETAN LETTER A". The hyphen here is
      significant, as is the space before it, and so both must be included.

      ":loose" slows down look-ups by a factor of 2 to 3 versus ":full", but
      the trade-off may be worth it to you. Each individual look-up takes very
      little time, and the results are cached, so the speed difference would
      become a factor only in programs that do look-ups of many different
      spellings, and probably only when those look-ups are through "vianame()"
      and "string_vianame()", since "\N{...}" look-ups are done at compile
      time.
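
To make the loose rule concrete, here is a minimal Python sketch of a
UAX44-LM2-style comparison key.  It is not Perl's implementation and not
a complete one: the function name loose_key is hypothetical, a "medial"
hyphen is assumed to mean a hyphen with a letter or digit on both sides
in the user's input, and all blanks are dropped, which is a cruder
treatment of blanks than the Perl text above describes.

```python
import re

# Assumption: a "medial" hyphen is one with an ASCII letter or digit
# directly on each side in the (uppercased) input string.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")


def loose_key(name: str) -> str:
    """Return a comparison key; two names match loosely if keys are equal."""
    upper = name.upper()
    # Drop medial hyphens BEFORE removing blanks, so that the hyphen in
    # "LETTER -A" does not become medial as a side effect.
    key = _MEDIAL_HYPHEN.sub("", upper)
    # Drop whitespace and underscores.
    key = re.sub(r"[\s_]", "", key)
    # The one exception: U+1180 HANGUL JUNGSEONG O-E keeps its hyphen,
    # otherwise it would collide with U+116C HANGUL JUNGSEONG OE.
    if key == "HANGULJUNGSEONGOE" and re.search(r"O-E[\s_]*$", upper):
        key = "HANGULJUNGSEONGO-E"
    return key


# Under this sketch the oddly hyphenated spelling from the question
# produces the same key as the official name.
same = loose_key("Ze-rowi-dThsp ace") == loose_key("ZERO WIDTH SPACE")
# The non-medial hyphen keeps U+0F60 and U+0F68 apart.
distinct = loose_key("TIBETAN LETTER -A") != loose_key("TIBETAN LETTER A")
```

Under these assumptions, hyphens or spaces in the "wrong" places inside
a word simply vanish from the key, which is why such spellings end up
equivalent.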

> 
> 2) TR 18 suggests wildcards such as \p{name=/ALIEN/}. This looks very 
> convenient, but I have doubts that implementation was really considered 
> when writing this down. In essence, this would have to run a regular 
> expression over close to one megabyte of name data (+some additional 
> processing for the algorithmically defined names), just to compile the 
> regular expression. (It's possible to speed that up with some clever 
> indexing, but this would only add additional complexity and space.)
> So my question is whether anybody actually knows about some 
> implementation of this name wildcard feature.
> 

Perl, since version 5.32, released June 2020, implements this, though it 
is still marked as experimental.  I ran

time perl -l -e 'qr(\p{name=/ALIEN/})'

The output was

The Unicode property wildcards feature is experimental at -e line 1.
real    0m00.01s
user    0m00.01s
sys     0m00.00s

on a two-year-old Linux box.  Turning on the display of what got compiled gave

ANYOFRb[1F47D-1F47E] (First UTF-8 byte=F0)

That is basically an assembly-language-level statement for the Perl
pattern matcher.  It shows that only two code points in all of Unicode
match, and that any UTF-8-encoded string matching either of them must
contain the byte 0xF0.  This knowledge lets the matcher use a fast
hardware instruction to rule out what are likely to be long stretches of
the string being matched.  (Of course, further optimizations would be
possible.)
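
For comparison, the brute-force approach the question worries about can
be sketched in a few lines of Python.  This is only an illustration, not
how Perl does it: names_matching and to_ranges are hypothetical helper
names, and the results depend on the Unicode version bundled with
Python's unicodedata module.  It runs a regular expression over every
character name, collapses the hits into ranges, and collects the first
UTF-8 byte of each hit.

```python
import re
import unicodedata


def names_matching(pattern: str) -> list[int]:
    """Code points whose character name matches the regular expression."""
    rx = re.compile(pattern)
    hits = []
    for cp in range(0x110000):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(cp), "")
        if name and rx.search(name):
            hits.append(cp)
    return hits


def to_ranges(cps: list[int]) -> list[tuple[int, int]]:
    """Collapse a sorted list of code points into inclusive ranges."""
    ranges: list[tuple[int, int]] = []
    for cp in cps:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current range
        else:
            ranges.append((cp, cp))            # start a new range
    return ranges


aliens = names_matching("ALIEN")
alien_ranges = to_ranges(aliens)
# First UTF-8 byte of every matching character, usable as a prefilter.
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in aliens}

for lo, hi in alien_ranges:
    print(f"{lo:04X}-{hi:04X}")
```

The scan visits every code point once per pattern, so a real
implementation would want to do it at pattern-compile time and cache the
resulting set, as the Perl output above suggests.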