Loose character-name matching

Richard Wordingham via Unicode unicode at unicode.org
Fri Jan 18 18:53:16 CST 2019


On Thu, 17 Jan 2019 18:44:50 -0500
"J. S. Choi" via Unicode <unicode at unicode.org> wrote:

> I’m implementing a Unicode names library. I’m confused about loose
> character-name matching, even after rereading The Unicode Standard §
> 4.8, UAX #34 § 4, #44 § 5.9.2 – as well as
> [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt
> <http://www.unicode.org/L2/L2013/13142-name-match.txt>),
> [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035
> <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035>), and
> the [meeting in which those two items were
> resolved](https://www.unicode.org/L2/L2014/14026.htm
> <https://www.unicode.org/L2/L2014/14026.htm>).
> 
> In particular, I’m confused by the claim in The Unicode Standard §
> 4.8 saying, “Because Unicode character names do not contain any
> underscore (“_”) characters, a common strategy is to replace any
> hyphen-minus or space in a character name by a single “_” when
> constructing a formal identifier from a character name. This strategy
> automatically results in a syntactically correct identifier in most
> formal languages. Furthermore, such identifiers are guaranteed to be
> unique, because of the special rules for character name matching.”

Unfortunately, the loose matching rules don't distinguish '__' and
'_'.  Note that '__' is sometimes forbidden in identifiers.

> I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.
> 
> To make these issues concrete, let’s say that my library provides a
> function called getCharacter that takes a name argument, tries to
> find a loosely matching character, and then returns it (or a null
> value if there is no currently loosely matching character). So then
> what should the following expressions return?
> 
Loose matching of names may be looser than prescribed; it shall not be
stricter.

> getCharacter(“HANGUL-JUNGSEONG-O-E”)
U+1180 HANGUL JUNGSEONG O-E, or just possibly null.

> getCharacter(“HANGUL_JUNGSEONG_O_E”)
U+116C HANGUL JUNGSEONG OE*

> getCharacter(“HANGUL_JUNGSEONG_O_E_”)
U+116C

> getCharacter(“HANGUL_JUNGSEONG_O__E”)
U+116C

> getCharacter(“HANGUL_JUNGSEONG_O_-E”)
U+1180

> getCharacter(“HANGUL JUNGSEONGCHARACTERO E”)
null or U+116C - up to you.  The sequence 'CHARACTER' shall not
distinguish names, but loose matching is not required to know this fact.

> getCharacter(“HANGUL JUNGSEONG CHARACTER OE”)
null or U+116C - up to you.

> getCharacter(“TIBETAN_LETTER_A”)
U+0F68 TIBETAN LETTER A

> getCharacter(“TIBETAN_LETTER__A”)
U+0F68 TIBETAN LETTER A**

> getCharacter(“TIBETAN_LETTER _A”)
U+0F68

> getCharacter(“TIBETAN_LETTER_-A”)
U+0F60 TIBETAN LETTER -A

*This is unfortunate, as the usual symbolic name for U+1180 would be
HANGUL_JUNGSEONG_O_E.

**This is also unfortunate, as the usual symbolic
name for U+0F60 would be TIBETAN_LETTER__A.
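
To illustrate the two footnotes, here is a small Python sketch. The helper names to_identifier and strip_separators are made up for this example: the first applies the replacement strategy quoted from § 4.8, the second approximates what a loose matcher sees once underscores and spaces are ignored. The identifiers come out distinct, but they stop being distinct as soon as they are fed back through loose matching.

```python
def to_identifier(name):
    # Strategy quoted from The Unicode Standard § 4.8:
    # every hyphen-minus or space becomes a single underscore.
    return name.replace("-", "_").replace(" ", "_")


def strip_separators(identifier):
    # Rough stand-in for loose matching applied to an identifier:
    # underscores and spaces are ignored, and so is case.
    return identifier.replace("_", "").replace(" ", "").lower()


pairs = [
    ("HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E"),  # U+116C, U+1180
    ("TIBETAN LETTER A", "TIBETAN LETTER -A"),        # U+0F68, U+0F60
]

for plain, hyphenated in pairs:
    a = to_identifier(plain)
    b = to_identifier(hyphenated)
    # The identifiers differ, yet their separator-free forms are equal.
    print(a, b, a != b, strip_separators(a) == strip_separators(b))
```

So HANGUL_JUNGSEONG_O_E and TIBETAN_LETTER__A are unique as identifiers, but passing them back as loosely matched names yields U+116C and U+0F68 rather than the characters they were built from.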

The key problem here is that the hyphen after a space is required in
names as understood by the name property.  The hyphen is also required
in "HANGUL JUNGSEONG O-E".  The simple tactic is:

1)      Canonicalise, by stripping out spaces, underscores and medial
hyphens and lowercasing.  (It's probably better to also fold the
character U+0131 LATIN SMALL LETTER DOTLESS I to 'i'.)

2)      Look the result up.

3)      If you get the result U+116C but the input matches
".*[oO]-[eE][_ -]*$", convert to U+1180.
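
The three steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: NAMES is a hypothetical four-entry table standing in for the full name list, get_character mirrors the getCharacter function from the question, and storing U+1180 under a key that keeps its hyphen is just one way of making "O_-E" resolve. The character class is written [_ -] so that the hyphen is taken literally.

```python
import re

# Hypothetical miniature table; a real library would load every
# character name from the Unicode Character Database.
NAMES = {
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
    0x0F68: "TIBETAN LETTER A",
    0x0F60: "TIBETAN LETTER -A",
}


def canonicalise(name):
    """Step 1: lowercase, fold U+0131, drop spaces, underscores, medial hyphens."""
    s = name.lower().replace("\u0131", "i")
    s = s.replace("_", " ")  # underscores are treated like spaces
    # A hyphen counts as medial only with a letter or digit on both sides.
    s = re.sub(r"(?<=[a-z0-9])-(?=[a-z0-9])", "", s)
    return s.replace(" ", "")


def build_table(names):
    table = {}
    for code, name in names.items():
        key = canonicalise(name)
        if code == 0x1180:
            # Assumption of this sketch: keep the hyphen in the key so
            # that U+1180 does not overwrite the entry for U+116C.
            key = "hanguljungseongo-e"
        table[key] = code
    return table


TABLE = build_table(NAMES)


def get_character(name):
    code = TABLE.get(canonicalise(name))  # step 2: look the result up
    # Step 3: a hyphen directly between O and E selects U+1180.
    if code == 0x116C and re.search(r"[oO]-[eE][_ -]*$", name):
        code = 0x1180
    return None if code is None else chr(code)


for query in ("HANGUL-JUNGSEONG-O-E", "HANGUL_JUNGSEONG_O_E",
              "HANGUL_JUNGSEONG_O_-E", "TIBETAN_LETTER__A",
              "TIBETAN_LETTER_-A"):
    ch = get_character(query)
    print(query, "->", None if ch is None else "U+%04X" % ord(ch))
```

With this table the sketch reproduces the answers given above for the listed queries; the two names containing CHARACTER come back as null, which is one of the permitted outcomes.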

Symbolic identifiers in programs need not match the name; one may
choose to depend on the compiler or interpreter to catch duplicates;
some will, some won't.  Replacing '-' by '_' to convert a name to an
identifier loses the distinction between a hyphen and an arbitrarily
inserted space.

Richard.


