Loose character-name matching

"J. S. Choi" via Unicode unicode at unicode.org
Sun Jan 20 16:13:08 CST 2019


Thanks for the reply. These answers make sense. 

However, I am still confused by that passage from the Standard in § 4.8. To review, it says: “Because Unicode character names do not contain any underscore (“_”) characters, a common strategy is to replace any hyphen-minus or space in a character name by a single “_” when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching.” 

How is this system supposed to encode names with non-medial hyphens (or U+1180’s name)? Many (most?) programming languages disallow both spaces and hyphens in identifiers. For instance, among the most popular programming languages as ranked by TIOBE, *none* of them allow hyphens in identifiers as far as I can tell, and many of them (e.g., C, Python, MATLAB) do not allow *any* ASCII punctuation in identifiers other than the underscore, not even the dollar sign $.

Does this mean that it would be impossible to create valid identifiers in these popular programming languages for characters with non-medial hyphens (or U+1180 HANGUL JUNGSEONG O-E), contrary to the Standard’s claim in § 4.8?

One system of making valid identifiers in those languages is to make the underscore equivalent to hyphen-minus and then use camel case on space-separated words. For instance:
hangulJungseongOE for U+116C HANGUL JUNGSEONG OE,
hangulJungseongO_E for U+1180 HANGUL JUNGSEONG O-E,
tibetanLetterA for U+0F68 TIBETAN LETTER A,
tibetanLetter_A for U+0F60 TIBETAN LETTER -A.
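
As a rough illustration of this first system, here is a minimal Python sketch. The helper name camel_identifier is made up for this example, and it writes OE as Oe, whereas the examples above keep short letter names such as OE in capitals by hand.

```python
def camel_identifier(name: str) -> str:
    """Hypothetical helper: camel-case the space-separated words of a
    character name and write every hyphen-minus as an underscore."""
    words = []
    for word in name.split(" "):
        # Capitalise each hyphen-separated part, then mark the hyphens with "_".
        words.append("_".join(part.capitalize() for part in word.split("-")))
    ident = "".join(words)
    # Lower-case the very first character to get lowerCamelCase.
    return ident[:1].lower() + ident[1:]


for name in ("HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E",
             "TIBETAN LETTER A", "TIBETAN LETTER -A"):
    print(f"{camel_identifier(name)} for {name}")
```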

A second, albeit clunky, method is to make the double underscore equivalent to a space then hyphen-minus (or vice versa) and then use single underscores between space-separated words. For instance:
Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE,
Hangul_Jungseong_O__E for U+1180 HANGUL JUNGSEONG O-E,
Tibetan_Letter_A for U+0F68 TIBETAN LETTER A,
Tibetan_Letter__A for U+0F60 TIBETAN LETTER -A.
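
A minimal Python sketch of this second system follows. The helper name underscore_identifier is made up for this example, it assumes that a hyphen-minus together with any adjacent space becomes the double underscore, and it writes OE as Oe rather than keeping it in capitals.

```python
import re


def underscore_identifier(name: str) -> str:
    """Hypothetical helper: single "_" for a space, "__" for a hyphen-minus
    (together with any space directly before or after it)."""
    # Capitalise each run of letters and digits: "HANGUL" becomes "Hangul".
    ident = re.sub(r"[A-Z0-9]+", lambda m: m.group().capitalize(), name)
    # A hyphen-minus, plus any adjacent space, becomes a double underscore.
    ident = re.sub(r" ?- ?", "__", ident)
    # Every remaining space becomes a single underscore.
    return ident.replace(" ", "_")


for name in ("HANGUL JUNGSEONG OE", "HANGUL JUNGSEONG O-E",
             "TIBETAN LETTER A", "TIBETAN LETTER -A"):
    print(f"{underscore_identifier(name)} for {name}")
```

In this sketch a space before the hyphen and a space after it fold into the same double underscore, so two made-up names such as "X -Y" and "X- Y" would both become X__Y.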

Lastly, if the programming language allows the dollar sign $ to be in identifiers, as several such as Java and JavaScript do, then the dollar sign could be used instead of the underscore:
hangulJungseongOE for U+116C HANGUL JUNGSEONG OE,
hangulJungseongO$E for U+1180 HANGUL JUNGSEONG O-E,
tibetanLetterA for U+0F68 TIBETAN LETTER A,
tibetanLetter$A for U+0F60 TIBETAN LETTER -A.
…or:
Hangul_Jungseong_OE for U+116C HANGUL JUNGSEONG OE,
Hangul_Jungseong_O$E for U+1180 HANGUL JUNGSEONG O-E,
Tibetan_Letter_A for U+0F68 TIBETAN LETTER A,
Tibetan_Letter_$A for U+0F60 TIBETAN LETTER -A.
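
Whether the dollar sign is usable depends entirely on the language. As a quick check for Python only, the built-in str.isidentifier() can be asked about each candidate spelling. The function name check_identifiers is made up for this example, and the result says nothing about other languages.

```python
def check_identifiers(candidates):
    """Hypothetical helper: map each candidate string to whether Python
    would accept it as an identifier."""
    return {candidate: candidate.isidentifier() for candidate in candidates}


results = check_identifiers([
    "hangulJungseongO_E",     # underscore for the hyphen
    "Hangul_Jungseong_O__E",  # double underscore for the hyphen
    "hangulJungseongO$E",     # dollar sign for the hyphen
    "Tibetan_Letter_$A",      # dollar sign after an underscore
    "HANGUL-JUNGSEONG-O-E",   # hyphens kept as they are
])
for candidate, ok in results.items():
    print(f"{candidate}: {'valid' if ok else 'not valid'} in Python")
```

Only the two underscore spellings come back as valid; the dollar-sign and hyphen spellings do not.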

Unfortunately, the first and second systems are not compatible with loose matching as prescribed by UAX44-LM2, so I daresay that they are not what the Standard’s claim in § 4.8 has in mind. (The second system also assumes that there are no two characters whose names differ only by switching the positions of a space and an adjacent hyphen, which cannot be guaranteed forever without a stability policy.) But the third system is not possible in numerous popular programming languages (C, Python, etc.). How is the Standard’s system in § 4.8 supposed to encode names with non-medial hyphens (or U+1180’s name)?

…Oh, wait, I get it. This system is not supposed to necessarily be compatible with standard loose matching. I had the impression that they were supposed to be compatible, but rereading the original paragraph shows that they don’t actually mention loose matching, which is explained elsewhere in the chapter. That’s unfortunate.
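
To make the mismatch concrete, here is a minimal Python sketch under several assumptions. The names to_identifier, canonicalise and get_character are made up for this example, and the table holds only the four characters discussed in this thread. A medial hyphen is taken to mean a hyphen-minus with a letter or digit directly on both sides, and the U+1180 check follows the tactic suggested in the quoted reply below. It is not a full implementation of UAX44-LM2.

```python
import re

NAMES = {
    0x0F60: "TIBETAN LETTER -A",
    0x0F68: "TIBETAN LETTER A",
    0x116C: "HANGUL JUNGSEONG OE",
    0x1180: "HANGUL JUNGSEONG O-E",
}


def to_identifier(name: str) -> str:
    """The strategy quoted from § 4.8: each space or hyphen-minus becomes one "_"."""
    return name.replace(" ", "_").replace("-", "_")


def canonicalise(name: str) -> str:
    """Drop medial hyphens, whitespace and underscores, then lower-case."""
    folded = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    return re.sub(r"[\s_]+", "", folded).lower()


# U+1180 is left out of the generic table because its canonical form would
# collide with U+116C; the form that keeps its hyphen is registered instead.
TABLE = {canonicalise(n): cp for cp, n in NAMES.items() if cp != 0x1180}
TABLE["hanguljungseongo-e"] = 0x1180


def get_character(name: str):
    """Return a loosely matching code point from the toy table, or None."""
    cp = TABLE.get(canonicalise(name))
    # If the input spelled the name with a literal "O-E", it means U+1180.
    if cp == 0x116C and re.search(r"[oO]-[eE][-_ ]*$", name):
        cp = 0x1180
    return cp


for cp, name in NAMES.items():
    ident = to_identifier(name)
    print(f"U+{cp:04X} {ident} -> U+{get_character(ident):04X}")
# U+0F60 TIBETAN_LETTER__A -> U+0F68
# U+0F68 TIBETAN_LETTER_A -> U+0F68
# U+116C HANGUL_JUNGSEONG_OE -> U+116C
# U+1180 HANGUL_JUNGSEONG_O_E -> U+116C
```

The four identifiers are distinct as strings, which is all that § 4.8 promises, but feeding them back through loose matching sends two of them to the wrong character.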

Thanks again for your help.

> On Jan 18, 2019, at 7:53 PM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Thu, 17 Jan 2019 18:44:50 -0500
> "J. S. Choi" via Unicode <unicode at unicode.org> wrote:
> 
>> I’m implementing a Unicode names library. I’m confused about loose
>> character-name matching, even after rereading The Unicode Standard §
>> 4.8, UAX #34 § 4, #44 § 5.9.2 – as well as
>> [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt),
>> [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and
>> the [meeting in which those two items were
>> resolved](https://www.unicode.org/L2/L2014/14026.htm).
>> 
>> In particular, I’m confused by the claim in The Unicode Standard §
>> 4.8 saying, “Because Unicode character names do not contain any
>> underscore (“_”) characters, a common strategy is to replace any
>> hyphen-minus or space in a character name by a single “_” when
>> constructing a formal identifier from a character name. This strategy
>> automatically results in a syntactically correct identifier in most
>> formal languages. Furthermore, such identifiers are guaranteed to be
>> unique, because of the special rules for character name matching.”
> 
> Unfortunately, the loose matching rules don't distinguish '__' and
> '_'.  Note that '__' is sometimes forbidden in identifiers.
> 
>> I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.
>> 
>> To make these issues concrete, let’s say that my library provides a
>> function called getCharacter that takes a name argument, tries to
>> find a loosely matching character, and then returns it (or a null
>> value if there is no currently loosely matching character). So then
>> what should the following expressions return?
>> 
> Loose matching of names may be looser than prescribed; it shall not be
> stricter.
> 
>> getCharacter(“HANGUL-JUNGSEONG-O-E”)
> U+1180 HANGUL JUNGSEONG O-E, or just possibly null.
> 
>> getCharacter(“HANGUL_JUNGSEONG_O_E”)
> U+116C HANGUL JUNGSEONG OE*
> 
>> getCharacter(“HANGUL_JUNGSEONG_O_E_”)
> U+116C
> 
>> getCharacter(“HANGUL_JUNGSEONG_O__E”)
> U+116C
> 
>> getCharacter(“HANGUL_JUNGSEONG_O_-E”)
> U+1180
> 
>> getCharacter(“HANGUL JUNGSEONGCHARACTERO E”)
> null or U+116C - up to you.  The sequence 'CHARACTER' shall not
> distinguish names, but loose matching is not required to know this fact.
> 
>> getCharacter(“HANGUL JUNGSEONG CHARACTER OE”)
> null or U+116C - up to you.
> 
>> getCharacter(“TIBETAN_LETTER_A”)
> U+0F68 TIBETAN LETTER A
> 
>> getCharacter(“TIBETAN_LETTER__A”)
> U+0F68 TIBETAN LETTER A**
> 
>> getCharacter(“TIBETAN_LETTER _A”)
> U+0F68
> 
>> getCharacter(“TIBETAN_LETTER_-A”)
> U+0F60 TIBETAN LETTER -A
> 
> *This is unfortunate, as the usual symbolic name for U+1180 would be
> HANGUL_JUNGSEONG_O_E.
> 
> **This is also unfortunate, as the usual symbolic
> name for U+0F60 would be TIBETAN_LETTER__A.
> 
> The key problem here is that the hyphen after a space is required in
> names as understood by the name property.  The hyphen is also required
> in  "HANGUL JUNGSEONG O-E".  The simple tactic is:
> 
> 1)      Canonicalise, by stripping out spaces, underscores and medial
> hyphens and lowercasing.  (It's probably better to fold the character
> U+0131 LATIN SMALL LETTER DOTLESS I to 'i'.)
> 
> 2)      Look the result up.
> 
> 3)      If you get the result U+116C but the input matches
> ".*[oO]-[eE][-_ ]*$", convert to U+1180.
> 
> Symbolic identifiers in programs need not match the name; one may
> choose to depend on the compiler or interpreter to catch duplicates;
> some will, some won't.  Replacing '-' by '_' to convert a name to an
> identifier loses the distinction between a hyphen and an arbitrarily
> inserted space.
> 
> Richard.
> 



