Canonical block names: spaces vs. underscores

Ken Whistler kenwhistler at att.net
Thu May 26 11:03:20 CDT 2016



On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?
>
> If proper canonical block names

Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings normalize.

>   use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?

See the matching rules in UAX #44:

http://www.unicode.org/reports/tr44/#Matching_Rules

and in particular, the matching rule for symbolic values, which applies
in this case:

http://www.unicode.org/reports/tr44/#UAX44-LM3

For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly the same (in the
sense of binary string equality) as the best form in another.
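
As a rough illustration of what that matching rule implies, here is a minimal Python sketch of a UAX44-LM3 style comparison: ignore case, whitespace, underscores, hyphens, and an initial "is" prefix. The function names `loose_key` and `loose_match` are made up for this example, and the sketch does not cover any special cases beyond that summary of the rule.

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic property value to a comparison key.

    Approximates UAX44-LM3: drop whitespace, underscores, and hyphens,
    fold case, then drop an initial "is" prefix if one is present.
    """
    key = re.sub(r"[\s_\-]+", "", value).lower()
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """Return True if two symbolic values are equivalent under the sketch above."""
    return loose_key(a) == loose_key(b)


# The spelling from Blocks.txt and the one from PropertyValueAliases.txt
# compare as the same value, even though the strings differ.
print(loose_match("Cyrillic Supplement", "Cyrillic_Supplement"))
```

Under this kind of comparison, neither spelling needs to be treated as the canonical one; both reduce to the same key.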

For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an API. The property values used in
PropertyValueAliases.txt, on the other hand, are systematically turned
into forms that are more identifier-friendly, as the typical context of
use for those values is in regular expressions and the like.

There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in their namespace, given the application of that
matching rule.
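
To make the uniqueness guarantee concrete, the following Python sketch groups a list of values by their loosely matched key and reports any key shared by more than one distinct spelling. The helper names `loose_key` and `find_collisions` are hypothetical, the key function is only an approximation of UAX44-LM3, and the sample list is a handful of block names typed in by hand rather than read from Blocks.txt.

```python
import re
from collections import defaultdict


def loose_key(value: str) -> str:
    """Approximate UAX44-LM3: ignore case, whitespace, '_', '-', and a leading "is"."""
    key = re.sub(r"[\s_\-]+", "", value).lower()
    return key[2:] if key.startswith("is") else key


def find_collisions(values):
    """Map each loose key to the distinct spellings that share it (only if more than one)."""
    groups = defaultdict(set)
    for value in values:
        groups[loose_key(value)].add(value)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Hand-picked sample of block names (an assumption, not the full Blocks.txt list).
sample = ["Cyrillic", "Cyrillic Supplement", "Cyrillic Extended-A", "Latin-1 Supplement"]

# Distinct blocks stay distinct under loose matching.
print(find_collisions(sample))

# Two spellings of the same block collapse onto one key, as intended.
print(find_collisions(sample + ["Cyrillic_Supplement"]))
```

A check along these lines returns an empty result for a namespace that satisfies the guarantee, and only reports entries when two spellings denote the same value.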

--Ken




