Canonical block names: spaces vs. underscores

Ken Whistler kenwhistler at
Thu May 26 11:03:20 CDT 2016

On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` lists blocks such as `Cyrillic Supplement`.
> However, `PropertyValueAliases.txt` refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> Which is it?
Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings

> If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?

See the matching rules in UAX #44:

and in particular, the matching rule for symbolic values, which applies 
in this case:

For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly the same (in the
sense of exact binary string equality) as in another.

For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an API. The property values used in
PropertyValueAliases.txt, on the other hand, are turned into forms that
are more identifier friendly, as the typical context of use for those
values is in regular expressions and the like.
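
As a rough illustration of why the two spellings are interchangeable
under the matching rule for symbolic values, here is a minimal Python
sketch. It assumes the rule amounts to ignoring case, whitespace,
underscores, hyphens, and an initial "is" prefix; the function names
`loose_key` and `loose_match` are made up for this example and are not
part of any Unicode data file or library.

```python
import re


def loose_key(value: str) -> str:
    """Fold a symbolic property value to a comparison key (illustrative sketch)."""
    # Drop whitespace, underscores, and hyphens, then ignore case.
    key = re.sub(r"[\s_-]+", "", value).lower()
    # Ignore an initial "is" prefix, if present.
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """Return True if two symbolic values compare equal after folding."""
    return loose_key(a) == loose_key(b)


# The Blocks.txt form and the PropertyValueAliases.txt form fold to the same key.
print(loose_match("Cyrillic Supplement", "Cyrillic_Supplement"))  # True
print(loose_match("Cyrillic", "Cyrillic_Supplement"))             # False
```

Comparing folded keys rather than raw strings means neither file's
spelling has to be treated as the "real" one.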

There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in their namespace, given the application of that
matching rule.
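
The uniqueness guarantee can be pictured as a check that no two distinct
names in a namespace fold to the same key. The sketch below is only an
illustration of that idea, not the actual process used to vet new
property values. The folding again assumes the rule ignores case,
whitespace, underscores, hyphens, and an initial "is" prefix; the names
`loose_key` and `find_collisions` are hypothetical, as is the
"Cyrillic-Supplement" entry, which is invented to provoke a clash.

```python
import re
from collections import defaultdict


def loose_key(value: str) -> str:
    """Fold a symbolic property value to a comparison key (illustrative sketch)."""
    key = re.sub(r"[\s_-]+", "", value).lower()
    return key[2:] if key.startswith("is") else key


def find_collisions(names):
    """Group names by folded key and return only the keys shared by 2+ names."""
    groups = defaultdict(set)
    for name in names:
        groups[loose_key(name)].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# A small sample of block names, for illustration only.
existing = ["Cyrillic", "Cyrillic Supplement",
            "Cyrillic Extended-A", "Cyrillic Extended-B"]

print(find_collisions(existing))
# {}
print(find_collisions(existing + ["Cyrillic-Supplement"]))  # hypothetical new name
# {'cyrillicsupplement': ['Cyrillic Supplement', 'Cyrillic-Supplement']}
```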

