Canonical block names: spaces vs. underscores
kenwhistler at att.net
Thu May 26 11:03:20 CDT 2016
On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> Which is it?
Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike the normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings [...]
> If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?
See the matching rules in UAX #44, and in particular the matching rule
for symbolic values (UAX44-LM3), which applies in this case.
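As a rough illustration of that rule, here is a minimal Python sketch of loose matching for symbolic values. It assumes the rule amounts to ignoring case, whitespace, underscores, hyphens, and an initial "is" prefix; the function names `loose_key` and `loose_match` are made up for this example, and any special cases spelled out in UAX #44 are not handled.

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic property value to a comparison key.

    Simplified reading of UAX44-LM3 (an assumption of this sketch):
    ignore case, whitespace, underscores, hyphens, and an initial "is".
    """
    # Drop whitespace, underscores, and hyphens, then fold to lowercase.
    key = re.sub(r"[\s_\-]", "", value).lower()
    # Drop an initial "is" prefix, e.g. "Is_Greek" is treated like "Greek".
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """Return True if two spellings name the same value under the sketch."""
    return loose_key(a) == loose_key(b)
```

Under this sketch, `loose_match("Cyrillic Supplement", "Cyrillic_Supplement")` is true, so the spelling in Blocks.txt and the spelling in PropertyValueAliases.txt identify the same block.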
For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly the same (in the
sense of binary string equality) as in another.
For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an [...]
property values used in PropertyValueAliases.txt, on the other hand, are
turned into forms that are more identifier-friendly, as the typical
context of use for those values is in regular expressions and the like.
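To make the two spellings concrete, the following Python sketch parses a couple of lines in the Blocks.txt format and derives an identifier-friendly form from each name. The helper names `parse_blocks` and `identifier_form` are hypothetical, and the conversion (spaces and hyphens become underscores) is only an assumption that approximates the published aliases; PropertyValueAliases.txt remains the authoritative list.

```python
import re

# Two sample lines in the Blocks.txt format: "start..end; Block Name".
SAMPLE = """\
# Sample data for illustration only
0400..04FF; Cyrillic
0500..052F; Cyrillic Supplement
"""


def parse_blocks(text):
    """Return (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        # Strip comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code_range, name = (field.strip() for field in line.split(";", 1))
        start, end = (int(h, 16) for h in code_range.split(".."))
        blocks.append((start, end, name))
    return blocks


def identifier_form(name):
    """Approximate an identifier-friendly spelling (assumption of this sketch)."""
    return re.sub(r"[ \-]+", "_", name)
```

For example, `identifier_form("Cyrillic Supplement")` yields `Cyrillic_Supplement`, which is the spelling quoted from PropertyValueAliases.txt at the top of this thread.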
There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in [...]
given the application of that matching rule.
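A tool that maintains its own list of value names could check the same kind of uniqueness along these lines. This is a sketch only: `loose_key` is a simplified stand-in for Loose Matching Rule #3 (ignore case, whitespace, underscores, hyphens, and an initial "is"), `find_collisions` is a hypothetical helper, and the input is assumed to contain one spelling per distinct value.

```python
import re
from collections import defaultdict


def loose_key(value):
    """Simplified loose-matching key (an assumption of this sketch)."""
    key = re.sub(r"[\s_\-]", "", value).lower()
    return key[2:] if key.startswith("is") else key


def find_collisions(values):
    """Map each loose key shared by several distinct spellings to those spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[loose_key(value)].add(value)
    # Keep only keys that more than one distinct spelling reduces to.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}
```

An empty result means every name in the list stays distinct after loose matching is applied.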