Canonical block names: spaces vs. underscores
Ken Whistler
kenwhistler at att.net
Thu May 26 11:03:20 CDT 2016
On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?
>
> If proper canonical block names
Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings normalize.
> use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?
>
>
>
See the matching rules in UAX #44:
http://www.unicode.org/reports/tr44/#Matching_Rules
and in particular, the matching rule for symbolic values, which applies
in this case:
http://www.unicode.org/reports/tr44/#UAX44-LM3
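
A minimal sketch of how that rule might be applied, in Python. UAX44-LM3
says to ignore case, whitespace, underscores, hyphens, and any initial
prefix string "is" when comparing symbolic values. The function names
`loose_key` and `loose_match` are hypothetical, and stripping the "is"
prefix after the other characters are removed is an assumption of this
sketch, not something the rule spells out:

```python
import re


def loose_key(value: str) -> str:
    """Reduce a symbolic property value to a comparison key (sketch of UAX44-LM3)."""
    # Drop whitespace, underscores, and hyphens, then fold case.
    key = re.sub(r"[\s_\-]+", "", value).lower()
    # Ignore an initial "is" prefix (assumption: applied after the step above).
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_match(a: str, b: str) -> bool:
    """Two values match loosely when their comparison keys are equal."""
    return loose_key(a) == loose_key(b)


# The Blocks.txt form and the PropertyValueAliases.txt form compare equal.
print(loose_match("Cyrillic Supplement", "Cyrillic_Supplement"))
```

Under this reading, the two spellings in the question are simply two
acceptable forms of the same value, and neither needs to be "the" name.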
For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly the same (in the
sense of binary string equality) as in another.
For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an API. The property values used in
PropertyValueAliases.txt, on the other hand, are systematically turned
into forms that are more identifier friendly, as the typical context of
use for those values is in regular expressions and the like.
There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in their namespace, given the application of that
matching rule.
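
To make the uniqueness guarantee concrete, here is a rough Python sketch
of the kind of check it implies: group candidate names by their loose
key and report any key shared by more than one name. The helper names
`loose_key` and `find_collisions` and the sample list are hypothetical;
this illustrates the idea and is not the consortium's actual tooling.

```python
import re
from collections import defaultdict


def loose_key(value: str) -> str:
    """Comparison key in the spirit of UAX44-LM3 (sketch, see assumptions above)."""
    key = re.sub(r"[\s_\-]+", "", value).lower()
    return key[2:] if key.startswith("is") else key


def find_collisions(names: list[str]) -> dict[str, list[str]]:
    """Return loose keys that are shared by more than one distinct name."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[loose_key(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


# Sample input only: two spellings of one block plus an unrelated block.
sample = ["Cyrillic Supplement", "Cyrillic_Supplement", "Greek and Coptic"]
print(find_collisions(sample))
```

In a real namespace of distinct blocks, the expectation is that such a
check finds nothing; the sample collides only because it deliberately
lists the same block in both spellings.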
--Ken