Canonical block names: spaces vs. underscores

Philippe Verdy verdy_p at
Thu May 26 13:44:55 CDT 2016

2016-05-26 20:07 GMT+02:00 Ken Whistler <kenwhistler at>:

> Well, let's take an example. The entry in Blocks.txt for the Arabic
> Presentation Forms-A block is:
> FB50..FDFF; Arabic Presentation Forms-A
> The entry for that block in PropertyValueAliases.txt is:
> blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ;
> Arabic_Presentation_Forms-A
> So then which would it be? Should Blocks.txt be changed to the long
> preferred alias:
> FB50..FDFF; Arabic_Presentation_Forms_A
> or to the abbreviated preferred alias:
> FB50..FDFF; Arabic_PF_A

I think that this would break parsers that expect the alias used in
Blocks.txt to be directly "readable" with spaces. My opinion is to keep
Blocks.txt untouched (with spaces), as it has been part of the core standard
for a long time (and is in sync with the ISO standard), as the *normative*
block name.

But we could add this normative value (with spaces) to
PropertyValueAliases.txt (which ISO 10646 does not have or need):

blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ;
Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A

The other solution would be to *add* the abbreviated preferred alias in
Blocks.txt:

FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A

But this could break existing Blocks.txt parsers, whereas parsers should not
fail when they find new aliases in PropertyValueAliases.txt.
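
To illustrate the parser concern, here is a minimal Python sketch of a
Blocks.txt line reader that tolerates extra trailing fields. The function
name parse_blocks_line is hypothetical, and the three-field form is only the
variant suggested in this message, not an existing file format:

```python
def parse_blocks_line(line):
    """Parse one Blocks.txt-style line: 'START..END; Name[; extra ...]'.

    Returns (start, end, name, extras), or None for blank/comment lines.
    Extra fields are kept rather than rejected, so a third field such as
    an abbreviated alias (hypothetical here) would not break the parser.
    """
    # Drop any trailing '#' comment and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split(";")]
    # The code point range is written as two hex numbers joined by '..'.
    start_hex, end_hex = fields[0].split("..")
    return int(start_hex, 16), int(end_hex, 16), fields[1], fields[2:]


current = parse_blocks_line("FB50..FDFF; Arabic Presentation Forms-A")
proposed = parse_blocks_line(
    "FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A"
)
```

A stricter parser that splits on the first ";" and takes the whole remainder
as the name would instead return "Arabic Presentation Forms-A ; Arabic_PF_A"
as the block name, which is the kind of breakage meant here.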

Another solution would be to properly explain that to look up values in
PropertyValueAliases.txt, you can search by replacing spaces in block names
with underscores, or make sure that underscores and spaces in the *middle* of
values are considered equivalent (so that even if they are rendered
visually, we can also display the listed aliases using spaces instead of
underscores).

However, it must be clear that these aliases are case-sensitive by default
("Arabic_Presentation_Forms_A" is not the same as
"Arabic_presentation_forms_A" but is the same as "Arabic Presentation_Forms
A"), unless the block name property is normatively said to be
case-insensitive (in that case the following are also aliases:
"arabic_pf_a", "arabic pf a"). But adding case insensitivity has a cost,
which is much higher than *only* allowing basic replacements of spaces and
underscores (this will work, provided that there are no "special" aliases
starting with underscores or using pairs of underscores: I doubt ISO will
use pairs of spaces in block names, which are supposed to be trimmed, with
whitespace in the middle compressed).
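
The space/underscore equivalence described above can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
names block_name_key, ALIASES and find_block are hypothetical, the alias
table holds just the one entry quoted in this thread, and the comparison
stays case-sensitive:

```python
# Aliases for one block, as quoted from PropertyValueAliases.txt above.
# Key: abbreviated alias; values: every alias that should resolve to it.
ALIASES = {
    "Arabic_PF_A": [
        "Arabic_PF_A",
        "Arabic_Presentation_Forms_A",
        "Arabic_Presentation_Forms-A",
    ],
}


def block_name_key(name):
    """Trim, compress inner whitespace, and turn each gap into one underscore.

    Case is deliberately left untouched, so the lookup is case-sensitive.
    """
    return "_".join(name.split())


def find_block(name):
    """Return the abbreviated alias matching `name`, or None if no match."""
    key = block_name_key(name)
    for short, names in ALIASES.items():
        if key in names:
            return short
    return None


mixed = block_name_key("Arabic Presentation_Forms A")
from_blocks_txt = find_block("  Arabic  Presentation Forms-A ")
lowercase = find_block("arabic pf a")
```

Here mixed becomes "Arabic_Presentation_Forms_A", the Blocks.txt spelling
resolves to "Arabic_PF_A", and the lowercase form finds nothing because no
case folding is applied.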

Removing or replacing the space-separated words in block names in the UCD
would break compatibility and synchronization with the ISO standard,
which lists them with spaces.