Names for control characters (Was: "(in 6429)" in allkeys.txt)

Whistler, Ken ken.whistler at sap.com
Wed Mar 12 11:48:25 CDT 2014


Please be very careful here. Having a non-empty value in field 1 of
UnicodeData.txt is *not* the same as "having a Unicode name".

See:

http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

for the gory details.

The "Unicode name" is formally defined in terms of the Name property,
which itself is a combination of enumerated values extracted from
UnicodeData.txt, plus a number of rules.

For all characters whose General_Category=Cc, the formal definition
of the Name property is a null string. The string "<control>" is *never*
to be interpreted as a "Unicode name". It is a field placeholder with
legacy status. See "Interpretation of Field 1 of UnicodeData.txt" in
the section I cited above.
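
A minimal Python sketch of that distinction, assuming Python's standard
unicodedata module as a stand-in for the Name property (the module is not
part of the Standard, and the helper name unicode_name is purely
illustrative):

```python
import unicodedata

def unicode_name(ch):
    """Return the character name as reported by unicodedata,
    or an empty string when the character has no name."""
    return unicodedata.name(ch, "")

# U+0007 has General_Category=Cc, so no name is reported for it;
# "<control>" never shows up as a name.
print(unicodedata.category("\u0007"))   # Cc
print(repr(unicode_name("\u0007")))     # ''

# U+1F514 carries the actual Name property value BELL.
print(unicode_name("\U0001F514"))       # BELL
```

Without the default argument, unicodedata.name() raises ValueError for
U+0007 rather than returning "<control>".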

As for user interfaces and other applications needing "names" for
Unicode control characters: one of the reasons that the namespace
for Unicode characters includes all of the formal name aliases provided
in NameAliases.txt is so that applications can safely treat any formal
name alias for a control character (or any of the other abbreviations,
etc., also listed in NameAliases.txt) *as if* it were a Unicode name,
without running into name collisions with the actual Name property
values of Unicode characters.

The history of the name collision for the (relatively) recently encoded
U+1F514 BELL with the traditional usage for the U+0007 control function
"BELL" led the UTC to extend the namespace as noted, so we won't be
running into more such problems in the future.

If Emacs were to use "ALERT" or the abbreviation "BEL" for U+0007,
instead of "<control>", that would avoid the collision with U+1F514 BELL,
be conformant to the Unicode Standard, and presumably be helpful
to users, as well. See the entries for U+0007 in NameAliases.txt:

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation
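
As a rough sketch of how an application might pick such a label, the
following Python parses lines in the NameAliases.txt format shown above.
The sample data is limited to the two U+0007 entries quoted here; the
function names, the preference order, and the "U+XXXX" fallback are
assumptions made for illustration, not anything the Standard prescribes.

```python
# The two U+0007 entries quoted above: code point; alias; type
NAME_ALIASES_SAMPLE = """\
# Sample lines only, not the full file.
0007;ALERT;control
0007;BEL;abbreviation
"""

def parse_name_aliases(text):
    """Map code point -> list of (alias, type), skipping comments and blanks."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, alias, alias_type = line.split(";")
        aliases.setdefault(int(code, 16), []).append((alias, alias_type))
    return aliases

def display_label(cp, aliases, preferred=("control", "abbreviation")):
    """Return the first alias whose type matches the preference order."""
    for wanted in preferred:
        for alias, alias_type in aliases.get(cp, []):
            if alias_type == wanted:
                return alias
    return "U+%04X" % cp  # hypothetical fallback when no alias is listed

aliases = parse_name_aliases(NAME_ALIASES_SAMPLE)
print(display_label(0x0007, aliases))                              # ALERT
print(display_label(0x0007, aliases, preferred=("abbreviation",)))  # BEL
```

Because only formal aliases are consulted, the label for U+0007 can never
come out as "BELL", so it cannot be confused with U+1F514.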

--Ken


> > Regarding these names in ISO 6429 again, how come these control
> > characters don't have Unicode names?
> 
> They have a non-empty "old name" field:
> 
>   0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
>                                ^^^^




