Canonical block names: spaces vs. underscores

Philippe Verdy verdy_p at
Thu May 26 14:32:12 CDT 2016

2016-05-26 20:48 GMT+02:00 Mathias Bynens <mathias at>:

> > On 26 May 2016, at 20:07, Ken Whistler <kenwhistler at> wrote:
> Perhaps the “Note:” in the commented header in `Blocks.txt` could be
> extended to point out that the ~~canonical block names~~, nay, ++preferred
> block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve
> been enough to avoid the question that spawned this thread.

I'd say that the "preferred block aliases" should be stable and always
listed as the first entry.

The last entry should be the unabbreviated version preferred for display.
It need not be stable: it may change over time, and applications are free
to use better display names, including translations. This last entry should
be the form best suited to US English in a *technical* glossary, and the one
preferably used in Unicode documentation and proposals; it may differ for
British English or for vernacular names. For reference, though, the 1st
entry should not change.

Note also that the 1st entry in property aliases is not necessarily the
most abbreviated one: there may be other aliases in the middle of the list
using shorter names, provided that they don't conflict with others; or
special aliases used for specific lookups matching some pattern with
known prefixes/suffixes (e.g. Hangul syllable types), so that another
specification specific for this usage could simply drop those implied
prefixes/suffixes, using even shorter aliases internally than the listed

The rules for looking up aliases in PropertyAliases should be independent
of the property type:
- capitalization should be preserved, with lookups always case-sensitive,
even if the values listed for a property type currently use only ASCII
capital letters or only ASCII lowercase letters: the capitalization form
may need to be distinguished in some future version of the standard (without
having to use a broken orthography to distinguish them), and we should not
be using a slow UCA collator to match entries.
- only underscores and spaces should be considered equivalent, and there will
NEVER be special entries using leading or trailing underscores, pairs of
underscores, or pairs of whitespace characters (all aliases are assumed to
be trimmable and compressible, as in XML or HTML by default): applications
may then choose the "canonicalization" form they prefer (with underscores
or with spaces).
- some "camelCased" bijective transform could suppress spaces/underscores,
provided that the transform includes an "escaping" mechanism for case
distinctions; alternatively, we could also list conforming "camelCased"
aliases (from which lowercase-only aliases with ASCII hyphens could be
inferred, also with a bijective transform, for use in CSS selectors).
- however, some programming languages (e.g. BASIC) make no case
distinction for identifiers (and there's no easy escaping mechanism without
using separators like underscores, which should also not be used in leading
or trailing positions), or give the lettercase of the initial a special
meaning (e.g. several AI languages use it to distinguish variables from
atoms): there, the escaping mechanism may need to prepend a leading
underscore or some common prefix.
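
To make the lookup rules above concrete, here is a minimal Python sketch.
It illustrates only what is proposed in this message (case-sensitive
matching, spaces and underscores treated as equivalent, aliases trimmed and
compressed), not the loose matching that Unicode itself specifies. The
function names are hypothetical, and the camelCase/hyphen pair assumes
aliases made only of ASCII letters and digits, with no hyphens of their own.

```python
import re

# A run of spaces and/or underscores counts as a single separator.
_SEPARATOR_RUN = re.compile(r"[ _]+")


def normalize_alias(name: str, separator: str = "_") -> str:
    """Trim the alias and compress separator runs; letter case is preserved."""
    return _SEPARATOR_RUN.sub(separator, name.strip(" _"))


def same_alias(a: str, b: str) -> bool:
    """Case-sensitive comparison where only spaces/underscores are equivalent."""
    return normalize_alias(a) == normalize_alias(b)


def camel_to_hyphen(alias: str) -> str:
    """Map a camelCased alias to a lowercase form with ASCII hyphens.

    Assumption: the input holds only ASCII letters and digits (no hyphens),
    so each uppercase letter can be encoded as '-' plus its lowercase form.
    """
    return re.sub(r"[A-Z]", lambda m: "-" + m.group(0).lower(), alias)


def hyphen_to_camel(alias: str) -> str:
    """Inverse of camel_to_hyphen under the same assumption."""
    return re.sub(r"-([a-z])", lambda m: m.group(1).upper(), alias)


if __name__ == "__main__":
    print(normalize_alias("  Basic  Latin "))          # underscore form
    print(normalize_alias("Basic_Latin", " "))         # space form
    print(same_alias("Basic Latin", "Basic_Latin"))    # separators equivalent
    print(same_alias("Basic_Latin", "basic_latin"))    # case is significant
    print(hyphen_to_camel(camel_to_hyphen("cjkUnifiedIdeographs")))
```

An application would pick one canonical separator (underscore or space) and
store normalized keys, so that each lookup is a plain exact string
comparison with no collator involved.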