Canonical block names: spaces vs. underscores
kenwhistler at att.net
Thu May 26 13:07:14 CDT 2016
On 5/26/2016 10:05 AM, Mathias Bynens wrote:
>> On 26 May 2016, at 17:47, Mark Davis ☕️ <mark at macchiato.com> wrote:
>> The canonical property and property value formats are in the *Alias* files.
> Thanks for confirming!
Well, not quite... See below.
> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.
There's always a chance, I guess. But if we did so, we'd end up having
to just invent some other more-or-less ad hoc property:
Block_Name_Usable_For_Display, with the values we already have in the
Blocks.txt file. Or we would have to change the format to include the
block short alias as an additional field in the file, which would have
its own maintenance and consistency issues. Or we would be introducing
a historical inconsistency in the UCD between versions, which would
*complicate* certain other scripts that parse the UCD.
>> On 26 May 2016, at 18:03, Ken Whistler <kenwhistler at att.net> wrote:
>> […] "canonical block name" is not a defined term in the standard.
> I didn’t mean to imply it was — it’s just an English word. I meant “canonical” as in “without loose matching applied”.
Ah, but "canonical" is a very freighted word in Unicode parlance. There
are 58 instances of the word "canonical" in the current version of
UAX #44, Unicode Character Database. Every one of them is a term of
art, and none of them means what you mean.
What are actually in PropertyValueAliases.txt are "preferred aliases"
(one "abbreviated" and one "long"), plus a few "other aliases" for
compatibility.
UAX #42 follows suit. The block property is represented by the blk
attribute, and the enumerated values of the blk attribute use the
*abbreviated* "preferred aliases" from PropertyValueAliases.txt.
>> For enumerated properties, and especially for catalog properties such as Block and Script,
>> the value of the property may be multi-word, and the best form to use in one context might
>> not be exactly (as in binary string equality exact) the same as in another.
> That makes sense, but shouldn’t it be consistent throughout the Unicode database text files?
Well, let's take an example. The entry in Blocks.txt for the Arabic
Presentation Forms-A block is:
FB50..FDFF; Arabic Presentation Forms-A
The entry for that block in PropertyValueAliases.txt is:
blk; Arabic_PF_A ; Arabic_Presentation_Forms_A
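To make the relationship between those two entries concrete, here is a minimal Python sketch of loose matching for property value names, assuming the UAX44-LM3 rule (ignore case, whitespace, underscores, hyphens, and an initial "is" prefix). The function name `loose_key` is hypothetical and not part of any Unicode tooling.

```python
import re


def loose_key(name: str) -> str:
    """Fold a property value name for loose comparison (sketch of UAX44-LM3)."""
    # Drop whitespace, underscores and hyphens, then ignore case.
    key = re.sub(r"[\s_\-]+", "", name).lower()
    # UAX44-LM3 also ignores an initial "is" prefix.
    if key.startswith("is"):
        key = key[2:]
    return key


blocks_txt_name = "Arabic Presentation Forms-A"   # from Blocks.txt
long_alias = "Arabic_Presentation_Forms_A"        # from PropertyValueAliases.txt
short_alias = "Arabic_PF_A"                       # from PropertyValueAliases.txt

# The Blocks.txt name and the long alias fold to the same key.
print(loose_key(blocks_txt_name) == loose_key(long_alias))   # True
# The abbreviated alias does not; it needs a lookup in the alias file.
print(loose_key(blocks_txt_name) == loose_key(short_alias))  # False
```

Under this assumption, a script can relate the Blocks.txt spelling to the long alias by folding both, while mapping to the abbreviated alias still requires reading PropertyValueAliases.txt.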
So then which would it be? Should Blocks.txt be changed to the long
preferred alias:
FB50..FDFF; Arabic_Presentation_Forms_A
or to the abbreviated preferred alias:
FB50..FDFF; Arabic_PF_A
which would be more consistent with the XML attribute and with most […]
If the latter, you would end up with systematically less identifiable
labels in Blocks.txt, which would make it a bit more obscure for other
uses, and which would create ambiguities about what might be the "best"
or "preferred" label for blocks for an API returning a block name --
which certainly wouldn't be the abbreviated "preferred alias".
I suppose a proposal to the UTC to further modify the UCD handling of
block names could change this situation. But I'm not convinced that we
shouldn't leave things as they stand -- for stability. And then live
with the […] for scripts or other parsing algorithms that actually need
to deal with Blocks.txt to either parse out block ranges (its main
function) or to get usable block names (its subsidiary function).
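For readers who want to see both functions side by side, here is a minimal Python sketch of a Blocks.txt reader. It assumes the line format shown above (a `start..end` hexadecimal range, a semicolon, then the block name) and `#` comment lines; the names `Block`, `parse_blocks`, and `block_name` are hypothetical, and the sample text contains only the entry quoted in this message.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    start: int
    end: int
    name: str


def parse_blocks(text: str) -> List[Block]:
    """Parse Blocks.txt-style text into (start, end, name) records."""
    blocks = []
    for line in text.splitlines():
        # Strip comments and surrounding whitespace; skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code_range, name = (field.strip() for field in line.split(";", 1))
        start, end = code_range.split("..")
        blocks.append(Block(int(start, 16), int(end, 16), name))
    return blocks


def block_name(blocks: List[Block], cp: int) -> Optional[str]:
    """Return the display name of the block containing cp, or None."""
    for block in blocks:
        if block.start <= cp <= block.end:
            return block.name
    return None


SAMPLE = """\
# Sample in the Blocks.txt line format
FB50..FDFF; Arabic Presentation Forms-A
"""

blocks = parse_blocks(SAMPLE)
print(block_name(blocks, 0xFB50))  # Arabic Presentation Forms-A
print(block_name(blocks, 0x0041))  # None (not covered by the sample)
```

The name is kept exactly as it appears in the file, so it stays usable for display; any comparison against aliases can fold both sides separately.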