Canonical block names: spaces vs. underscores

Thu May 26 13:07:14 CDT 2016

On 5/26/2016 10:05 AM, Mathias Bynens wrote:
>> On 26 May 2016, at 17:47, Mark Davis ☕️ <mark at macchiato.com> wrote:
>>
>> The canonical property and property value formats are in the *Alias* files.
> Thanks for confirming!

Well, not quite... See below.

>
> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

There's always a chance, I guess. But if we did so, we'd end up having to
just invent some other more-or-less ad hoc property:
Block_Name_Usable_For_Display, with the values we already have in the
Blocks.txt file. Or we would have to change the format to include the block
short alias as an additional field in the file, which would have its own
maintenance and consistency issues. Or we would be introducing a historical
inconsistency in the UCD between versions, which would *complicate* certain
other scripts that parse the UCD.

>
>> On 26 May 2016, at 18:03, Ken Whistler <kenwhistler at att.net> wrote:
>>
>> […] "canonical block name" is not a defined term in the standard.
> I didn’t mean to imply it was — it’s just an English word. I meant “canonical” as in “without loose matching applied”.

Ah, but "canonical" is a very freighted word in Unicode parlance. There are
58 instances of the word "canonical" in the current version of UAX #44,
Unicode Character Database. Every one of them is a term of art, and none of
them means what you mean there. ;-)

What are actually in PropertyValueAliases.txt are "preferred aliases" (one
"abbreviated", and one "long"), plus a few "other aliases" for various
compatibility reasons.

UAX #42 follows suit. The block property is represented by the blk
attribute, and the enumerated values of the blk attribute:

http://www.unicode.org/reports/tr42/#w1aac13c13c19b1

use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt.

>
>> For enumerated properties, and especially for catalog properties such as Block and Script,
>> the value of the property may be multi-word, and the best form to use in one context might
>> not be exactly (as in binary string equality exact) the same as in another.
> That makes sense, but shouldn’t it be consistent throughout the Unicode database text files?

Well, let's take an example. The entry in Blocks.txt for the Arabic
Presentation Forms-A block is:

FB50..FDFF; Arabic Presentation Forms-A

The entry for that block in PropertyValueAliases.txt is:

blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A
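
To make the field layout of that entry concrete, here is a minimal Python
sketch that splits such a line into the abbreviated preferred alias, the
long preferred alias, and any other aliases. The function name
parse_alias_line and the dictionary keys are made up for illustration. It
assumes the property; abbreviated; long; other... layout of the blk entry
above; ccc lines in PropertyValueAliases.txt carry an extra numeric field
and are not handled here.

```python
def parse_alias_line(line):
    """Split one PropertyValueAliases.txt data line into its fields.

    Hypothetical helper: assumes the layout of the blk entry quoted above
    (property; abbreviated alias; long alias; zero or more other aliases).
    """
    # Drop any trailing "#" comment and surrounding whitespace.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None  # blank line or comment-only line
    fields = [field.strip() for field in data.split(";")]
    prop, abbreviated, long_alias, *others = fields
    return {
        "property": prop,
        "abbreviated": abbreviated,
        "long": long_alias,
        "other": others,
    }


sample = ("blk; Arabic_PF_A                      "
          "; Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A")
entry = parse_alias_line(sample)
print(entry["abbreviated"], entry["long"], entry["other"])
```

Note that none of the three aliases is byte-for-byte equal to the name in
Blocks.txt, which uses spaces.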

So then which would it be? Should Blocks.txt be changed to the long
preferred alias:

FB50..FDFF; Arabic_Presentation_Forms_A

or to the abbreviated preferred alias:

FB50..FDFF; Arabic_PF_A

which would be more consistent with the XML attribute and with most regex
usage? If the latter, you would end up with systematically less identifiable
labels in Blocks.txt, which would make it a bit more obscure for other uses,
and which would also then create ambiguities about what might be the "best"
or "preferred" label for blocks for an API returning a block name -- which
certainly wouldn't be the abbreviated "preferred alias".

I suppose a proposal to the UTC to further modify the UCD handling of block
names could change this situation. But I'm not convinced that we shouldn't
just leave things as they stand -- for stability. And then live with the
complications required for scripts or other parsing algorithms that actually
need to deal with Blocks.txt to either parse out block ranges (its main
function) or to get usable block names (its subsidiary function).
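
As a rough illustration of what those complications look like for a script,
the following Python sketch parses block ranges out of Blocks.txt-style lines
and pairs each display name with a long preferred alias by comparing
loose-matching keys. The helper names parse_blocks and loose_key are
hypothetical. The key ignores case, whitespace, underscores, and hyphens, in
the spirit of UAX44-LM3; the "is" prefix rule of LM3 is left out as a
simplification. The input lists are tiny stand-ins for the real files.

```python
import re


def parse_blocks(lines):
    """Yield (start, end, name) for each data line in Blocks.txt format."""
    for line in lines:
        # Skip comments and blank lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        code_range, name = (part.strip() for part in data.split(";", 1))
        start, end = (int(cp, 16) for cp in code_range.split(".."))
        yield start, end, name


def loose_key(value):
    """Simplified loose-matching key: drop whitespace, '_' and '-', fold case."""
    return re.sub(r"[\s_\-]+", "", value).lower()


# Stand-in data taken from the example entries above.
blocks_txt = [
    "# Blocks.txt style input",
    "FB50..FDFF; Arabic Presentation Forms-A",
]
long_aliases = ["Arabic_Presentation_Forms_A"]

alias_by_key = {loose_key(alias): alias for alias in long_aliases}
for start, end, name in parse_blocks(blocks_txt):
    alias = alias_by_key.get(loose_key(name))
    print(f"{start:04X}..{end:04X}  {name!r} -> {alias!r}")
```

With a lookup like this, a script can leave Blocks.txt as it stands and
still arrive at the alias it needs for regex or XML use.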

--Ken
