U+hhhh[h[h]] NAME syntax

Asmus Freytag (c) asmusf at ix.netcom.com
Sat Aug 13 19:19:15 CDT 2016


On 8/13/2016 2:47 PM, Doug Ewell wrote:
> PDF is a presentation format. If the editorial committee sets 
> character names in lowercase "under the hood" so that they will end up 
> looking good in Minion smallcaps in the PDF file, and a user 
> subsequently scrapes the PDF file for content, it doesn't mean there's 
> anything formal or normative about setting character names in lowercase.
>

Character names, when presented in the Unicode Character Database, are 
uppercase. The general approach by Unicode is to define property names 
and values so that case distinctions are not needed to unambiguously 
resolve identifiers (the same goes for spaces and most hyphens). That 
means the presentation can be flexibly adapted to the style of the 
document (e.g. the Core Specification has a different style than other 
documents), yet still retain unambiguous identification of the character.
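
As a rough illustration of matching that ignores case, spaces, and most 
hyphens, here is a minimal Python sketch. The function name loose_key is 
hypothetical, and the rule is a simplification: it drops only hyphens 
that sit between two letters or digits, and it does not handle any 
special-case names that the actual Unicode matching rules may single out.

```python
import re


def loose_key(name: str) -> str:
    """Return a simplified comparison key for a character name.

    Assumption: case, spaces, underscores, and medial hyphens (those
    between two letters or digits) are not significant.
    """
    s = name.upper().replace("_", " ")
    # Drop a hyphen only when it is flanked by letters/digits on both sides,
    # so a leading hyphen as in "TIBETAN LETTER -A" is preserved.
    s = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", s)
    return s.replace(" ", "")


if __name__ == "__main__":
    print(loose_key("Zero-Width Space") == loose_key("ZERO WIDTH SPACE"))
    print(loose_key("TIBETAN LETTER -A") == loose_key("TIBETAN LETTER A"))
```

Under these assumptions, "zero width space", "Zero-Width Space" and 
"ZERO_WIDTH_SPACE" all reduce to the same key, while a name whose hyphen 
is not medial stays distinct.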

I believe that small caps generally look nice and distinctive. For HTML 
the way to do this is with a CSS style that allows the underlying text 
representation to be uppercase while showing lowercase small-cap 
letters. Marcel, I believe, gave an example, although something like 
this was used as early as Unicode 5.0 for the UAXs, when we printed them 
as part of the book.

For plain text, all caps is the easiest way to make the character name 
stick out and prevent misinterpretation of it as part of the surrounding 
text. The question then becomes how much of the character name to show 
and in which order.

I'm personally partial to U+nnnn (x) CHARACTER NAME. In some cases, this 
requires some edits to make the text flow, but it has the advantage of 
being unambiguous, and something that works well for characters of all 
scripts and categories, including marks and punctuation. In some 
instances U+nnnn (x) transliterated name works well. I like the use of 
( ) instead of " " (curly or not) because the latter is hopeless in 
showing any combining marks above (they get lost among the "").

However, notations like x (U+nnnn) also work pretty well, especially 
when all the "x" are from a distinct-looking script. The same goes for x 
CHARACTER NAME (U+nnnn). In many cases, there really isn't a need to 
quote the glyph, and not doing so can reduce clutter.
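
The notations discussed above can be sketched in a few lines of Python. 
This is only an illustration: the function name cite, its style labels 
and the show_glyph flag are hypothetical, and the names come from 
whatever Unicode version Python's unicodedata module ships with.

```python
import unicodedata


def cite(ch: str, style: str = "code-first", show_glyph: bool = True) -> str:
    """Format a single character in one of three plain-text notations."""
    code = f"U+{ord(ch):04X}"  # at least four hex digits, more if needed
    name = unicodedata.name(ch, "<no name>")  # names are already uppercase
    if style == "code-first":      # U+nnnn (x) CHARACTER NAME
        parts = [code, f"({ch})" if show_glyph else "", name]
    elif style == "glyph-code":    # x (U+nnnn)
        parts = [ch, f"({code})"]
    elif style == "glyph-name":    # x CHARACTER NAME (U+nnnn)
        parts = [ch, name, f"({code})"]
    else:
        raise ValueError(f"unknown style: {style}")
    # Skip the empty glyph slot when show_glyph is False.
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    for style in ("code-first", "glyph-code", "glyph-name"):
        print(cite("\u03b1", style))
    print(cite("\u00e9", show_glyph=False))
```

Passing show_glyph=False gives the uncluttered form without the quoted 
glyph; it only affects the code-first style in this sketch.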

In short, this isn't a one-size-fits-all kind of situation.

A./

