U+hhhh[h[h]] NAME syntax

lists+unicode at seantek.com
Sat Aug 13 10:12:06 CDT 2016


> On Aug 13, 2016, at 2:33 AM, Marcel Schneider <charupdate at orange.fr> wrote:
> 
> On Sat, 13 Aug 2016 09:29:05 +0200, Philippe Verdy wrote:
> […]
>> I see little interest to force anyone to use the U+NNNN NAME convention 
>> everywhere, as it is overlong and may instead obscure the discussions. Even 
>> when it is used, the NAME will be frequently abbreviated (such as dropping the 
>> script name prefix or common words such as LETTER or DIGIT). And given that 
>> character names are not case-significant, they will be frequently written 
>> using lowercase, or mixed case, or just by presenting the verbatim character 
>> itself.
>> 
> One advantage I see in using capitalized character names is in making them 
> unambiguously recognizable as identifiers, in order to prevent readers from 
> mistaking them for descriptors.
> 
> However I admit that I often unify casing pairs by dropping the CAPITAL and 
> SMALL attributes, as in LATIN LETTER AE, but it would be more accurate to write
> LATIN CAPITAL/SMALL LETTER AE. By contrast I wouldnʼt do that when referring to 
> the LATIN CAPITAL and SMALL LIGATURE OE, because the term “ligature” is an abusive 
> relic enforced by the ISO redactor at the time, and set back to “letter” in the 
> case of the Æ (as discussed last year). Here the advantage of using a translation
> is to be able to correct without risking confusion.
> 
> Another advantage is in highlighting the names against the surrounding text. 
> Avoiding uppercase—e.g. from people hating their Caps Lock toggle key, who 
> Iʼve read they do exist but are very uncommon in the country where we live—
> would need workarounds like using quotation marks, which in this context are 
> almost always misleading.
> 
> As for the U+ notational prefix in running text, I see it as extremely useful
> and I always apply it except, as Philippe states, in some tabular data, 
> which merely follows the pattern used in the NamesList (which Iʼm keeping 
> constantly open in my text editor). 
> 
> Using the U+ prefix throughout has the additional advantage of promoting 
> Unicode in the mind of people—an urgent challenge, […]

Thank you.

I have been reviewing draft-iab-rfc-nonascii-02 <https://tools.ietf.org/html/draft-iab-rfc-nonascii-02>, which formally opens the RFC series to UTF-8 encoded characters. (Look at the PDF version, which shows characters beyond the ASCII range.)

I was surprised that Section 3.4 provides no fewer than *six* notational alternatives, none of which conforms to Appendix A of TUS. There might be valid grammatical reasons to notate differently from Appendix A, but I would think that the Appendix A style, U+2206 INCREMENT, would be the best choice, as in:

   1.  Temperature changes in the Temperature Control Protocol are
       indicated by "Δ" U+2206 INCREMENT.

where U+ NAME replaces the part-of-speech “the XYZ character”, the character itself is quoted directly in front of the U+, and parentheses are not needed.
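
For anyone who wants to generate this notation mechanically, here is a minimal sketch in Python. It assumes the standard unicodedata module; the function name appendix_a and the "<unnamed>" fallback are my own inventions for illustration, not anything defined by TUS or the draft. It uses straight quotes to match the example above.

```python
import unicodedata


def appendix_a(ch: str, quote: bool = True) -> str:
    """Format one character as '"X" U+hhhh NAME' (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # At least four uppercase hex digits; five or six beyond the BMP.
    code = f"U+{ord(ch):04X}"
    # Characters without a formal name (e.g. controls) get a placeholder.
    name = unicodedata.name(ch, "<unnamed>")
    return f'"{ch}" {code} {name}' if quote else f"{code} {name}"


print(appendix_a("\u2206"))  # "∆" U+2206 INCREMENT
```

Passing quote=False gives the bare U+2206 INCREMENT form for contexts where the character itself cannot be shown.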

(I am actually in favor of curly quotes “Δ” in such a case, but that discussion should probably be had in the IETF.)

Interestingly, TUS 9.0.0 is not internally consistent, but the trend is that when the character is quoted, it is put in curly quotes and placed between the U+ code point and the NAME, as in:

Section 3.13

Uppercasing of U+00DF “ß” latin small letter sharp s to …

Section 5.21

U+2061 Ê function application has no effect on the text display…

(Note: the Ê character appears in TUS as f() in a box…I am copying and pasting the text directly on my Mac from Acrobat to Mail.app. And, obviously, it’s copying and pasting the small-caps in lowercase.)


In plain text, ALL-CAPS names are superior to mixed case or lowercase names. However, in stylized text, small-caps not only looks better but offers a more convenient visual and semantic way to differentiate the part-of-speech.

I may have to suggest that small-caps be added as a stylistic element to the new xml2rfc format, or that a new element be provisioned specifically to identify Unicode code points, which would automatically be styled appropriately for the output format (ALL-CAPS for plain text, stylized small-caps for marked-up text).
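
To make the idea concrete, here is a rough sketch in Python of what such an element might do when rendered. Everything in it is hypothetical: render_codepoint is a made-up name, the HTML span with font-variant: small-caps is just one possible styling, and none of it reflects an actual xml2rfc element or API.

```python
import html
import unicodedata


def render_codepoint(ch: str, markup: bool = False) -> str:
    """Render 'U+hhhh NAME' for plain text or HTML (hypothetical sketch)."""
    code = f"U+{ord(ch):04X}"
    # Raises ValueError for characters that have no formal name.
    name = unicodedata.name(ch)
    if not markup:
        # Plain text output: keep the ALL-CAPS name as is.
        return f"{code} {name}"
    # Marked-up output: lowercase the name and let CSS draw small caps.
    styled = html.escape(name.lower())
    return f'{code} <span style="font-variant: small-caps">{styled}</span>'


print(render_codepoint("\u00df"))
print(render_codepoint("\u00df", markup=True))
```

The point of a single element is that the author writes the code point once and each output format picks its own presentation.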

Sean



