Use of tag characters in emoji sequences (was: Re: Proposal for BiDi in terminal emulators)

Doug Ewell via Unicode unicode at unicode.org
Sat Feb 2 14:50:59 CST 2019


Philippe Verdy wrote:

> Actually not all U+E0020 through U+E007E are "un-deprecated" for this
> use.

Characters in Unicode are not "deprecated" for some purposes and not for others. "Deprecated" is a clearly defined property in Unicode. The only reference that matters here is in PropList.txt:

E0000         ; Other_Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Deprecated # Cf       LANGUAGE TAG
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>

Note carefully that the code point marked "Deprecated" is deprecated, and the others listed here are not. (My earlier post saying that U+E007F was still deprecated was incorrect, as Andrew noted.)
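
As a rough illustration, the excerpt above can be read mechanically. The following is a minimal sketch, assuming only the "range ; property # comment" line layout visible in the excerpt. The names parse_proplist, properties_of, and is_deprecated are hypothetical helpers, not part of any Unicode tooling, and the data is limited to the five lines quoted here rather than the full PropList.txt.

```python
# The five PropList.txt lines quoted above, and nothing else.
PROPLIST_EXCERPT = """\
E0000         ; Other_Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Deprecated # Cf       LANGUAGE TAG
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>
"""


def parse_proplist(text):
    """Return a list of (first, last, property) tuples (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_range, prop = (field.strip() for field in data.split(";"))
        first, _, last = cp_range.partition("..")
        # A single code point has no "..", so last falls back to first.
        entries.append((int(first, 16), int(last or first, 16), prop))
    return entries


def properties_of(cp, entries):
    """Return the set of properties listed for code point cp."""
    return {prop for first, last, prop in entries if first <= cp <= last}


ENTRIES = parse_proplist(PROPLIST_EXCERPT)


def is_deprecated(cp):
    """True only if the excerpt lists cp with the Deprecated property."""
    return "Deprecated" in properties_of(cp, ENTRIES)


print(is_deprecated(0xE0001))  # LANGUAGE TAG
print(is_deprecated(0xE007F))  # CANCEL TAG
```

Within this excerpt, only U+E0001 is reported as Deprecated; every code point in E0020..E007F carries Other_Grapheme_Extend instead.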

> For now emoji flags only use:
> - U+E0041 through U+E005A (mapping to ASCII letters A through Z used
> in 2-letter ISO3166-1 codes). These are usable in pairs, without
> requiring any modifier (and only for ISO3166-1 registered codes).

Section C.1 of UTS #51 says otherwise:

tag_base    U+1F3F4 BLACK FLAG
tag_spec    (U+E0030 TAG DIGIT ZERO .. U+E0039 TAG DIGIT NINE,
            U+E0061 TAG LATIN SMALL LETTER A .. U+E007A TAG LATIN SMALL LETTER Z)+

Emoji flags use lowercase tag letters, not uppercase, and may also use digits. The digits are for CLDR subdivision IDs containing ISO 3166-2 code elements that happen to be numeric, and there are plenty of these. For example, "fr75" is the subdivision ID for Paris. Almost all ISO 3166-2 code elements in France are numeric.
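
To make the mapping concrete, here is a minimal sketch of building a flag tag sequence from a subdivision ID. It assumes the tag_base and tag_spec quoted above, plus the terminating U+E007F CANCEL TAG. The function name subdivision_flag is hypothetical, and the sketch does not check that the ID is actually a valid CLDR subdivision ID.

```python
import re

TAG_BASE = "\U0001F3F4"    # U+1F3F4, the tag_base
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG, terminates the sequence
TAG_OFFSET = 0xE0000       # tag character = ASCII code point + 0xE0000


def subdivision_flag(subdivision_id):
    """Build the tag sequence for an ID such as "fr75" (hypothetical helper)."""
    # tag_spec allows only tag digits and lowercase tag letters.
    if not re.fullmatch(r"[0-9a-z]+", subdivision_id):
        raise ValueError("expected lowercase ASCII letters and digits only")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in subdivision_id)
    return TAG_BASE + tags + CANCEL_TAG


# Show the code points for the Paris example.
print(" ".join(f"U+{ord(ch):04X}" for ch in subdivision_flag("fr75")))
```

Uppercase input such as "FR75" is rejected rather than silently lowercased, to keep the lowercase requirement visible.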

> - I think that U+E0030 through U+E0039 (mapping to ASCII digits 0
> through 9) are reserved for ISO3166 extensions, started with only the
> 3 "countries" added in the United Kingdom ("ENENG", "ENSCO" and
> "ENWLS"), with possible pending additions for other ISO3166-2, but not
> mapping any dash separator).

There is no top-level country "EN", and if there were, I doubt Scotland and Wales would be enthusiastic to be considered part of it.

In any case, "gbeng", "gbsct", and "gbwls" are merely the only subdivision IDs that are designated "RGI," or "recommended for general interchange," in CLDR. Any other subdivision ID can be used in a flag tag sequence, although the lack of RGI designation may cause vendors to think the sequence is "recommended against" and not support it in fonts.

As shown above, tag digits are not reserved for "ISO 3166 extensions" (possibly implying ISO 3166-1), but are already usable for ISO 3166-2 code elements.

> These tags are used as modifiers in sequences starting by a leading
> U+1F3F4
> <http://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0065_e006e_e0067_e007f>
> (WAVING BLACK FLAG) emoji.

This is true. (Note the lowercase tag letters.)

> - U+E007F (CANCEL TAG) is already used too for the regional extensions
> as a mandatory terminator, as seen in the three British countries.

This is true.
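
For illustration, the structure described so far (U+1F3F4, then tag digits or lowercase tag letters, then U+E007F) can be checked in the other direction. This is a minimal sketch under those assumptions only. The name decode_flag_tag_sequence is hypothetical, and the function does not consult CLDR to see whether the decoded ID exists or is RGI.

```python
TAG_BASE = "\U0001F3F4"    # U+1F3F4
CANCEL_TAG = "\U000E007F"  # U+E007F CANCEL TAG


def decode_flag_tag_sequence(s):
    """Return the ASCII subdivision ID, or None if s is not well formed."""
    # Need at least base + one tag character + terminator.
    if len(s) < 3 or s[0] != TAG_BASE or s[-1] != CANCEL_TAG:
        return None
    chars = []
    for ch in s[1:-1]:
        cp = ord(ch)
        is_tag_digit = 0xE0030 <= cp <= 0xE0039
        is_tag_lower = 0xE0061 <= cp <= 0xE007A
        if not (is_tag_digit or is_tag_lower):
            return None  # uppercase tag letters, for instance, are rejected
        chars.append(chr(cp - 0xE0000))
    return "".join(chars)


wales = "\U0001F3F4\U000E0067\U000E0062\U000E0077\U000E006C\U000E0073\U000E007F"
print(decode_flag_tag_sequence(wales))
```

A sequence without the CANCEL TAG terminator, or one using uppercase tag letters, yields None.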

> It is not used for country flags made of 2-letter emoji codes without
> any leading flag emoji.

This is true, but not particularly relevant, as these use Regional Indicator Symbols and have nothing to do with the "emoji codes" discussed elsewhere.
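
For contrast, a minimal sketch of the Regional Indicator mechanism: each letter of a two-letter code maps to one of the 26 Regional Indicator Symbols starting at U+1F1E6, with no base emoji, tag characters, or terminator involved. The name country_flag is hypothetical, and the sketch does not check that the code is a registered ISO 3166-1 code.

```python
RI_BASE = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A


def country_flag(alpha2):
    """Map a two-letter uppercase code to a pair of Regional Indicators."""
    if len(alpha2) != 2 or not all("A" <= c <= "Z" for c in alpha2):
        raise ValueError("expected exactly two uppercase ASCII letters")
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in alpha2)


print(" ".join(f"U+{ord(ch):04X}" for ch in country_flag("GB")))
```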

> And the proposal discussed here to use U+E003C, mapped to the ASCII
> "<" LOWER THAN

LESS-THAN SIGN

> as a leading tag sequence for reencoding HTML tags in sequences
> terminated by U+E003E ">" (and containing HTML element names using
> lowercase letter tags,

Only "b", "i", "u", and "s" by definition.

> possibly digit tags in these names,

No.

> and "/" for HTML tags terminator, possibly also U+E0020 SPACE TAG for
> separating HTML attributes, U+E003D "=" for attribute values, U+E0022
> (") or U+E0027 (') around attribute values, but a problem if the
> mapped element names or attributes contain non-ASCII characters...)

None of these are part of Andrew's mechanism. It's just b, i, u, and s.

> is not standard

Neither Andrew nor anyone else claimed it was.

> (it's just an experiment in one font),

It applies to any TrueType font, because the rendering engine can apply these four styles (in any combination) to any TrueType font.

> and would in fact not be compatible with the existing specification
> for tags.

Good thing nobody claimed it was.

> So only U+E0020 through U+E0040, and U+E005B through U+E007E remain
> deprecated.

Da capo.

--
Doug Ewell | Thornton, CO, US | ewellic.org




