Tag characters

Doug Ewell doug at ewellic.org
Thu May 14 17:13:39 CDT 2015


http://www.unicode.org/L2/L2015/15107.htm

points indirectly to:

http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf

which says:

> The proposal has two parts
>
> 1. Un-deprecate TAG characters E0020-E007E.

Hee hee.

> 2. Define a character as the “base” for a following sequence of
> TAG characters that specifies a region or subregion to be
> represented using a sequence of TAG characters. There are two
> possibilities for the base character:
>
> a. Preferred: Use the Unicode 7.0 character WAVING WHITE FLAG:
> 1F3F3;WAVING WHITE FLAG;So;0;ON;;;;;N;;;;;
> The advantage is no new characters need be encoded.

"Add language to UTR #51 describing the mechanism given in 2A" means
that U+1F3F3 will be the tag introducer, basically the "flag emoji"
equivalent of U+E0001 LANGUAGE TAG.

I think I understand why the TAG/CANCEL TAG start-end mechanism which
was invented for Plane 14 language tags wasn't reused for flag emoji.
Adding U+E0002 FLAG TAG would have implied that the sequence ends with
CANCEL TAG. Flags don't have scope and there is no need to indicate the
end of the sequence explicitly for scoping purposes, as there is with
tagged text.

I assume that existing text with U+1F3F3 followed by no tag characters
should continue to display the waving white flag glyph, whereas text
conforming to this new mechanism should suppress that glyph and show the
Scottish, Welsh, Delawarean, or Nordlending flag instead.

> Using the following notation -
> B designates the chosen base character (U+1F3F3 or new U+1F1E5)
> TL designates a TAG LATIN CAPITAL LETTER (A..Z)
> TD designates a TAG DIGIT (ZERO..NINE)
> TH designates TAG HYPHEN-MINUS
> 
> - a well-formed sequence for designating flags for ISO 3166-1,
> 3166-2 or UN M49 codes would be
>
> B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
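
The grammar above translates fairly directly into a regular expression.
What follows is only a minimal Python sketch, assuming the preferred base
U+1F3F3 (option 2a) and that each TAG character sits at U+E0000 plus the
code point of its ASCII counterpart. The names tag, encode and
is_well_formed are illustrative and do not come from the proposal.

```python
import re

BASE = "\U0001F3F3"   # WAVING WHITE FLAG, the preferred base character
TAG_OFFSET = 0xE0000  # TAG characters mirror ASCII at U+E0000 + code point


def tag(ch):
    """Map one ASCII character to its Plane 14 TAG counterpart."""
    return chr(TAG_OFFSET + ord(ch))


TL = "[{}-{}]".format(tag("A"), tag("Z"))  # TAG LATIN CAPITAL LETTER A..Z
TD = "[{}-{}]".format(tag("0"), tag("9"))  # TAG DIGIT ZERO..NINE
TH = tag("-")                              # TAG HYPHEN-MINUS

# B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
FLAG_SEQUENCE = re.compile(
    "{b}(?:{tl}{{2}}(?:{th}(?:{tl}|{td}){{3}})?|{td}{{3}})".format(
        b=BASE, tl=TL, td=TD, th=TH
    )
)


def encode(code):
    """Build base + TAG characters for a code such as 'FR' or 'GB-SCT'."""
    return BASE + "".join(tag(c) for c in code)


def is_well_formed(text):
    """True if the whole string is one sequence matching the grammar."""
    return FLAG_SEQUENCE.fullmatch(text) is not None
```

Under these assumptions, encode("FR"), encode("GB-SCT") and encode("840")
all pass is_well_formed, while encode("GB-SC") and a bare U+1F3F3 do not.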

Will the subdivision sequence always be exactly 3 characters long? CLDR
ticket #8423 seems to say that ISO 3166-2 code elements that are only 1
or 2 characters long will be prepended with "xx" or "x" to make them all
exactly 3. Obviously some research will need to be done to ensure this
doesn't result in conflicts with existing code elements, and of course
3166-2 makes no promises that future assignments will deliberately avoid
such a conflict.
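
To make the conflict concern concrete, here is a rough Python sketch of
the padding as I read it from the ticket, together with a check for
collisions. The sample codes are made up for illustration and are not
real ISO 3166-2 code elements; lower-casing before comparison is also an
assumption, as are the function names.

```python
from collections import defaultdict


def pad_subdivision(code, fill="x"):
    """Left-pad a 1- to 3-character code element to exactly 3 characters."""
    if not 1 <= len(code) <= 3:
        raise ValueError("expected 1 to 3 characters, got %r" % code)
    # Assumption: codes are compared case-insensitively after padding.
    return fill * (3 - len(code)) + code.lower()


def find_conflicts(codes):
    """Return padded forms that more than one original code maps to."""
    by_padded = defaultdict(list)
    for code in codes:
        by_padded[pad_subdivision(code)].append(code)
    return {p: orig for p, orig in by_padded.items() if len(orig) > 1}


# Hypothetical code elements within a single country:
print(find_conflicts(["A", "XXA", "DE"]))
```

Here "A" and "XXA" both end up as "xxa", which is exactly the kind of
clash that would need to be ruled out for every country.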

Will both mechanisms, old and new, be available for encoding national
flags? For example, for a French flag:

<1F1EB 1F1F7>

or

<1F3F3 E0046 E0052>
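
For reference, both sequences can be derived mechanically from the
region code. A small Python sketch, assuming the regional indicator
symbols start at U+1F1E6 for "A" and that TAG characters are offset from
ASCII by U+E0000; the function names are illustrative only.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
BASE = 0x1F3F3                  # WAVING WHITE FLAG
TAG_OFFSET = 0xE0000            # TAG characters mirror ASCII from here


def regional_indicator_pair(region):
    """Old mechanism: one regional indicator symbol per letter."""
    return [REGIONAL_INDICATOR_A + ord(c) - ord("A") for c in region]


def tagged_flag(region):
    """New mechanism: base character followed by TAG characters."""
    return [BASE] + [TAG_OFFSET + ord(c) for c in region]


def show(code_points):
    """Format code points in the <XXXX XXXX> notation used above."""
    return "<" + " ".join("%04X" % cp for cp in code_points) + ">"


print(show(regional_indicator_pair("FR")))  # <1F1EB 1F1F7>
print(show(tagged_flag("FR")))              # <1F3F3 E0046 E0052>
```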

> In CLDR 28, LDML will define a unicode_subdivision_subtag which also
> provides validity criteria for the codes used for regional
> subdivisions (see CLDR ticket #8423). When representing regional
> subdivisions using ISO 3166-2 codes, only those codes that are valid
> for the LDML unicode_subdivision_subtag should be used.

I note that a preliminary file is already available at
http://unicode.org/repos/cldr/trunk/common/supplemental/subdivisions.xml
.

--
Doug Ewell | http://ewellic.org | Thornton, CO



