Traditional and Simplified Han in UTS 39

Wed Dec 27 23:24:52 CST 2017

The full excerpt from the UTS reads:

> Mark Chinese strings as “mixed script” if they contain both simplified 
> (S) and traditional (T) Chinese characters, using the Unihan data in 
> the Unicode Character Database [UCD 
> <http://www.unicode.org/reports/tr39/#UCD>].
>
>  1. The criterion can only be applied if the language of the string is
>     known to be Chinese. So, for example, the string “写真だけの結婚式”
>     is Japanese, and should not be marked as mixed script because of a
>     mixture of S and T characters.
>  2. Testing for whether a character is S or T needs to be based not on
>     whether the character /has/ a S or T variant, but whether the
>     character /is/ an S or T variant.
>

There are several issues with this.

First and foremost, the definition of S and T variants is not something
that is universally agreed upon. The .cn, .hk and .tw registries use
definitions of S and T variants that do not agree with the Unihan data
in many particulars. Therefore, using the Unihan data would result in
both false positives and false negatives.
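
To make concrete what kind of check is at issue, here is a minimal
sketch in Python. It assumes the usual reading of the Unihan fields: a
character that carries kTraditionalVariant is itself a simplified form,
and one that carries kSimplifiedVariant is itself a traditional form.
The table is a hand-written toy, not an extract of the UCD, and the
function names are mine.

```python
# Toy stand-in for Unihan variant data; NOT a real extract of the UCD.
# A kTraditionalVariant entry means the character itself is simplified;
# a kSimplifiedVariant entry means the character itself is traditional.
TOY_VARIANTS = {
    "写": {"kTraditionalVariant": {"寫"}},
    "寫": {"kSimplifiedVariant": {"写"}},
    "国": {"kTraditionalVariant": {"國"}},
    "國": {"kSimplifiedVariant": {"国"}},
}


def classify(ch, table=TOY_VARIANTS):
    """Return 'S', 'T', or None (neutral, ambiguous, or not in the table)."""
    entry = table.get(ch, {})
    has_trad = bool(entry.get("kTraditionalVariant"))
    has_simp = bool(entry.get("kSimplifiedVariant"))
    if has_trad and not has_simp:
        return "S"
    if has_simp and not has_trad:
        return "T"
    return None


def is_mixed_s_t(text, table=TOY_VARIANTS):
    """True if the text contains at least one S and at least one T character."""
    kinds = {classify(ch, table) for ch in text} - {None}
    return kinds == {"S", "T"}
```

The result depends entirely on the table passed in. Substituting a
registry's own variant table for the Unihan-derived one can change the
answer for the same label, which is the disagreement described above.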

Second, there are many characters that are variants yet are acceptable
in both "S" and "T" labels. You only have to look at the published
Label Generation Rulesets (or IDN tables) for these domains to see many
examples. And, as mentioned above, you cannot reverse engineer these
tables from Unihan data.

Third, the domains mentioned above have a policy of delegating up to
three labels to the same applicant: a "traditional" label, a
"simplified" label, and a mixed label matching the spelling of the
label in the original application (for situations where a mixed label
is appropriate). In other words, certain mixed labels are seen as
appropriate.

Fourth, the Chinese ccTLDs all have a robust policy of preventing any
other mixed label that is a variant of the three from being allocated to
an unrelated party. If you "know" that the language has to be Chinese
because the domain is a ccTLD, then at the same time the check is
superfluous. Other registries are not known to have similar policies, so
for them additional spoof detection may be useful. However, it is in
precisely those cases that it is impossible to know whether a label is
intended to be in the Chinese language.

Fifth, generally the only thing that can be ascertained is that a label
is *not* in Chinese, by virtue of having Kana or Hangul characters mixed
in. However, the reverse is not true: the absence of Kana and Hangul
does not show that a label is Chinese. You will find labels registered
under .jp that do not contain Hiragana or Katakana.
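
A small sketch of this asymmetry, in Python. It tests code points
against Unicode block ranges (Hiragana, Katakana, Hangul Jamo, Hangul
Syllables), which is only an approximation of a real script test, and
the function names are mine.

```python
# Block ranges used as a rough proxy for "Kana or Hangul is present".
KANA_HANGUL_BLOCKS = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x1100, 0x11FF),  # Hangul Jamo
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def contains_kana_or_hangul(label):
    """True if any character falls in one of the blocks listed above."""
    return any(
        lo <= ord(ch) <= hi
        for ch in label
        for lo, hi in KANA_HANGUL_BLOCKS
    )


def language_hint(label):
    """Only a negative conclusion is possible; never returns 'Chinese'."""
    if contains_kana_or_hangul(label):
        return "not Chinese"
    return "undetermined"
```

For the string from the UTS example, language_hint("写真だけの結婚式")
gives "not Chinese", while language_hint("結婚式") gives "undetermined":
a Han-only label may still be Japanese.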

Sixth, for zones that are shared by different CJK languages, the state 
of the art is to have a coordinated policy that prevents "random" 
variant labels from coexisting in the registry. An example of this kind 
of effort is being developed for the root zone. By definition, for the 
root zone, there is no implied information about the language context, 
unlike the case for the second level, where the presence of a ccTLD in 
the full domain name may give a clue.

Seventh, attempting to determine whether a label is potentially valid 
based on variant data (of any kind) is doomed, because actual usage is 
not limited to "pure" labels. The variant mechanism is something that 
works differently (in those registries that apply it): instead of 
looking at a single label, the registry can implement "mutual 
exclusion". Once one variant label from a given set has been delegated, 
all others are excluded (or in practice, all but three, which are 
limited to the same applicant). Without access to the registry data, you 
cannot predict which variants in a set are the "good ones", and with 
access to the data, spoof labels are rejected and cannot be registered.
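
The "mutual exclusion" idea can be sketched as follows, in Python. This
is a toy model, not any registry's actual implementation: the variant
table is hypothetical, the class and function names are mine, and the
limit of three labels per set is simplified to a plain count (the real
policies tie the three to the traditional, simplified, and originally
applied-for forms).

```python
# Hypothetical variant table mapping each character to one representative;
# a real registry publishes its own table (LGR or IDN table).
CANONICAL = {"國": "国", "寫": "写"}

MAX_LABELS_PER_SET = 3  # simplification of the "up to three labels" policy


def variant_key(label):
    """Map a label to a key shared by every label in its variant set."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)


class ToyRegistry:
    def __init__(self):
        # variant key -> (applicant, set of labels delegated so far)
        self._sets = {}

    def register(self, label, applicant):
        """Return True if the label is delegated, False if it is blocked."""
        key = variant_key(label)
        if key not in self._sets:
            self._sets[key] = (applicant, {label})
            return True
        owner, labels = self._sets[key]
        if owner != applicant:
            return False  # variant set already belongs to someone else
        if label in labels:
            return True  # already delegated to this applicant
        if len(labels) >= MAX_LABELS_PER_SET:
            return False
        labels.add(label)
        return True
```

With this model, once "中国" has been delegated to one applicant, a
request for "中國" from an unrelated party is refused, while the same
request from the original applicant succeeds. The decision depends on
the registry's state, not on whether the label looks "pure".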

In conclusion, my recommendation would be to retract this particular 
passage.

A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:
> In UTS 39, it says, that optionally,
>
> "Mark Chinese strings as “mixed script” if they contain both 
> simplified (S) and traditional (T) Chinese characters, using the 
> Unihan data in the Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is 
> known to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"? Is 
> this something algorithmically determinable, or does it come from 
> information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text.  That clearly 
> indicates the language isn't Chinese.  So in this example we can 
> algorithmically rule out that it's Chinese.
>
> And what does Chinese really mean here?
>
>
