Hanb in domain labels

Jim DeLaHunt list+unicode at jdlh.com
Fri Aug 16 14:32:11 CDT 2024


On 2024-08-15 02:08, Henri Sivonen via Unicode wrote:

> UTS #39 is commonly used as the baseline for detecting IDN spoofs, and 
> UTS #39 explicitly allows combining Han and Bopomofo. Considering that 
> ㄚ looks confusable with 丫 and ㄠ looks confusable with 幺, I’m wondering 
> if it’s appropriate to explicitly allow this combination in the spoof 
> detection context.…

Are you asking about whether UTS #39 should allow this combination vs 
being changed to forbid this combination? Or are you asking about 
whether the rules of the Domain Name System should allow this combination?

I am involved with Universal Acceptance advocacy[1]. That means I have 
one foot in the DNS world, and the ICANN rules which govern it. I am not 
an expert, but I am aware of some principles there. My understanding is 
that the DNS world writes its own rules for detecting and preventing IDN 
spoofs. I have not heard that UTS #39 is a fundamental document for them.

> Is combining Han and Bopomofo in one domain label something that 
> occurs commonly enough in domains…?

This sounds like a question about the DNS: what names are already 
registered, and what the rules are for registering further names. The 
former is backward-looking, the latter is forward-looking. Thus the 
answer has two parts.

For the forward-looking question, I have some awareness of the rules 
ICANN has put into place. Again, I am not an expert, but I have heard 
experts talk about some of the terminology and concepts.

The ICANN communities have put a lot of effort in recent years into 
"Label Generation Rules". ("Label" means the identifiers separated by 
periods in a domain name. In "example.com", "example" and "com" are 
Labels.) The LGRs are script-specific, so there are LGRs for scripts 
like Chinese, Bangla, Arabic, etc. The LGRs specifically try to prevent 
spoofs and confusion between labels. The LGRs define a repertoire of 
characters which may be used in a label. They define characters or 
strings which are variants of each other, which a human reader might 
consider to have the same meaning. There are rules under which the 
registration of one variant label requires that the other variant labels 
either be registered to the same entity, or be protected from registration.
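
As a rough illustration of the repertoire and variant ideas above, here 
is a minimal Python sketch. It is not how any real LGR is implemented: 
the tiny REPERTOIRE and VARIANT_SETS tables and the function names are 
made up for this example, and real LGRs carry far larger tables and more 
detailed dispositions for each variant.

```python
from itertools import product

# Hypothetical toy repertoire; a real LGR lists thousands of code points.
REPERTOIRE = set("中国國文")

# Hypothetical variant sets: characters a reader may treat as the same.
VARIANT_SETS = [set("国國")]


def in_repertoire(label):
    """True if every character of the label is in the allowed repertoire."""
    return all(ch in REPERTOIRE for ch in label)


def variant_labels(label):
    """All labels reachable by swapping characters for their variants."""
    choices = []
    for ch in label:
        group = next((s for s in VARIANT_SETS if ch in s), {ch})
        choices.append(sorted(group))
    return {"".join(combo) for combo in product(*choices)}


def may_register(label, registrant, registered):
    """Allow a label only if no variant of it belongs to someone else.

    `registered` maps already-registered labels to their registrants.
    """
    if not in_repertoire(label):
        return False
    return all(
        registered.get(v, registrant) == registrant
        for v in variant_labels(label)
    )
```

With these toy tables, a registrant who already holds one variant may 
take the other, while a different registrant is refused.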

There is a set of Label Generation Rules for the root zone[2] of the 
DNS. They include rules for Chinese script labels[3] in the root zone. 
In my simple-minded reading of those rules, Bopomofo characters are not 
included in the repertoire. I suspect that means that the rules prevent 
anyone from registering a .ㄅㄆㄇㄈ top-level domain, or a Chinese domain 
with Bopomofo inclusions.
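
For anyone who wants to test labels for Bopomofo themselves, a check 
along these lines is one possible starting point. It is only a sketch 
under stated assumptions: it uses Unicode block ranges as a stand-in for 
the Script property (which the Python standard library does not expose), 
it covers only the basic CJK Unified Ideographs block for Han, and the 
function names are invented for this example.

```python
# Unicode block ranges, used as an approximation of the Script property.
BOPOMOFO_RANGES = [(0x3100, 0x312F), (0x31A0, 0x31BF)]  # Bopomofo, Bopomofo Extended
HAN_RANGES = [(0x4E00, 0x9FFF)]  # basic CJK Unified Ideographs block only


def _in_ranges(ch, ranges):
    """True if the character's code point falls in any of the ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)


def has_bopomofo(label):
    """True if the label contains at least one Bopomofo character."""
    return any(_in_ranges(ch, BOPOMOFO_RANGES) for ch in label)


def mixes_han_and_bopomofo(label):
    """True if the label contains both Han and Bopomofo characters."""
    has_han = any(_in_ranges(ch, HAN_RANGES) for ch in label)
    return has_han and has_bopomofo(label)
```

Calling has_bopomofo("ㄅㄆㄇㄈ") or mixes_han_and_bopomofo("中ㄚ文") returns 
True, while a Han-only label such as "丫幺" is not flagged by either 
function.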

I understand that each top-level registry sets the rules for 
second-level labels they will accept, though there is pressure from 
ICANN communities to adopt standard LGRs. There is a set of suggested 
Label Generation Rules for second-level labels[4]. As I read those 
rules, at a superficial level they also seem to rule out Bopomofo 
characters within Chinese language labels or Bopomofo-only labels.

If you really want to understand what rules govern domain names, don't 
rely on my simple-minded understanding. Get in touch with ICANN 
communities[5] who specialise in those rules. The Generic Names 
Supporting Organisation might be a good place to start.

For the backward-looking question, about what names are already 
registered in various top-level domains, I don't have specific 
information. I have the impression that a lot of domain names were 
registered before the current LGRs were developed. I won't be surprised 
to hear that some of them don't comply with the LGRs. For instance, the 
.com and .org domains might have registered some labels with Bopomofo 
characters in the past. Again, the ICANN communities[5] would be a place 
to ask.

All of that seems to say that (if my understanding is correct) 
"combining Han and Bopomofo in one domain label" is not "something that 
occurs commonly… in domains" registered under the LGRs, but that it might 
have occurred with legacy labels registered in the past.

Does this help answer your questions?
       —Jim DeLaHunt

[1] <https://uasg.tech/>
[2] <https://icannwiki.org/Root_Zone_Label_Generation_Rules>
[3] 
<https://www.icann.org/sites/default/files/lgr/rz-lgr-5-chinese-script-26may22-en.html>
[4] 
<https://www.icann.org/sites/default/files/packages/lgr/lgr-second-level-chinese-full-variant-script-24jan24-en.html>
[5] <https://www.icann.org/community>

-- 
.   --Jim DeLaHunt, jdlh at jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant, Vancouver, B.C., Canada


