Hanb in domain labels
Henri Sivonen
hsivonen at mozilla.com
Mon Aug 19 05:33:48 CDT 2024
On Fri, Aug 16, 2024 at 10:32 PM Jim DeLaHunt <list+unicode at jdlh.com> wrote:
> On 2024-08-15 02:08, Henri Sivonen via Unicode wrote:
>
> > UTS #39 is commonly used as the baseline for detecting IDN spoofs, and
> > UTS #39 explicitly allows combining Han and Bopomofo. Considering that
> > ㄚ looks confusable with 丫 and ㄠ looks confusable with 幺, I’m wondering
> > if it’s appropriate to explicitly allow this combination in the spoof
> > detection context.…
>
> Are you asking about whether UTS #39 should allow this combination vs
> being changed to forbid this combination? Or are you asking about
> whether the rules of the Domain Name System should allow this combination?
>
Foremost, I'm asking whether it's appropriate that browsers that generally
refuse to render mixed-script domain labels in Unicode form in the user
interface (in the URL bar in particular) make an exception for the
combination of Han and Bopomofo because UTS #39 makes that exception. One
alternative would be to treat Han and Bopomofo in one label the way, e.g.,
Greek and Cyrillic mixed in one label are treated: refusing to render
the label in the Unicode form. A more complex possibility would be to check
if a label that contains both Han and Bopomofo contains specific confusable
characters and refuse to render Hanb labels in the Unicode form if specific
confusable characters (Han or Bopomofo) are present.
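
To make the three behaviors concrete, here is a minimal Python sketch. It is not how any browser implements this. The function names and policy strings are made up for illustration, script detection is approximated from Unicode character names rather than the real Script property, and the confusable set holds only the two pairs mentioned earlier in this thread, not data from UTS #39.

```python
import unicodedata

# Illustrative only: the two pairs from this thread (ㄚ/丫 and ㄠ/幺),
# not data taken from the UTS #39 confusables file.
CONFUSABLE_IN_HANB = {"\u311A", "\u4E2B", "\u3120", "\u5E7A"}


def rough_script(ch):
    """Approximate the script from the character name (an assumption for
    this sketch; real code would use the Unicode Script property)."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"
    if name.startswith("BOPOMOFO"):
        return "Bopomofo"
    return "Other"


def show_as_unicode(label, policy):
    """Decide whether a label mixing Han and Bopomofo is shown in Unicode form.

    policy is one of the hypothetical names:
      "allow"                - the current exception
      "refuse"               - treat like Greek + Cyrillic
      "refuse_if_confusable" - refuse only if specific characters are present
    """
    scripts = {rough_script(ch) for ch in label}
    if not {"Han", "Bopomofo"} <= scripts:
        return True  # not a Han + Bopomofo mix; out of scope for this sketch
    if policy == "allow":
        return True
    if policy == "refuse":
        return False
    if policy == "refuse_if_confusable":
        return not any(ch in CONFUSABLE_IN_HANB for ch in label)
    raise ValueError(f"unknown policy: {policy}")
```

Under the third policy, a label such as "中ㄅ" would still be shown in Unicode form, while "中ㄚ" would not.
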
If the conclusion is that either of the alternative behaviors above would
be more appropriate than special-casing Han+Bopomofo as a permitted
mixed-script combination, the next question is whether UTS #39 should
change accordingly.

> I am involved with Universal Acceptance advocacy[1]. That means I have
> one foot in the DNS world, and the ICANN rules which govern it. I am not
> an expert, but I am aware of some principles there. My understanding is
> that the DNS world writes its own rules for detecting and preventing IDN
> spoofs. I have not heard that UTS #39 is a fundamental document for them.
>
UTS #39 is a fundamental document for Firefox and Chrome (could be for
Safari, too, but I don't know) as the baseline of IDN spoof detection.
(More checks are layered on top, though.)

> > Is combining Han and Bopomofo in one domain label something that
> > occurs commonly enough in domains…?
>
> This sounds like a question about the DNS: what names are already
> registered, and what the rules are for registering further names. The
> former is backward-looking, the latter is forward-looking. Thus the
> answer has two parts.
>
Indeed. Though the backward-looking history is long enough that future
demand can probably be inferred from the backward-looking part.

> For the backward-looking question, I have some awareness of the rules
> ICANN has put into place. Again, I am not an expert, but I have heard
> experts talk about some of the terminology and concepts.
>
> The ICANN communities have put a lot of effort in recent years into
> "Label Generation Rules". ("Label" means the identifiers separated by
> periods in a domain name. In "example.com", "example" and "com" are
> Labels.) The LGRs are script-specific, so there are LGRs for scripts
> like Chinese, Bangla, Arabic, etc. The LGRs specifically try to prevent
> spoofs and confusion between labels. The LGRs define a repertoire of
> characters which may be used in a label. They define characters or
> strings which are variants of each other, which a human reader might
> consider to have the same meaning. There are rules under which the
> registration of one variant label requires that the other variant labels
> either be registered to the same entity or be protected from registration.
>
> There is a set of Label Generation Rules for the root zone[2] of the
> DNS. They include rules for Chinese script labels[3] in the root zone.
> In my simple-minded reading of those rules, Bopomofo characters are not
> included in the repertoire. I suspect that means that the rules prevent
> anyone from registering a .ㄅㄆㄇㄈ top-level domain, or a Chinese domain
> with Bopomofo inclusions.
>
It indeed looks like the root LGRs currently don't allow Bopomofo, but it
appears that they also don't allow Cyrillic TLDs, which do exist, so the
root LGRs seem to be too much of a work in progress to draw definite
conclusions from.

> I understand that each top-level registry sets the rules for
> second-level labels they will accept, though there is pressure from
> ICANN communities to adopt standard LGRs. There is a set of suggested
> Label Generation Rules for second-level labels[4]. As I read those
> rules, at a superficial level they also seem to rule out Bopomofo
> characters within Chinese language labels or Bopomofo-only labels.
>
That particular rule set also excludes Hiragana and Katakana, so it's not
clear that the existence of LGRs for Hani implies the exclusion of Hanb, Jpan, and
Kore. (I didn't ask about Jpan in my initial post, despite Han 口 and
Katakana ロ existing, because of the different role of Hiragana and Katakana
compared to the role of Bopomofo. I didn't ask about Kore, because I'm not
aware of a confusability issue even if I have doubts about demand for Han +
Hangul domain labels. I am curious, though, how users and domain holders
deal with the 口 vs. ロ issue. Is the glyph size distinction consistent and
obvious enough?)
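
For readers less familiar with the script codes above: Hanb, Jpan, and Kore are the combined writing-system codes that UTS #39 uses when deciding whether a Han-containing string counts as single-script. The Python sketch below is a simplification of that idea, not the UTS #39 algorithm itself. The name `covering_sets` is hypothetical, and scripts are passed in as plain strings instead of being computed from characters.

```python
# Combined codes and the scripts each one covers, as used by UTS #39
# mixed-script detection. Latin and other scripts are left out of this sketch.
AUGMENTED_SETS = {
    "Hanb": {"Han", "Bopomofo"},
    "Jpan": {"Han", "Hiragana", "Katakana"},
    "Kore": {"Han", "Hangul"},
}


def covering_sets(scripts):
    """Return the combined codes whose member scripts cover all of `scripts`.

    An empty result means none of Hanb, Jpan, or Kore explains the mix,
    so a checker built on this idea would treat the label as mixed-script.
    """
    wanted = set(scripts)
    return sorted(
        name for name, members in AUGMENTED_SETS.items() if wanted <= members
    )
```

With this sketch, `covering_sets({"Han", "Bopomofo"})` yields only `["Hanb"]`, and `covering_sets({"Greek", "Cyrillic"})` yields an empty list.
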
> All of that seems to say that (if my understanding is correct),
> "combining Han and Bopomofo in one domain label" is not "something that
> occurs commonly… in domains" registered under the LGRs, but that might
> have occurred with legacy labels registered in the past.
>
Thanks.
--
Henri Sivonen
hsivonen at mozilla.com