Hanb in domain labels

Mon Aug 19 14:19:35 CDT 2024

On 2024-08-19 03:33, Henri Sivonen wrote:

> On Fri, Aug 16, 2024 at 10:32 PM Jim DeLaHunt <list+unicode at jdlh.com> wrote:
>
>     On 2024-08-15 02:08, Henri Sivonen via Unicode wrote:
>
>     > UTS #39 is commonly used as the baseline for detecting IDN spoofs, and
>     > UTS #39 explicitly allows combining Han and Bopomofo. Considering that
>     > ㄚ looks confusable with 丫 and ㄠ looks confusable with 幺, I’m wondering
>     > if it’s appropriate to explicitly allow this combination in the spoof
>     > detection context.…
>
>     Are you asking about whether UTS #39 should allow this combination vs
>     being changed to forbid this combination? Or are you asking about
>     whether the rules of the Domain Name System should allow this
>     combination?
>
>
> Foremost I'm asking if it's appropriate that browsers that in general 
> refuse to render mixed-script domain labels in the Unicode form in the 
> user interface (in the URL bar in particular) make an exception… for 
> the combination of Han and Bopomofo.…

Ah. I did not interpret "allow this combination" as referring to browser 
location bar behaviour, nor as meaning "display in Unicode (U-Label) 
form instead of encoded ASCII (A-Label) form".
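
To make the U-Label versus A-Label distinction concrete, here is a minimal 
Python sketch. It assumes only the standard library's "punycode" codec, 
and the function names to_a_label and to_u_label are made up for this 
illustration. It performs none of the IDNA mapping or validation steps 
that a real registry or browser would apply, so treat it as a rough 
picture of the encoding step only.

```python
def to_a_label(u_label: str) -> str:
    """Encode a Unicode label into its ASCII-compatible 'xn--' form.

    Illustration only: no IDNA mapping, normalization, or validity
    checks are performed here.
    """
    return "xn--" + u_label.encode("punycode").decode("ascii")


def to_u_label(a_label: str) -> str:
    """Decode an 'xn--' label back into its Unicode form."""
    if a_label[:4].lower() != "xn--":
        raise ValueError("not an A-Label: missing 'xn--' prefix")
    return a_label[4:].encode("ascii").decode("punycode")


if __name__ == "__main__":
    label = "丫ㄚ"  # Han 丫 followed by Bopomofo ㄚ
    encoded = to_a_label(label)
    print(encoded)              # the form a browser would fall back to
    print(to_u_label(encoded))  # round-trips to the original label
```

A browser that declines to show the Unicode form would display the 
encoded string in the location bar instead.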

So are you asking whether browsers should indicate to users that a domain 
name which combines Han and Bopomofo is untrustworthy?
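
For illustration, here is a rough Python sketch of how a label that mixes 
Han and Bopomofo could be spotted. It is not the UTS #39 algorithm and not 
what any browser actually does: Python's standard library does not expose 
the Unicode Script property, so this guesses the script from the 
character name prefix. The names PREFIXES, rough_script, 
scripts_in_label, and is_han_bopomofo_mix are all hypothetical.

```python
import unicodedata

# Assumption: approximate a character's script from the start of its
# Unicode character name. A real implementation would use the Script and
# Script_Extensions properties instead.
PREFIXES = {
    "CJK UNIFIED IDEOGRAPH": "Han",
    "BOPOMOFO": "Bopomofo",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "CYRILLIC": "Cyrillic",
    "LATIN": "Latin",
}


def rough_script(ch: str) -> str:
    """Return an approximate script name for one character."""
    name = unicodedata.name(ch, "")
    for prefix, script in PREFIXES.items():
        if name.startswith(prefix):
            return script
    return "Other"


def scripts_in_label(label: str) -> set:
    """Return the set of approximate scripts used in a label."""
    return {rough_script(ch) for ch in label}


def is_han_bopomofo_mix(label: str) -> bool:
    """True when the label uses exactly Han plus Bopomofo."""
    return scripts_in_label(label) == {"Han", "Bopomofo"}


if __name__ == "__main__":
    for label in ["丫ㄚ", "丫幺", "口ロ"]:
        print(label, sorted(scripts_in_label(label)), is_han_bopomofo_mix(label))
```

A check of this kind only reports which scripts are present. Whether a 
given combination should be shown in Unicode form is the policy question 
under discussion here.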

…

Also,

>     …There are a set of Label Generation Rules for the root zone[2] of the
>     DNS. They include rules for Chinese script labels[3] in the root zone.
>     In my simple-minded reading of those rules, Bopomofo characters are not
>     included in the repertoire. I suspect that means that the rules prevent
>     anyone from registering a .ㄅㄆㄇㄈ top-level domain, or a Chinese domain
>     with Bopomofo inclusions.
>     …
>     [2] <https://icannwiki.org/Root_Zone_Label_Generation_Rules>
>     [3]
>     <https://www.icann.org/sites/default/files/lgr/rz-lgr-5-chinese-script-26may22-en.html>
>
>
>
> It indeed looks like the root LGRs currently don't allow Bopomofo, but 
> it appears that they also don't allow Cyrillic TLDs, which do exist, 
> so it seems that root LGRs are enough in a work-in-progress state not 
> to draw definite conclusions from.

I overlooked something important in [2]: the ICANNWiki content is not 
ICANN content. ICANNWiki is a separate organization that documents ICANN, 
and it turns out that its Root Zone Label Generation Rules page at [2] has 
stale content. ICANN's own page on Root Zone Label Generation Rules [6] 
describes version 5 of the root zone LGRs, which include entries for 
Cyrillic, Japanese, and Korean scripts in addition to Chinese.

(I am making a note to update the ICANNWiki Root Zone LGRs page, [2], if 
that is how their wiki works.)

[6] <https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en> 
(Content dates from 2022. No, I don't know why they have a 2015 date in 
their URL.)

>     I understand that each top-level registry sets the rules for
>     second-level labels they will accept, though there is pressure from
>     ICANN communities to adopt standard LGRs. There are a set of suggested
>     Label Generation Rules for second-level labels[4]. As I read those
>     rules, at a superficial level they also seem to rule out Bopomofo
>     characters within Chinese language labels or Bopomofo-only labels.
>
>
> That particular rule set also excludes Hiragana and Katakana, so it's 
> not clear that LGRs for Hani existing means the exclusion of Hanb, 
> Jpan, and Kore.…

Have a look at the version 5 LGRs [6]. There may also be second-level 
LGRs for other scripts like Japanese, Korean, and Cyrillic. I have not 
checked. Does that clarify?

> …(I didn't ask about Jpan in my initial post, despite Han 口 and 
> Katakana ロ existing, because of the different role of Hiragana and 
> Katakana compared to the role of Bopomofo. I didn't ask about Kore, 
> because I'm not aware of a confusability issue even if I have doubts 
> about demand for Han + Hangul domain labels. I am curious, though, how 
> users and domain holders deal with the 口 vs. ロ issue. Is the glyph 
> size distinction consistent and obvious enough?)

You are not the first person to ask this question. There are answers at 
Japanese Stack Exchange[7], Reddit[8], and WaniKani[9]. Summary: readers 
differentiate them based on context, and sometimes, when the context is 
ambiguous, people interpret the written kanji as the kana. The best 
summary: "Context is always the key in Japanese." Those replies also 
point out other visually similar kana and kanji pairs.

[7] 
<https://japanese.stackexchange.com/questions/13678/%E5%8F%A3%E3%83%AD-those-are-supposed-to-be-different-characters-how-can-you-tell/3025>
[8] 
<https://www.reddit.com/r/LearnJapanese/comments/ck3w4w/the_one_time_its_okay_to_confuse_%E3%83%AD_and_%E5%8F%A3_%E3%83%AD%E3%83%91%E3%82%AF/>
[9] <https://community.wanikani.com/t/katakana-ro-vs-mouth-kanji/26641>

I hope this is helpful. Cheers!
      —Jim DeLaHunt

-- 
.   --Jim DeLaHunt,jdlh at jdlh.com      http://blog.jdlh.com/  (http://jdlh.com/)
       multilingual websites consultant, Vancouver, B.C., Canada