<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>On 2024-08-19 03:33, Henri Sivonen wrote:</p>
<blockquote type="cite"
cite="mid:CAJHk+8QHsNv_dkJ5iOvUZ+KJrxSnsS7AaxmR4z_NFjS8uVhhpg@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Aug 16, 2024 at
10:32 PM Jim DeLaHunt &lt;<a
href="mailto:list%2Bunicode@jdlh.com" target="_blank"
moz-do-not-send="true">list+unicode@jdlh.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On
2024-08-15 02:08, Henri Sivonen via Unicode wrote:<br>
<br>
> UTS #39 is commonly used as the baseline for detecting
IDN spoofs, and <br>
> UTS #39 explicitly allows combining Han and Bopomofo.
Considering that <br>
> ㄚ looks confusable with 丫 and ㄠ looks confusable with
幺, I’m wondering <br>
> if it’s appropriate to explicitly allow this
combination in the spoof <br>
> detection context.…<br>
<br>
Are you asking about whether UTS #39 should allow this
combination vs <br>
being changed to forbid this combination? Or are you asking
about <br>
whether the rules of the Domain Name System should allow
this combination?<br>
</blockquote>
<div><br>
</div>
<div>Foremost I'm asking if it's appropriate that browsers
that in general refuse to render mixed-script domain labels
in the Unicode form in the user interface (in the URL bar in
particular) make an exception… for the combination of Han
and Bopomofo.…</div>
</div>
</div>
</blockquote>
<p>Ah. I did not interpret "allow this combination" as referring to
browser location bar behaviour, nor as meaning "display in
Unicode (U-Label) form instead of encoded ASCII (A-Label) form". <br>
</p>
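<p>To make the U-Label / A-Label distinction concrete, here is a rough
Python sketch. It is only an illustration under stated assumptions: the
function names <code>to_a_label</code> and <code>scripts_in</code> are
made up for this example, the script guess from character names is a
crude stand-in for the Script_Extensions data that UTS #39 relies on,
and Python's built-in <code>idna</code> codec implements the older IDNA
2003 rules. It is not how any browser actually decides what to display.</p>
<pre>
```python
import unicodedata


def to_a_label(u_label):
    """Encode one label to its ASCII form with Python's built-in
    "idna" codec (IDNA 2003, so only an approximation of current rules)."""
    return u_label.encode("idna").decode("ascii")


def scripts_in(label):
    """Guess which scripts a label uses from Unicode character names.

    Hypothetical helper for illustration only; a real check would use
    the Script_Extensions property rather than name prefixes.
    """
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("Han")
        elif name.startswith("BOPOMOFO"):
            found.add("Bopomofo")
        elif name.startswith("KATAKANA"):
            found.add("Katakana")
        else:
            found.add("Other")
    return found


if __name__ == "__main__":
    # A well-known example: "münchen" becomes "xn--mnchen-3ya".
    print(to_a_label("münchen"))
    # The confusable pairs from this thread, mixed within one label.
    for label in ("丫ㄚ", "口ロ"):
        print(label, to_a_label(label), sorted(scripts_in(label)))
```
</pre>
<p>A label for which <code>scripts_in</code> reports both Han and
Bopomofo is the kind of label the question is about: should it be shown
in its Unicode form, or in the ASCII form that <code>to_a_label</code>
produces?</p>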
<p>So are you asking whether browsers should indicate to users that a
domain name which combines Han and Bopomofo is untrustworthy?</p>
<p>…</p>
<p>Also,<br>
</p>
<blockquote type="cite"
cite="mid:CAJHk+8QHsNv_dkJ5iOvUZ+KJrxSnsS7AaxmR4z_NFjS8uVhhpg@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
…There are a set of Label Generation Rules for the root
zone[2] of the <br>
DNS. They include rules for Chinese script labels[3] in the
root zone. <br>
In my simple-minded reading of those rules, Bopomofo
characters are not <br>
included in the repertoire. I suspect that means that the
rules prevent <br>
anyone from registering a .ㄅㄆㄇㄈ top-level domain, or a
Chinese domain <br>
with Bopomofo inclusions.<br>
…<br>
[2] <a class="moz-txt-link-rfc2396E"
href="https://icannwiki.org/Root_Zone_Label_Generation_Rules">&lt;https://icannwiki.org/Root_Zone_Label_Generation_Rules&gt;</a><br>
[3] <a class="moz-txt-link-rfc2396E"
href="https://www.icann.org/sites/default/files/lgr/rz-lgr-5-chinese-script-26may22-en.html">&lt;https://www.icann.org/sites/default/files/lgr/rz-lgr-5-chinese-script-26may22-en.html&gt;</a>
<br>
</blockquote>
<div><br>
</div>
<div>It indeed looks like the root LGRs currently don't allow
Bopomofo, but it appears that they also don't allow Cyrillic
TLDs, which do exist, so it seems that root LGRs are enough
in a work-in-progress state not to draw definite conclusions
from.<br>
</div>
</div>
</div>
</blockquote>
<p>I overlooked something important in [2]: the ICANNWiki content is
not ICANN content; ICANNWiki is a separate org documenting ICANN. And
it turns out that their Root Zone Label Generation Rules page at [2]
has stale content. ICANN's own page on Root Zone Label Generation
Rules [6] describes version 5 of the root zone LGRs, which includes
entries for Cyrillic, Japanese, and Korean scripts in addition to
Chinese.</p>
<p>(I am making a note to update the ICANNWiki Root Zone LGRs page,
[2], if that is how their wiki works.)<br>
</p>
<p>[6]
<a class="moz-txt-link-rfc2396E" href="https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en">&lt;https://www.icann.org/resources/pages/root-zone-lgr-2015-06-21-en&gt;</a>
(Content dates from 2022. No, I don't know why they have a 2015
date in their URL.)<br>
</p>
<blockquote type="cite"
cite="mid:CAJHk+8QHsNv_dkJ5iOvUZ+KJrxSnsS7AaxmR4z_NFjS8uVhhpg@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
I understand that each top-level registry sets the rules for
<br>
second-level labels they will accept, though there is
pressure from <br>
ICANN communities to adopt standard LGRs. There are a set of
suggested <br>
Label Generation Rules for second-level labels[4]. As I read
those <br>
rules, at a superficial level they also seem to rule out
Bopomofo <br>
characters within Chinese language labels or Bopomofo-only
labels.<br>
</blockquote>
<div><br>
</div>
<div>That particular rule set also excludes Hiragana and
Katakana, so it's not clear that LGRs for Hani existing
means the exclusion of Hanb, <span>Jpan, and Kore</span>.…
</div>
</div>
</div>
</blockquote>
<p>Have a look at the version 5 LGRs [6]. There may also be
second-level LGRs for other scripts like Japanese, Korean, and
Cyrillic. I have not checked. Does that clarify?</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:CAJHk+8QHsNv_dkJ5iOvUZ+KJrxSnsS7AaxmR4z_NFjS8uVhhpg@mail.gmail.com">
<div dir="ltr">
<div class="gmail_quote">
<div>…(I didn't ask about Jpan in my initial post, despite Han
口 and Katakana ロ existing, because of the different role of
Hiragana and Katakana compared to the role of Bopomofo. I
didn't ask about Kore, because I'm not aware of a
confusability issue even if I have doubts about demand for
Han + Hangul domain labels. I am curious, though, how users
and domain holders deal with the 口 vs. ロ issue. Is the glyph
size distinction consistent and obvious enough?)<br>
</div>
</div>
</div>
</blockquote>
<p>You are not the first person to ask this question. There are answers
at Japanese Stack Exchange[7], Reddit[8], and WaniKani[9]. Summary:
readers differentiate them based on context, and sometimes, when the
context is ambiguous, people interpret the written kanji to be the
kana. The best summary: "Context is always the key in Japanese."
Those replies also point out other visually similar kana and kanji
pairs.</p>
<p>[7]
<a class="moz-txt-link-rfc2396E" href="https://japanese.stackexchange.com/questions/13678/%E5%8F%A3%E3%83%AD-those-are-supposed-to-be-different-characters-how-can-you-tell/3025">&lt;https://japanese.stackexchange.com/questions/13678/%E5%8F%A3%E3%83%AD-those-are-supposed-to-be-different-characters-how-can-you-tell/3025&gt;</a><br>
[8]
<a class="moz-txt-link-rfc2396E" href="https://www.reddit.com/r/LearnJapanese/comments/ck3w4w/the_one_time_its_okay_to_confuse_%E3%83%AD_and_%E5%8F%A3_%E3%83%AD%E3%83%91%E3%82%AF/">&lt;https://www.reddit.com/r/LearnJapanese/comments/ck3w4w/the_one_time_its_okay_to_confuse_%E3%83%AD_and_%E5%8F%A3_%E3%83%AD%E3%83%91%E3%82%AF/&gt;</a><br>
[9]
<a class="moz-txt-link-rfc2396E" href="https://community.wanikani.com/t/katakana-ro-vs-mouth-kanji/26641">&lt;https://community.wanikani.com/t/katakana-ro-vs-mouth-kanji/26641&gt;</a></p>
<p>I hope this is helpful. Cheers!<br>
—Jim DeLaHunt<br>
</p>
<pre class="moz-signature" cols="72">--
. --Jim DeLaHunt, <a class="moz-txt-link-abbreviated" href="mailto:jdlh@jdlh.com">jdlh@jdlh.com</a> <a class="moz-txt-link-freetext" href="http://blog.jdlh.com/">http://blog.jdlh.com/</a> (<a class="moz-txt-link-freetext" href="http://jdlh.com/">http://jdlh.com/</a>)
multilingual websites consultant, Vancouver, B.C., Canada</pre>
</body>
</html>