21-bit codepoints versus JIS?

Fri Nov 8 08:16:45 CST 2024

Dear Jim,

On 2024/11/08 20:58, Jim Breen wrote:
> On Fri, 8 Nov 2024 at 19:05, suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> wrote:
> 
>> I understand your background is academic study of Japanese language, but
>> is there any special reason to mention JIS X 0213 during the discussion
>> of the general-purpose encoding scheme UTF-8?
> 
> It was an aside. (My academic background is in computer science;
> Japanese NLP is a diversion which I have followed in my retirement.)

Oh, thank you for correcting my misunderstanding!

> The original question was about the source code for UTF-8, and the OP
> mentioned using Debian Linux. I wanted to point out that there was
> source code available for conversion of codes to UTF-8. I tossed in a
> representation of the conversion of 16-bit Unicode points into 3-byte
> UTF-8 sequences. (All the characters in JIS X 0208 and JIS X 0212 were
> incorporated in the initial Unicode version.) Markus Scherer added
> the representation of 21-bit Unicode in UTF-8, so I pointed out that
> relatively few kanji in the JIS standards have 21-bit codepoints.

Correct. But why should we restrict the focus to the JIS character sets?
I could not find any reason to give the JIS charsets priority in the
(painful) discussion...

If I focused on the ISO/IEC 8859 character sets, I could say that the
usage of 16-bit codepoints is rare, but if I said so, (I believe) many
experts on this mailing list would reply, "sorry, our discussion is
more generic".
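
As a minimal illustration of why the charset of origin does not matter
(a Python sketch; the sample characters are chosen only for this
example, and the helper names utf8_length and describe are
hypothetical): the number of UTF-8 bytes depends only on the range of
the codepoint.

```python
# Sample characters, chosen only for illustration.
SAMPLES = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, e with acute accent, in ISO/IEC 8859-1
    "\u9dd7",      # U+9DD7, 鷗, inside the BMP
    "\U00020b9f",  # U+20B9F, 𠮟, outside the BMP
]


def utf8_length(ch: str) -> int:
    """Return how many bytes Python's UTF-8 codec emits for one character."""
    return len(ch.encode("utf-8"))


def describe(ch: str) -> str:
    """Summarise codepoint, significant bits and UTF-8 length of one character."""
    cp = ord(ch)
    return (f"U+{cp:04X}: {cp.bit_length()} significant bits, "
            f"{utf8_length(ch)} UTF-8 byte(s)")


for ch in SAMPLES:
    print(describe(ch))
```

Only the last sample needs more than 16 bits, and it is also the only
one that needs a 4-byte UTF-8 sequence.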

>> In Japan, many running systems keep the restriction of JIS X 0208,
>> especially in public sectors.
> 
> Interesting comment. I guess you are aware that several of the changes
> and additions made in the 2010 revision of the 常用漢字 involved the use
> of kanji from outside JIS X 0208. Also, government bodies such as 文化庁
> have been encouraging the use of Unicode-only kanji in lists such as
> the 表外漢字字体表.
> [...]

It is questionable whether 文化庁 was ambitious enough to want to
replace JIS X 0208 + 0212 with ISO/IEC 10646. I guess they did not
understand character encodings or the industrial standards.

I guess the earliest motivation of 表外漢字字体表 was not the
extension of the character set - their motivation would have been the
elimination of the "non-authentic simplified forms" of the characters,
as far as they had been exceptionally permitted by 常用漢字 1981.

Maybe the people driving 表外漢字字体表 had a dream that their
result would urge Japanese IT companies to replace the simplified glyphs
on JIS X 0208:1983-based systems with more traditional glyph shapes,
like 鷗, 𠮟, 噓, etc., without changing the character encoding scheme.

Unfortunately, it was too late to realize such a dream. As you know,
these glyph shapes were already coded as different characters in
ISO/IEC 10646, and Japanese IT companies could no longer afford to
rebuild their systems without ISO/IEC 10646-based frameworks.
Even when governmental customers ask Japanese vendors to build
a system supporting characters which are not in JIS X 0208
but exist in ISO/IEC 10646, some Japanese vendors sell a system
in which the non-JIS characters are coded at the PUA codepoints of a
JIS X 0208-based encoding (like Windows-31J), because they have no
experience of designing any other mechanism.

The "authentic traditional forms" coded in JIS X 0213:2004 are
still tagged as [環境依存文字] ("environment-dependent characters") by
Microsoft IME, so many people think they are non-portable characters.
In fact, if I make a file whose filename includes "𠮟" (U+20B9F)
instead of "叱" (U+53F1) on Microsoft Windows as sold in Japan (running
under the Japanese locale), I cannot put it in a ZIP file with the
built-in file manager (so-called "Explorer"). The file manager warns
that "there are characters which cannot be used in a compressed
folder". Clearly, this is a restriction coming from Windows-31J.
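
The difference between the two characters can be reproduced outside
Explorer. The sketch below is in Python and assumes that Python's
"cp932" codec is close enough to Windows-31J for this purpose; whether
Explorer's compressed-folder code consults the same table is an
assumption, not something verified here. The helper name fits_in_cp932
is hypothetical.

```python
def fits_in_cp932(text: str) -> bool:
    """Return True if every character of text can be encoded with cp932."""
    try:
        text.encode("cp932")
    except UnicodeEncodeError:
        # At least one character has no mapping in the cp932 table.
        return False
    return True


# U+53F1 (叱) is the JIS X 0208 form.
print(fits_in_cp932("\u53f1"))
# U+20B9F (𠮟) is the form added by JIS X 0213:2004, outside the BMP.
print(fits_in_cp932("\U00020b9f"))
```

The first call reports that the character fits and the second reports
that it does not, which matches the behaviour described above.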

>> I think the popularity of "21-bit Unicode codepoints" in Japanese text is
>> highly dependent on the category of the text.
> 
> Absolutely.  Despite some misguided grumbling in Japan about Unicode
> in its early days, it's what virtually everyone uses now, and no-one
> is really aware whether the codepoints are 16 or 21 bits.

I remember that some lecturers in Japanese universities, whose role is
to teach information technology to young students, are still teaching
that there are two kinds of character encodings: single-byte encodings
like ASCII, and double-byte encodings like JIS-kanji...
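
A small counter-example to that two-way classification, as a Python
sketch (the helper name byte_lengths is hypothetical, and Python's
"shift_jis" codec stands in here for the "JIS-kanji" encodings): even
Shift_JIS mixes one-byte and two-byte characters, and UTF-16 needs four
bytes for a character outside the BMP.

```python
def byte_lengths(text: str, encoding: str) -> list:
    """Return the encoded length in bytes of each character of text."""
    return [len(ch.encode(encoding)) for ch in text]


# "A", halfwidth katakana ｱ (U+FF71), kanji 漢 (U+6F22)
print(byte_lengths("A\uff71\u6f22", "shift_jis"))

# kanji 漢 (U+6F22) and 𠮟 (U+20B9F); "utf-16-le" avoids counting a BOM
print(byte_lengths("\u6f22\U00020b9f", "utf-16-le"))
```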

Regards,
mpsuzuki