21-bit codepoints versus JIS?

Fri Nov 8 05:58:40 CST 2024

On Fri, 8 Nov 2024 at 19:05, suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp> wrote:

> I understand your background is academic study of Japanese language, but
> is there any special reason to mention to JIS X 0213, during the discussion
> of general purpose encoding scheme of UTF-8?

It was an aside. (My academic background is in computer science;
Japanese NLP is a diversion which I have followed in my retirement.)

The original question was about source code for UTF-8, and since the OP
mentioned using Debian Linux, I wanted to point out that source code is
available for converting other character codes to UTF-8. I tossed in a
representation of the conversion of 16-bit Unicode codepoints into
3-byte UTF-8 sequences. (All the characters in JIS X 0208 and
JIS X 0212 were incorporated in the initial Unicode version.) Markus
Scherer added the representation of 21-bit codepoints in UTF-8, so I
pointed out that relatively few kanji in the JIS standards have 21-bit
codepoints.
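
For anyone who wants to see the two cases side by side, here is a
minimal sketch of the bit layout in Python. The function name
utf8_encode is made up for illustration, and this is not the code from
any Debian package; in practice Python's built-in str.encode("utf-8")
does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (illustrative helper)."""
    # Reject values outside Unicode and the UTF-16 surrogate range.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # up to 7 bits: 1 byte
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 3 bytes (most JIS kanji)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 4 bytes (codepoints above U+FFFF)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

With this sketch, utf8_encode(0x6F22) (漢) gives the three bytes
E6 BC A2, while utf8_encode(0x20B9F) (𠮟, a kanji above U+FFFF) gives
the four bytes F0 A0 AE 9F.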

> In Japan, many running systems keep the restriction of JIS X 0208,
> especially in public sectors.

Interesting comment. I guess you are aware that several of the changes
and additions made in the 2010 revision of the 常用漢字 involved the use
of kanji from outside JIS X 0208. Also, government bodies such as 文化庁
have been encouraging the use of Unicode-only kanji in lists such as
the 表外漢字字体表.
[...]
> I think, the popularity of "21-bit Unicode codepoint" in Japanese text is
> highly dependent with the category of the text.

Absolutely.  Despite some misguided grumbling in Japan about Unicode
in its early days, it's what virtually everyone uses now, and no-one
is really aware whether the codepoints are 16 or 21 bits.

Cheers

Jim

-- 
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University
http://www.jimbreen.org/
http://nihongo.monash.edu/