Correct way to encode mixed width text in Unicode?

Hu Jialun hujialun at comp.nus.edu.sg
Wed Mar 13 12:29:04 CDT 2024


From what I read [^1], the fullwidth characters in Unicode are provided
solely for backward compatibility and lossless round-tripping with
legacy standards such as Shift-JIS. The rationale [^2] seems to be that
Unicode views width as a presentational issue that is better dealt with
by the renderer based on linguistic context, and use of such characters
is generally discouraged. In some cases no compatibility character is
provided at all, for example for fullwidth left/right single/double
quotation marks, because no legacy encoding contains both full- and
half-width forms, and Unicode explicitly rules out encoding any more of
them.

In the same document, Unicode recommends:

     Ambiguous quotation marks are generally resolved to wide when they
     enclose and are adjacent to a wide character, and to narrow
     otherwise.

However, there are cases where the width is tricky to resolve, and
current fonts and renderer implementations sometimes get them wrong:

     他们一致认为,目前最大的敌人无疑是“N问题”,即Nostalgia,思乡病。

     “Make a wish! Make a wish!”琳琳和盼盼喊。

     The term “char kway teow” is a transliteration of the Chinese
     characters “炒粿條”.

     教授昨天讲了:“Hamlet的原文其实是Polonius (II.ii.) ‘Though this be
     madness, yet there is method in’t.’”。

     在大韩民国,这个语言的名称是“한국어/韓國語”。在中国大陆、香港、澳门的名称是
     “韩语”或“朝鲜语”。台湾则通称为“韩语”。

The recommended algorithm seems to fail in such cases (the text is
rendered inconsistently, e.g. with a fullwidth left quote and a
halfwidth right quote), and they may simply be too complex for an
algorithm to resolve without intricate, fragile, language-specific
rulesets.
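
To make the mismatch concrete, here is a minimal sketch of one literal
reading of the recommendation quoted above. It is an assumption on my
part, not what any particular renderer does: an opening quote looks only
at the character right after it, a closing quote looks only at the
character right before it, and "wide" means East_Asian_Width W or F. The
function name resolve_quote_widths is hypothetical.

```python
import unicodedata

OPENING = "\u2018\u201c"  # LEFT SINGLE / LEFT DOUBLE QUOTATION MARK
CLOSING = "\u2019\u201d"  # RIGHT SINGLE / RIGHT DOUBLE QUOTATION MARK


def is_wide(ch):
    """True if the character's East_Asian_Width is Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def resolve_quote_widths(text):
    """Naive reading of the heuristic (an assumption, see above).

    Each quotation mark is resolved from the single enclosed character
    next to it: the following one for an opening quote, the preceding
    one for a closing quote. Returns (index, mark, width) tuples.
    """
    results = []
    for i, ch in enumerate(text):
        if ch in OPENING:
            neighbour = text[i + 1] if i + 1 < len(text) else ""
        elif ch in CLOSING:
            neighbour = text[i - 1] if i > 0 else ""
        else:
            continue
        width = "wide" if neighbour and is_wide(neighbour) else "narrow"
        results.append((i, ch, width))
    return results


sample = "目前最大的敌人无疑是\u201cN问题\u201d"
print([width for _, _, width in resolve_quote_widths(sample)])
# -> ['narrow', 'wide']
```

Under this reading the opening quote is resolved from the Latin "N" and
the closing quote from "题", so the pair comes out with mismatched
widths. A fix would have to look at the quotation as a whole, or at the
language of the surrounding text, which is where the rules get fragile.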

This issue mainly affects Simplified Chinese rather than other East
Asian languages, because Traditional Chinese, Japanese and vertically
written Korean commonly use the corner brackets U+300C..U+300F
(East_Asian_Width=Wide) instead.
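
The property values behind this can be checked with Python's standard
unicodedata module. This is only a quick sketch: the helper name
width_classes is hypothetical, and the values reported come from the
Unicode version bundled with the interpreter.

```python
import unicodedata

# The four curly quotation marks, then the four corner brackets.
MARKS = [
    "\u2018", "\u2019", "\u201c", "\u201d",
    "\u300c", "\u300d", "\u300e", "\u300f",
]


def width_classes(chars):
    """Map each character's code point label to its East_Asian_Width."""
    return {
        f"U+{ord(c):04X}": unicodedata.east_asian_width(c) for c in chars
    }


classes = width_classes(MARKS)
for code_point, eaw in classes.items():
    print(code_point, eaw)
```

The quotation marks are reported as A (Ambiguous) and the corner
brackets as W (Wide), which is why only the former depend on the
renderer's guess.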

My question is thus: is there a common way to hint, in plain text, at
the width of an ambiguous-width character, perhaps with a Unicode
variation selector or a format character like RLM?

[^1]: https://harjit.moe/hwfwblame.html
[^2]: https://www.unicode.org/reports/tr11/tr11-41.html#Relation

Originally asked at:
<https://superuser.com/questions/1828050/correct-way-to-encode-mixed-width-text-in-unicode>

~hujialun

