Correct way to encode mixed width text in Unicode?

Erik Carvalhal Miller ecm.unicode at gmail.com
Wed Mar 13 16:07:12 CDT 2024


Itʼs planned.  See <https://www.unicode.org/L2/L2023/23231.htm#177-C36>.

On Wed, Mar 13, 2024 at 1:50 PM Hu Jialun via Unicode <unicode at corp.unicode.org> wrote:

> From what I read [^1], the fullwidth glyphs in Unicode are provided
> solely for backward compatibility and lossless roundtrip with legacy
> standards such as Shift-JIS. The rationale [^2] seems to be that Unicode
> views it as a presentational issue that is better dealt with by the
> renderer based on linguistic context, and use of such characters is
> generally discouraged. In some cases no compatibility character is
> provided at all, for example for fullwidth left/right single/double
> quotation marks, because no legacy encoding contains both full- and
> half-width forms of them, and Unicode explicitly rejects adding any
> more such characters.
>
> Unicode recommends in the same document,
>
>      Ambiguous quotation marks are generally resolved to wide when they
>      enclose and are adjacent to a wide character, and to narrow
>      otherwise.
>
> However, there are cases where the width is tricky to resolve, and
> current fonts and renderer implementations sometimes produce incorrect
> results:
>
>      他们一致认为,目前最大的敌人无疑是“N问题”,即Nostalgia,思乡病。
>
>      “Make a wish! Make a wish!”琳琳和盼盼喊。
>
>      The term “char kway teow” is a transliteration of the Chinese
>      characters “炒粿條”.
>
>      教授昨天讲了:“Hamlet的原文其实是Polonius (II.ii.) ‘Though this be
>      madness, yet there is method in‘t.’“。
>
>      在大韩民国,这个语言的名称是“한국어/韓國語”。在中国大陆、香港、澳门的名称是
>      “韩语”或“朝鲜语”。台湾则通称为“韩语”。
>
> It seems that the recommended algorithm fails in such cases (the
> quotes are rendered inconsistently, e.g. with a fullwidth left quote
> and a halfwidth right quote), and such cases may simply be too complex
> for an algorithm to render correctly without intricate and fragile
> language-specific rulesets.
>
> This issue mainly affects Simplified Chinese rather than other East
> Asian languages, because Traditional Chinese, Japanese and vertically
> written Korean commonly use the U+300C-300F CORNER BRACKET family
> (East_Asian_Width=Wide) instead.
>
> My question is thus: is there a common way to provide a hint in
> plaintext for the width of an ambiguous-width character, perhaps a
> Unicode variation selector or something like RLM?
>
> [^1]: https://harjit.moe/hwfwblame.html
> [^2]: https://www.unicode.org/reports/tr11/tr11-41.html#Relation
> Originally asked at:
> <https://superuser.com/questions/1828050/correct-way-to-encode-mixed-width-text-in-unicode>
>
> ~hujialun
>
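
The quoted heuristic ("wide when they enclose and are adjacent to a wide
character, and narrow otherwise") can be sketched roughly as follows. This
is only an illustration under stated assumptions, not what any particular
renderer does: "enclose and adjacent" is read here as "the neighbouring
character on the inside of the quotation is Wide or Fullwidth", and the
helper names is_wide and resolve_quote_widths are made up for the example.
Only Python's standard unicodedata.east_asian_width lookup is relied on.

```python
import unicodedata

# Curly quotation marks; their East_Asian_Width property is "A" (Ambiguous).
OPENING = {"\u2018", "\u201C"}
CLOSING = {"\u2019", "\u201D"}


def is_wide(ch):
    """True if the character's East_Asian_Width is Wide or Fullwidth."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def resolve_quote_widths(text):
    """Return (index, quote, 'wide' or 'narrow') for each curly quote.

    Assumption: a quote resolves to wide only when the character it
    encloses and touches (next char for an opening quote, previous char
    for a closing quote) is wide; otherwise it resolves to narrow.
    """
    resolved = []
    for i, ch in enumerate(text):
        if ch in OPENING:
            inner = text[i + 1] if i + 1 < len(text) else ""
        elif ch in CLOSING:
            inner = text[i - 1] if i > 0 else ""
        else:
            continue
        width = "wide" if inner and is_wide(inner) else "narrow"
        resolved.append((i, ch, width))
    return resolved


# The quotation from the first example in the quoted message.
for index, quote, width in resolve_quote_widths("“N问题”"):
    print(index, quote, width)
```

Under this reading the opening quote before "N" resolves to narrow and the
closing quote after "题" resolves to wide, which is the kind of mismatched
pair the quoted message describes. For comparison,
unicodedata.east_asian_width("\u300C") returns "W" for LEFT CORNER BRACKET,
so the corner brackets never enter the ambiguous branch at all.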