normalization: dotless i + COMBINING ACUTE ACCENT doesn't combine to I ACUTE
Nico Schlömer
nico.schloemer at gmail.com
Tue Jun 14 14:56:36 CDT 2022
Hi everyone,
I was wondering about Unicode normalization with the dotless i/j characters.
In Python (and all other implementations I've checked), i + COMBINING
ACUTE ACCENT combines to LATIN SMALL LETTER I WITH ACUTE:
```
from unicodedata import normalize
normalize("NFC", "i\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
```
```
b'\\N{LATIN SMALL LETTER I WITH ACUTE}'
```
When doing the same with a dotless i, it does _not_ combine:
```
from unicodedata import normalize
normalize("NFC", "\N{LATIN SMALL LETTER DOTLESS I}\N{COMBINING ACUTE ACCENT}").encode("ascii", "namereplace")
```
```
b'\\N{LATIN SMALL LETTER DOTLESS I}\\N{COMBINING ACUTE ACCENT}'
```
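One way to probe this a bit further is to look at the canonical decomposition mappings that Python's `unicodedata` exposes, since NFC can only recompose a pair that some precomposed character canonically decomposes into. This is just a sketch: the helper `canonical_decomposition` is my own name, not part of `unicodedata`, and it assumes the module's data reflects the Unicode Character Database faithfully.
```
from unicodedata import decomposition, normalize


def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points.

    Hypothetical helper for illustration; returns [] if ch has no
    canonical decomposition mapping.
    """
    mapping = decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; ignore them.
    if not mapping or mapping.startswith("<"):
        return []
    return [int(cp, 16) for cp in mapping.split()]


# I WITH ACUTE decomposes to the dotted i (U+0069) plus U+0301 ...
i_acute = canonical_decomposition("\N{LATIN SMALL LETTER I WITH ACUTE}")
# ... while the dotless i has no decomposition mapping of its own.
dotless_i = canonical_decomposition("\N{LATIN SMALL LETTER DOTLESS I}")

print([f"U+{cp:04X}" for cp in i_acute])
print(dotless_i)

# NFD of the precomposed character gives back the dotted i, not the dotless one.
nfd = normalize("NFD", "\N{LATIN SMALL LETTER I WITH ACUTE}")
print(nfd == "i\N{COMBINING ACUTE ACCENT}")
```
So at least as far as the data goes, the precomposed character is tied to the dotted i only, which would explain the behaviour above but not whether it is intended.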
Is this consistent with the standard, an oversight in the standard,
or intended?
Perhaps someone here can shed some light on it. See also this
Stack Overflow question [1] and this Python bug report [2].
Cheers,
Nico
[1] https://stackoverflow.com/q/72608183/353337
[2] https://github.com/python/cpython/issues/93767