Combining Characters
Alex Shpilkin
ashpilkin at gmail.com
Fri Dec 19 15:02:55 CST 2025
On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode
<unicode at corp.unicode.org> wrote:
> I do wish the documents on migration[1] had explicitly explained that
> these new characters have ccc=0 conjoiners, it may imply it when
> discussing them, and maybe I'm still a bit green on the details to put
> 2 and 2 together but it would have saved me some time.
No objection here despite the foregoing.
> On the topic I did find the suggested resolution of using the
> quickcheck value a bit strange, as far as I know use of quickcheck
> was not strictly required for normalization prior to this update. Or
> well, my v15 implementation did not use it and passed all the
> normalization tests.
I haven’t gotten to implementing canonical composition yet, nor have
I looked at any other implementation including yours, but AFAICT the QC
properties aren’t required now either. Looking at section 3.11,
Normalization Forms, of Unicode 13, which predates this change, the
recomposition algorithm that suggests itself is:
    starter = 0 # sentinel not part of any compositions
    starter index = uninitialized
    index = 0
    while index < length of string:
        composition = try to compose (starter, string[index])
        if succeeded:
            assert ccc[composition] = 0
            string[starter index] = composition
            starter = composition # next lookup starts from the result
            delete string[index]
        else:
            if ccc[string[index]] = 0: # NB only this late
                starter = string[index]
                starter index = index
            index = index + 1
If you check conditions in this order, then the handling of
starter+starter compositions falls out naturally. (Also note that the
composition table only needs to contain pairs of an NFC-form starter
and an NFD character, and there are possible optimizations connected to
the fact that, if the next character after a successful composition is
a nonstarter too, then the first character in the next lookup will be
the result of this one.)
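
A runnable version of that loop might look like the Python sketch
below. It is only an illustration under stated assumptions: COMPOSE is
a toy, hand-written table (a real one would be generated from the
Unicode Character Database), recompose is a made-up name, and the input
is assumed to be already in NFD. The sketch also tracks the combining
class of the last character left in place after the starter, so that
blocked characters are skipped; that check is not in the pseudocode
above and is added here following the blocking rule in the standard's
definition of canonical composition.

```python
import unicodedata

# Toy composition table, for illustration only (assumption: a real table
# would be generated from the UCD and cover all primary composites).
COMPOSE = {
    ("a", "\u0308"): "\u00E4",       # a + diaeresis       -> ä
    ("o", "\u0302"): "\u00F4",       # o + circumflex      -> ô
    ("\u00F4", "\u0301"): "\u1ED1",  # ô + acute           -> ố
    ("\u1100", "\u1161"): "\uAC00",  # Hangul L + V        -> LV syllable
    ("\uAC00", "\u11A8"): "\uAC01",  # Hangul LV + T       -> LVT syllable
}


def recompose(text):
    """Recompose an NFD string in place, following the loop sketched above."""
    s = list(text)
    starter = None       # sentinel: not part of any composition
    starter_index = -1
    last_ccc = 0         # ccc of the last kept char after the starter, 0 if none
    i = 0
    while i < len(s):
        ch = s[i]
        ccc = unicodedata.combining(ch)
        # Blocked if a kept character since the starter has ccc >= this one.
        blocked = last_ccc != 0 and last_ccc >= ccc
        composed = None if blocked else COMPOSE.get((starter, ch))
        if composed is not None:
            s[starter_index] = composed
            starter = composed   # the next lookup starts from the result
            del s[i]             # do not advance: s[i] is now the next char
        else:
            if ccc == 0:         # only checked this late
                starter, starter_index, last_ccc = ch, i, 0
            else:
                last_ccc = ccc
            i += 1
    return "".join(s)
```

With this table, recompose("\u1100\u1161\u11A8") goes through the
starter+starter path twice and yields "\uAC01", while
recompose("\u1100\u0301\u1161") is left unchanged because the vowel is
blocked by the intervening accent.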
Trying to merge de- and recomposition into a single streaming process
(e.g. with limits on the length of a composing character sequence to
avoid worst-case linear memory consumption) will of course make things
much more difficult.
> I guess as an upside I found that with these changes and the
> inclusion of quickcheck hangul no longer needed to be special cased.
I don’t believe you ever actually *have* to special-case Hangul after
you’ve generated your tables; it’s just that if you are trying to
keep your table size down (as I am), then doing so will give you
something like a 2x saving.
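
For reference, the arithmetic that such a special case boils down to
can be sketched as follows. The constants are the ones given in the
standard's section on conjoining jamo behavior; the function name
compose_hangul and its return-None-on-failure convention are made up
for this example, and it is meant as a stand-in for the Hangul rows of
a pair table rather than a complete composer.

```python
# Constants from the Unicode Standard's Hangul syllable composition rules.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def compose_hangul(first, second):
    """Return the composed syllable for (first, second), or None."""
    a, b = ord(first), ord(second)

    # Leading consonant + vowel -> LV syllable.
    l_index, v_index = a - L_BASE, b - V_BASE
    if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
        return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT)

    # LV syllable + trailing consonant -> LVT syllable.
    s_index, t_index = a - S_BASE, b - T_BASE
    if (0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            and 0 < t_index < T_COUNT):
        return chr(a + t_index)

    return None
```

A composer that tries this function before (or instead of) a table
lookup does not need the 11172 syllable entries in the table at all.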
--
HTH,
Alex