Combining Characters

Alex Shpilkin ashpilkin at gmail.com
Fri Dec 19 15:02:55 CST 2025


On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode 
<unicode at corp.unicode.org> wrote:
> I do wish the documents on migration[1] had explicitly explained that
> these new characters have ccc=0 conjoiners. It may imply it when
> discussing them, and maybe I'm still a bit green on the details to put
> 2 and 2 together, but it would have saved me some time.

No objection here despite the foregoing.

> On the topic I did find the suggested resolution of using the
> quickcheck value a bit strange; as far as I know use of quickcheck
> was not strictly required for normalization prior to this update. Or
> well, my v15 implementation did not use it and passed all the
> normalization tests.

I haven’t gotten to implementing canonical composition yet, nor have 
I looked at any other implementation including yours, but AFAICT the QC 
properties aren’t required now either. Looking at Section 3.11, 
Normalization Forms, in Unicode 13, which predates this change, the 
recomposition algorithm that suggests itself is:

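# string is assumed to be canonically decomposed and ordered already
starter = 0  # sentinel not part of any compositions
starter index = uninitialized
last ccc = 0  # ccc of the last character kept (not composed away)

index = 0
while index < length of string:
    # a kept mark with an equal or higher ccc blocks this character
    blocked = last ccc != 0 and last ccc >= ccc[string[index]]
    composition = try to compose (starter, string[index]) unless blocked
    if succeeded:
        assert ccc[composition] == 0
        string[starter index] = composition
        delete string[index]
    else:
        if ccc[string[index]] == 0:  # NB only this late
            starter = string[index]
            starter index = index
        last ccc = ccc[string[index]]
        index = index + 1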

If you check conditions in this order, then the handling of 
starter+starter compositions falls out naturally. (Also note that the 
composition table only needs to contain pairs of an NFC-form starter 
and an NFD character, and there are possible optimizations connected to 
the fact that, if the next character after a successful composition is 
a nonstarter too, then the first character in the next lookup will be 
the result of this one.)
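
For anyone who wants to experiment, here is a rough Python sketch of the 
loop above. It makes the standard's blocking condition explicit (a 
character is not tried against the starter when the previously kept 
character has a nonzero ccc that is equal or higher). The names 
build_pairs, compose and PAIRS are made up for illustration. The pair 
table is derived from Python's unicodedata module instead of being 
generated from the UCD files, and Hangul is left out because, as far as 
I can tell, unicodedata.decomposition() returns nothing for the 
syllables. Treat it as a sketch under those assumptions, not as a 
reference implementation.

```python
import unicodedata


def build_pairs():
    """Map (first, second) -> primary composite (hypothetical helper)."""
    pairs = {}
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        first, second = (chr(int(p, 16)) for p in parts)
        # Assumption: lean on the library's own NFC to filter out
        # composition exclusions instead of reading the exclusion list.
        if unicodedata.normalize("NFC", first + second) == ch:
            pairs[first, second] = ch
    return pairs


def compose(nfd, pairs):
    """Canonically compose a string that is already in NFD (no Hangul)."""
    out = []
    starter_idx = None  # position in out of the most recent starter
    last_ccc = 0        # ccc of the last character actually kept
    for ch in nfd:
        ccc = unicodedata.combining(ch)
        if starter_idx is not None:
            blocked = last_ccc != 0 and last_ccc >= ccc
            composite = None if blocked else pairs.get((out[starter_idx], ch))
            if composite is not None:
                out[starter_idx] = composite  # ch is absorbed, not kept
                continue
        if ccc == 0:  # only becomes the new starter after failing to compose
            starter_idx = len(out)
        last_ccc = ccc
        out.append(ch)
    return "".join(out)


PAIRS = build_pairs()
print(compose("e\u0301", PAIRS) == "\u00e9")
```

Building the table by scanning every code point is slow but keeps the 
sketch short; a real implementation would generate it offline.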

Trying to merge de- and recomposition into a single streaming process 
(e.g. with limits on the length of a composing character sequence to 
avoid worst-case linear memory consumption) will of course make things 
much more difficult.

> I guess as an upside I found that with these changes and the
> inclusion of quickcheck, Hangul no longer needed to be special-cased.

I don’t believe you ever actually *have* to special-case Hangul after 
you’ve generated your tables, it’s just that if you are trying to 
keep your table size down (as I am) then doing so will give you 
something like 2x savings.
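
To make the trade-off concrete, here is a minimal Python sketch of the 
arithmetic Hangul composition from Section 3.12 (Conjoining Jamo 
Behavior) of the standard. The constants are the ones given there; the 
function name compose_hangul and its pair-lookup shape are my own 
choices for illustration. It stands in for what would otherwise be 
11,172 table entries (399 L+V pairs plus 10,773 LV+T pairs).

```python
# Constants from the standard's Hangul syllable composition arithmetic.
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
LCOUNT, VCOUNT, TCOUNT = 19, 21, 28
SCOUNT = LCOUNT * VCOUNT * TCOUNT  # 11172 precomposed syllables


def compose_hangul(first, second):
    """Return the composite of (first, second), or None (hypothetical helper)."""
    a, b = ord(first), ord(second)
    # Leading consonant + vowel -> LV syllable
    if LBASE <= a < LBASE + LCOUNT and VBASE <= b < VBASE + VCOUNT:
        return chr(SBASE + ((a - LBASE) * VCOUNT + (b - VBASE)) * TCOUNT)
    # LV syllable + trailing consonant -> LVT syllable
    s = a - SBASE
    if 0 <= s < SCOUNT and s % TCOUNT == 0 and TBASE < b < TBASE + TCOUNT:
        return chr(a + (b - TBASE))
    return None


lv = compose_hangul("\u1112", "\u1161")
print(compose_hangul(lv, "\u11ab") == "\ud55c")
```

A function like this can be tried before (or instead of) the table 
lookup in the composition step, since both jamo and syllables have 
ccc=0 and so go through the starter+starter path.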

-- 
HTH,
Alex




