Additional decompositions in decomps.txt

Ken Whistler kenwhistler at att.net
Mon Feb 22 12:10:35 CST 2016


Eli,

You're not missing anything. This is a bug in the documentation of
decomps.txt. Initially, all decompositions added for the DUCET default
weights were tagged as <sort>. That tag results in a distinct *tertiary*
weight in the initial collation weight values in DUCET. Later on,
cases turned up where an added decomposition for the DUCET
input worked better *without* a distinct tertiary weight. In
particular, this applies to the large collection of combining marks
whose secondary weights are now collapsed into a smaller set of
distinct values. It also applies to the o with stroke character you
cite below. The documentation for decomps.txt just needs to be
updated to reflect that new pattern.
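
As a rough illustration of the pattern described above, here is a minimal
Python sketch that splits a decomps.txt entry into its fields and reports
whether it carries the <sort> tag. It assumes the three-field,
semicolon-separated layout seen in the two entries quoted below
(code point; tag; decomposition, then an optional "#" comment). The names
Decomp, parse_decomp_line, and is_sort_tagged are hypothetical helpers for
this sketch and are not part of any Unicode tooling.

```python
from typing import NamedTuple, Optional, Tuple


class Decomp(NamedTuple):
    code_point: int            # character being decomposed
    tag: str                   # e.g. "<sort>", or "" when the tag field is empty
    mapping: Tuple[int, ...]   # code points of the decomposition


def parse_decomp_line(line: str) -> Optional[Decomp]:
    """Parse one entry, assuming 'CODE;TAG;MAPPING # comment' (an assumption)."""
    body = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not body:
        return None                        # blank or comment-only line
    fields = [f.strip() for f in body.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    code, tag, mapping = fields
    return Decomp(int(code, 16), tag, tuple(int(cp, 16) for cp in mapping.split()))


def is_sort_tagged(entry: Decomp) -> bool:
    """True if the entry is explicitly tagged <sort>."""
    return entry.tag == "<sort>"


# The entry for U+00F8 as quoted in this thread: the tag field is empty.
o_stroke = parse_decomp_line(
    "00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE => "
    "LATIN SMALL LETTER O + COMBINING LONG SOLIDUS OVERLAY"
)
print(o_stroke, is_sort_tagged(o_stroke))
```

Under that reading, an entry with an empty tag field and no decomposition
in UnicodeData.txt would be one of the newer, untagged additions, although
telling the cases apart for certain also requires checking UnicodeData.txt
itself, which this sketch does not do.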

--Ken

On 2/21/2016 8:32 AM, Eli Zaretskii wrote:
>    # 3. In some cases a new decomposition is added for a character which
>    #    has no decomposition mapping in UnicodeData.txt. In this third case,
>    #    a new decomposition tag "<sort>" is introduced, to distinguish these
>    #    introduced decompositions from those derived from UnicodeData.txt.
>
> However, I see in decomps.txt entries that seem to belong to none
> of the 3 classes described above.  Here are 2 notable examples:
>
>    00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE => LATIN SMALL LETTER O + COMBINING LONG SOLIDUS OVERLAY
>    0142;;006C 0335 # LATIN SMALL LETTER L WITH STROKE => LATIN SMALL LETTER L + COMBINING SHORT STROKE OVERLAY
>
> In both these cases, UnicodeData.txt defines no decomposition
> properties, but the "<sort>" tag I expected to see is absent from
> decomps.txt.  Is there something I'm missing here?
>