UCA unnecessary collation weight 0000

Philippe Verdy via Unicode unicode at unicode.org
Thu Nov 1 15:10:16 CDT 2018


For example, Figure 3 in the UTR#10 contains:

Figure 3. Comparison of Sort Keys
<http://unicode.org/reports/tr10/#Comparison_Of_Sort_Keys_Table>
 StringSort Key
1 cab *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0002* 0002 0002
2 Cab *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0008* 0002 0002
3 cáb *0706* 06D9 06EE *0000* 0020 0020 *0021* *0000* 0002 0002 0002 0002
4 dab *0712* 06D9 06EE *0000* 0020 0020 0020 *0000* 0002 0002 0002


The 0000 weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".
What is really important is to understand where the input text (after
initial transforms like reodering and expansion) is broken at specific
boundaries between collatable elements.
But the boundaries of weights indicated each part of the sort key can
always be infered for example between 06EE and 0020, or between 0020 and
0002.
So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys
<http://unicode.org/reports/tr10/#Comparison_Of_Sort_Keys_Table>

 StringSort Key
1 cab *0706* 06D9 06EE 0020 0020 *0020* *0002* 0002 0002
2 Cab *0706* 06D9 06EE 0020 0020 *0020* *0008* 0002 0002
3 cáb *0706* 06D9 06EE 0020 0020 *0021* 0020 0002 0002 0002 0002
4 dab *0712* 06D9 06EE 0020 0020 0020 0002 0002 0002
As well (emphasized by black blackground above),
* when the secondary weights in the sort key are terminated by any sequence
of 0020 (the minimal secondary weight), you can suppress them from the
collation key.
* when the tertiary weights are in the sort key are terminated by any
sequence of 0002 (the minimal tertiary weight), you can suppress them from
the collation key.
This gives:

Figure 3. Comparison of Sort Keys
<http://unicode.org/reports/tr10/#Comparison_Of_Sort_Keys_Table>
 StringSort Key
1 cab *0706* 06D9 06EE
2 Cab *0706* 06D9 06EE *0008*
3 cáb *0706* 06D9 06EE 0020 0020 *0021*
4 dab *0712* 06D9 06EE
See the reduction !

Le jeu. 1 nov. 2018 à 18:39, Philippe Verdy <verdy_p at wanadoo.fr> a écrit :

> I just remarked that there's absolutely NO utility of the collation weight
> 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collection element :
>   [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
>   [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER then any primary weight (for conformance reason):
>  any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into disjoint
> intervals in the same order, each interval containing all its weights, so
> weights are sorted by decreasing level, then increasing weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collection elements, as we can easily detect that
> [.0021.0002] given above starts by a secondary weight 0021 and is not a
> primary weight.
>
> As well we don't need to use any level separator 0000 in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitions (e.g. with a function like "string:gsub(map)" in Lua which uses
> a "map" to map source (binary) strings or regexps,into target (binary)
> strings:
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapSecondary)"  will substitute any collection
> elements by their "intermediary" collation elements+tertiary weight.
>
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight
>
> The "intermediary" collection elements are just like source text, except
> that higher level differences are eliminated, i.e.all source collation
> element string are replaced by the collection element string that have the
> smallest collation element weights. They must be just encoded so that they
> are HIGHER than any higher level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes! not just .0001) and .001E
> for the last (tertiary) weight, make sure that all other intermediary
> collation elements will use only code units higher than .0020 (this means
> that they can remain encoded in their existing UTF form!)
> - reserve the weight .001F for the case where you don't want to use
> secondary differences (like letter case) and them to tertiary differences.
>
> This will be used in the second mapping to decompose source collection
> elements into "intermediary collation elements" + tertiary weight. you may
> then decide to leave tertiary weights in the substitute string, or because
> the "gsub()" finds match from left to right, to accumulate the tertiary
> weights into a separate buffer, so that the subtitution itself will still
> return a valid UTF string, containing only "intermediary collation
> elements" (with all tertiary differences erased).
>
> You can repeat the process with the next gsub() to return the primary
> collation elements" (still in UTF form), and separately the secondary
> weights (also accumulable in a separate buffer).
>
> Now there remains only 3 strings:
> - one contains only the primary collection elements (still in UTF-form,
> but using code units always higher than or equal to 0020)
> - another one contains only secondary weights (between MINSECONDARYWEIGHT
> and 001F)
> - another one contains only tertiary weights. (between 0000 and
> MINSECONDARYWEIGHT-1)
>
> For the rest I will assume that MINSECONDARYWEIGHT is 0010, so
> * primary weights are encoded with one or more code units in [0020..]
> (multiple code units are possible if you reserve some of these code units
> to be prefixes or longer sequences)
> * secondary weights are encoded with one or more code units in
> [0010..001E] (same remark about multiple code units if you need them)
> * tertiary weights are encoded  with one or more code units
> in  [0010..001F] (same remark about multiple code units if you need them)
>
> The last gsub() will only reorder the primary collection elements to remap
> them in a suitable binary order (it will be a simple bijective permutation,
> except that the target does not have to use multiple code units, but a
> single one, when there are contractions). It's always possible to make this
> permutation generate integers higher than 0020. The resulting weights can
> remain encodable with UTF-8 as if it was source text.
>
> And to return the sort key, all you need is to concatenate
> * the string containing all primary weights encoded with code units in
> [0020..], then
> * the string containing secondary weights encoded with code units in
> [0010..001E], then
> * the string containing tertiary weights encoded with code units in
> [0000..001F].
> * you don't need to insert ANY [0000] as a level separator in the final
> sort key, because each concatenated part in the final sort key respect the
> wellformedness constraint WF2 of the UCA algorithm.
>
> You may choose to not use tertiary weights encoded with [0000] code units,
> if you want the final string containing the sort key to be null-terminated.
>
> In summary:
> * there's no longer any special role given in UCA for [0000]. More
> compaction possible for storing the mapping of source collation element
> strings (in their original UTF encoding) to strings of collation weights
> (themselves still encodage with an UTF!).
> * Any tailored collation (except those requiring preprocessing that may
> apply specific reorderings, possibly made by using subtitution with one or
> more regexps to apply, repeated in a defined order) is just specified by
> one map per collation level, containing source UTF strings (or regexps) to
> replace by their mapped string of collation weights.
> * You are free to choose the UTF to use for the source string or for the
> collation weight (these UTF may be different or may be both UTF-8. If you
> use a conforming UTF, the only code units you cannot use are those in
> [D800..DFFF], reserved for surrogates.
> * Normal string library packages can be used to implement UCA, even those
> that can only work with texts encoded with a valid UTF.
> * Given that the resulting sort keys are valid UTF, they are displayable:
> in many circonstances, the initial part of the string (containing primary
> weights only) will display the normal UTF encoding of readable text; if
> there are additional secondary or tertiary weights after it, because they
> are represented using C0 controls, you may still display them using a
> notation like \xNN (you only need to escape '\' if it is present as a
> litteral in the readable part of the sort key containing primary weights).
>
> Note: Isolated surrogates found in a non-conforming source string need to
> be preprocessed if you want to accept them in a collator:
> - You can do that by preprocessing [0000] or [D800..DFFF], into [0000]
> followed by only one codeunits in [0020..], so they form a single collation
> element [0000][0020..]; use [0000][0020] as the collation element
> representing the source [0000] and just insert a single [0000] before any
> isolated surrogate you'll replace by a code unit in [0800..0FFF]. The
> result will be a conforming UTF string on which your collator will return
> valid UTF strings of weights.
> - If you don't want to have any [0000] within sort keys, you can also
> preprocess the source string by reencoding [0000] into [0001][0020], and
> [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into
> [0001][0800..0FFF]. Here also the result will be a conforming UTF string on
> which your collator will return valid UTF strings of weights.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181101/7aa29d15/attachment.html>


More information about the Unicode mailing list