UCA unnecessary collation weight 0000
Philippe Verdy via Unicode
unicode at unicode.org
Thu Nov 1 12:39:16 CDT 2018
I just remarked that there's absolutely NO utility of the collation weight
0000 anywhere in the algorithm.
For example in UTR #10, section 3.3.1 gives a collection element :
[.0000.0021.0002]
for COMBINING GRAVE ACCENT. However it can also be simply:
[.0021.0002]
for a simple reason: the secondary or tertiary weights are necessarily
LOWER then any primary weight (for conformance reason):
any tertiary weight < any secondary weight < any primary weight
(the set of all weights for all levels is fully partitioned into disjoint
intervals in the same order, each interval containing all its weights, so
weights are sorted by decreasing level, then increasing weight in all cases)
This also means that we never need to handle 0000 weights when creating
sort keys from multiple collection elements, as we can easily detect that
[.0021.0002] given above starts by a secondary weight 0021 and is not a
primary weight.
As well we don't need to use any level separator 0000 in the sort key.
This allows more interesting optimizations, and reduction of length for
sort keys.
What this means is that we can safely implement UCA using basic substitions
(e.g. with a function like "string:gsub(map)" in Lua which uses a "map" to
map source (binary) strings or regexps,into target (binary) strings:
For a level-3 collation, you just then need only 3 calls to "string:gsub()"
to compute any collation:
- the first ":gsub(mapNormalize)" can decompose a source text into
collation elements and can perform reordering to enforce a normalized order
(possibly tuned for the tailored locale) using basic regexps.
- the second ":gsub(mapSecondary)" will substitute any collection elements
by their "intermediary" collation elements+tertiary weight.
- the third ":gsub(mapSecondary)" will substitute any "intermediary"
collation element by their primary weight + secondary weight
The "intermediary" collection elements are just like source text, except
that higher level differences are eliminated, i.e.all source collation
element string are replaced by the collection element string that have the
smallest collation element weights. They must be just encoded so that they
are HIGHER than any higher level weights.
How to do that:
- reserve the weight range between .0000 (yes! not just .0001) and .001E
for the last (tertiary) weight, make sure that all other intermediary
collation elements will use only code units higher than .0020 (this means
that they can remain encoded in their existing UTF form!)
- reserve the weight .001F for the case where you don't want to use
secondary differences (like letter case) and them to tertiary differences.
This will be used in the second mapping to decompose source collection
elements into "intermediary collation elements" + tertiary weight. you may
then decide to leave tertiary weights in the substitute string, or because
the "gsub()" finds match from left to right, to accumulate the tertiary
weights into a separate buffer, so that the subtitution itself will still
return a valid UTF string, containing only "intermediary collation
elements" (with all tertiary differences erased).
You can repeat the process with the next gsub() to return the primary
collation elements" (still in UTF form), and separately the secondary
weights (also accumulable in a separate buffer).
Now there remains only 3 strings:
- one contains only the primary collection elements (still in UTF-form, but
using code units always higher than or equal to 0020)
- another one contains only secondary weights (between MINSECONDARYWEIGHT
and 001F)
- another one contains only tertiary weights. (between 0000 and
MINSECONDARYWEIGHT-1)
For the rest I will assume that MINSECONDARYWEIGHT is 0010, so
* primary weights are encoded with one or more code units in [0020..]
(multiple code units are possible if you reserve some of these code units
to be prefixes or longer sequences)
* secondary weights are encoded with one or more code units in [0010..001E]
(same remark about multiple code units if you need them)
* tertiary weights are encoded with one or more code units
in [0010..001F] (same remark about multiple code units if you need them)
The last gsub() will only reorder the primary collection elements to remap
them in a suitable binary order (it will be a simple bijective permutation,
except that the target does not have to use multiple code units, but a
single one, when there are contractions). It's always possible to make this
permutation generate integers higher than 0020. The resulting weights can
remain encodable with UTF-8 as if it was source text.
And to return the sort key, all you need is to concatenate
* the string containing all primary weights encoded with code units in
[0020..], then
* the string containing secondary weights encoded with code units in
[0010..001E], then
* the string containing tertiary weights encoded with code units in
[0000..001F].
* you don't need to insert ANY [0000] as a level separator in the final
sort key, because each concatenated part in the final sort key respect the
wellformedness constraint WF2 of the UCA algorithm.
You may choose to not use tertiary weights encoded with [0000] code units,
if you want the final string containing the sort key to be null-terminated.
In summary:
* there's no longer any special role given in UCA for [0000]. More
compaction possible for storing the mapping of source collation element
strings (in their original UTF encoding) to strings of collation weights
(themselves still encodage with an UTF!).
* Any tailored collation (except those requiring preprocessing that may
apply specific reorderings, possibly made by using subtitution with one or
more regexps to apply, repeated in a defined order) is just specified by
one map per collation level, containing source UTF strings (or regexps) to
replace by their mapped string of collation weights.
* You are free to choose the UTF to use for the source string or for the
collation weight (these UTF may be different or may be both UTF-8. If you
use a conforming UTF, the only code units you cannot use are those in
[D800..DFFF], reserved for surrogates.
* Normal string library packages can be used to implement UCA, even those
that can only work with texts encoded with a valid UTF.
* Given that the resulting sort keys are valid UTF, they are displayable:
in many circonstances, the initial part of the string (containing primary
weights only) will display the normal UTF encoding of readable text; if
there are additional secondary or tertiary weights after it, because they
are represented using C0 controls, you may still display them using a
notation like \xNN (you only need to escape '\' if it is present as a
litteral in the readable part of the sort key containing primary weights).
Note: Isolated surrogates found in a non-conforming source string need to
be preprocessed if you want to accept them in a collator:
- You can do that by preprocessing [0000] or [D800..DFFF], into [0000]
followed by only one codeunits in [0020..], so they form a single collation
element [0000][0020..]; use [0000][0020] as the collation element
representing the source [0000] and just insert a single [0000] before any
isolated surrogate you'll replace by a code unit in [0800..0FFF]. The
result will be a conforming UTF string on which your collator will return
valid UTF strings of weights.
- If you don't want to have any [0000] within sort keys, you can also
preprocess the source string by reencoding [0000] into [0001][0020], and
[0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into
[0001][0800..0FFF]. Here also the result will be a conforming UTF string on
which your collator will return valid UTF strings of weights.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181101/223e8734/attachment.html>
More information about the Unicode
mailing list