New Swift API for Unicode Normalisation - feedback wanted about stabilised strings
Asmus Freytag
asmusf at ix.netcom.com
Sat Jan 25 12:47:00 CST 2025
On 1/25/2025 9:25 AM, Karl Wagner via Unicode wrote:
> ### "Is `x` Normalized?"
>
>
> It's helpful to start by considering what it means when we say a
> string "is normalised". It's very simple; literally all it means is
> that normalising the string returns the same string.
>
> ```
> isNormalized(x):
>     normalize(x) == x
> ```
>
> For me, it was a bit of a revelation to grasp that in general, the
> result of `isNormalized` is _only locally meaningful_. Asking the same
> question, at another point in space or in time, may yield a different
> result:
>
> - Two machines communicating over a network may disagree about whether
> x is normalised.
> - The same machine may think x is normalised one day, then after an OS
> update, suddenly think the same x is not normalised.
That is exactly how it should not be. The stability guarantee is:
"once a string is normalized, it remains normalized".
The corollary is that you cannot normalize a "string" that contains
code points that are unassigned in the version of Unicode that you know
about.
So your two systems must agree on isNormalized if the string can be
normalized on both of them.
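
A minimal sketch of that precondition, written in Python only because its standard `unicodedata` module makes the check short; it is not the proposed Swift API. The helper names `has_unassigned` and `is_stably_normalized` are made up for illustration. The assumption is that "unassigned for the version you know about" can be approximated by General_Category `Cn`, which also covers noncharacters.

```python
import unicodedata

def has_unassigned(s: str) -> bool:
    """True if any code point is General_Category Cn in this runtime's Unicode data.

    Assumption: Cn (unassigned, including noncharacters) stands in for
    "not assigned in the Unicode version this system knows about".
    """
    return any(unicodedata.category(ch) == "Cn" for ch in s)

def is_stably_normalized(s: str, form: str = "NFC") -> bool:
    """True only if s is normalized and contains nothing this system cannot classify."""
    return not has_unassigned(s) and unicodedata.is_normalized(form, s)

# The Unicode version this runtime was built against.
print(unicodedata.unidata_version)

print(is_stably_normalized("caf\u00e9"))   # precomposed e-acute
print(is_stably_normalized("cafe\u0301"))  # e followed by combining acute
print(is_stably_normalized("abc\u0378"))   # U+0378 is unassigned at the time of writing
```

A "yes" from `is_stably_normalized` is an answer that a system with a later Unicode version should agree with; a "no" caused by an unassigned code point just means this system cannot say.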
>
>
> ### "Are `x` and `y` Equivalent?"
>
>
> Normalisation is how we define equivalence. Two strings, x and y, are
> equivalent if normalising each of them produces the same result:
>
> ```
> areEquivalent(x, y):
>     normalize(x) == normalize(y)
> ```
>
> And so following from the previous section, when we deal in pairs (or
> larger collections) of strings, it follows that:
>
> - Two machines communicating over a network may disagree about whether
> x and y are equivalent or distinct.
> - The same machine may think x and y are distinct one day, then after
> an OS update, suddenly think that the same x and y are equivalent.
>
> This has some interesting implications. For instance:
>
> - If you encode a `Set<String>` in a JSON file, when you (or another
> machine) decode it later, the resulting Set's `count` may be less
> than what it was when it was encoded.
> - And if you associate values with those strings, such as in a
> `Dictionary<String, SomeValue>`, some values may be discarded because
> we would think they have duplicate keys.
> - If you serialise a sorted list of strings, they may not be
> considered sorted when you (or another machine) load them. Sorting
> involves normalisation, since equivalent strings sort identically.
Other than code point order, two systems cannot apply any common sort to
a list of strings where some of the strings contain characters that are
unassigned for at least one of the systems.
That restriction also applies to any linguistic sorting.
The (overwhelming) majority of data will be from the subset that both
systems know about. You might mediate the interaction by putting a
Unicode version number on your list.
If you are then asked to process a list that requires knowing
as-yet-undefined equivalences, you would be able to flag that as an error.
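
One way the version-number idea might look, again sketched in Python for illustration only. The JSON layout, the `unicode_version` field and the function names `encode_list` and `decode_list` are all hypothetical, and General_Category `Cn` is again assumed as the test for "unassigned on this system".

```python
import json
import unicodedata

def encode_list(strings):
    """Serialise strings together with the producer's Unicode version.

    The strings are stored in code point order, which is what Python's
    default str comparison gives and which every system can reproduce.
    """
    return json.dumps({
        "unicode_version": unicodedata.unidata_version,  # hypothetical field name
        "strings": sorted(strings),
    })

def decode_list(payload):
    """Return the strings, or raise if any uses a code point unknown locally."""
    doc = json.loads(payload)
    for s in doc["strings"]:
        unknown = [f"U+{ord(ch):04X}" for ch in s
                   if unicodedata.category(ch) == "Cn"]
        if unknown:
            raise ValueError(
                f"code points {unknown} are unassigned in local Unicode "
                f"{unicodedata.unidata_version}; producer used "
                f"{doc['unicode_version']}"
            )
    return doc["strings"]

payload = encode_list(["b", "caf\u00e9", "a"])
print(decode_list(payload))
```

The receiver refuses the list rather than guessing at equivalences or an ordering it has no data for, and the recorded version tells it which update would let it proceed.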
A./
PS: there are lots of things you shouldn't do with unassigned code
points, as you cannot produce results that are "correct".