New Swift API for Unicode Normalisation - feedback wanted about stabilised strings

Sat Jan 25 13:34:56 CST 2025

On Sat, 25 Jan 2025 at 19:49, Asmus Freytag via Unicode
<unicode at corp.unicode.org> wrote:
>
> On 1/25/2025 9:25 AM, Karl Wagner via Unicode wrote:
> > ### "Is `x` Normalized?"
> >
> >
> > It's helpful to start by considering what it means when we say a
> > string "is normalised". It's very simple; literally all it means is
> > that normalising the string returns the same string.
> >
> > ```
> > isNormalized(x):
> >    normalize(x) == x
> > ```
> >
> > For me, it was a bit of a revelation to grasp that in general, the
> > result of `isNormalized` is _only locally meaningful_. Asking the same
> > question, at another point in space or in time, may yield a different
> > result:
> >
> > - Two machines communicating over a network may disagree about whether
> > x is normalised.
> > - The same machine may think x is normalised one day, then after an OS
> > update, suddenly think the same x is not normalised.
>
> That is exactly not as it should be.
>
> "once a string is normalized, it remains normalized".
>
> The corollary is that you cannot normalize a "string" that contains
> unassigned characters for the version of Unicode that you know about.
>
> So your two systems must agree on the isNormalized if the string can be
> normalized on both of them.
>

But that is how normalisation is defined in UAX15: it is a process
which cannot fail for any string, even one which contains unassigned
characters. It has a defined graceful fallback behaviour, which passes
the original characters through unchanged in that case. This has the
advantage that it will not corrupt already-normalised text, but of
course it cannot normalise previously-unnormalised text if it contains
unknown characters.
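
As a rough illustration of that fallback, here is a minimal sketch in
Python, using its `unicodedata` module as a stand-in for any
table-driven normaliser. The helper `first_unassigned` and the variable
names are hypothetical, and treating general category `Cn` as
"unassigned" is a simplification (it also covers noncharacters).

```python
import unicodedata

def first_unassigned() -> str:
    """Lowest code point with general category Cn in this Python's tables."""
    for cp in range(0x80, 0x110000):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":
            return ch
    raise RuntimeError("no unassigned code point found")

UNKNOWN = first_unassigned()

# Acute (ccc 230) followed by dot below (ccc 220): canonical ordering
# must swap the two marks, so this string is not in NFD.
known = "a\u0301\u0323"

# The same marks separated by a code point the local tables know nothing
# about. It is treated as a starter, so nothing gets reordered.
with_unknown = "a\u0301" + UNKNOWN + "\u0323"

known_unchanged = unicodedata.normalize("NFD", known) == known
unknown_unchanged = unicodedata.normalize("NFD", with_unknown) == with_unknown

print(known_unchanged)    # False: the marks were reordered
print(unknown_unchanged)  # True: passed through untouched
```

If a later Unicode version assigned that code point as a combining
mark, a machine with newer tables could reorder the second string,
while this machine would keep passing it through unchanged.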

However, given the definition of 'isNormalized', that means
should-be-unnormalised text technically counts as "normalised" if it
contains unassigned characters (because the function didn't change the
original string). That's why I say the result has local significance
only.

If you want a result with global significance, you need to check for
unassigned characters, and the process becomes failable. That's the
process for creating a stabilised string.
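
A minimal sketch of such a failable process, again in Python and under
the same assumptions: `normalize_stabilised` and
`UnassignedCodePointError` are hypothetical names, not part of any
existing library, and general category `Cn` is used as the test for
"unassigned in the version of Unicode I know about".

```python
import unicodedata

class UnassignedCodePointError(ValueError):
    """The input contains a code point unknown to the local Unicode tables."""

def normalize_stabilised(text: str, form: str = "NFC") -> str:
    """Normalise `text`, but refuse input containing unassigned code points."""
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cn":
            raise UnassignedCodePointError(
                f"U+{ord(ch):04X} at index {index} is unassigned in "
                f"Unicode {unicodedata.unidata_version}"
            )
    # Every code point is known locally, so the result comes from real
    # table data rather than from the pass-through fallback.
    return unicodedata.normalize(form, text)
```

With this, `normalize_stabilised("e\u0301")` returns the precomposed
"é", whereas a string containing a code point that is unassigned in
the local tables raises instead of silently coming back unchanged.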

> >
> >
> > ### "Are `x` and `y` Equivalent?"
> >
> >
> > Normalisation is how we define equivalence. Two strings, x and y, are
> > equivalent if normalising each of them produces the same result:
> >
> > ```
> > areEquivalent(x, y):
> >    normalize(x) == normalize(y)
> > ```
> >
> > And so following from the previous section, when we deal in pairs (or
> > larger collections) of strings, it follows that:
> >
> > - Two machines communicating over a network may disagree about whether
> > x and y are equivalent or distinct.
> > - The same machine may think x and y are distinct one day, then after
> > an OS update, suddenly think that the same x and y are equivalent.
> >
> > This has some interesting implications. For instance:
> >
> > - If you encode a `Set<String>` in a JSON file, when you (or another
> > machine) decodes it later, the resulting Set's `count` may be less
> > than what it was when it was encoded.
> > - And if you associate values with those strings, such as in a
> > `Dictionary<String, SomeValue>`, some values may be discarded because
> > we would think they have duplicate keys.
> > - If you serialise a sorted list of strings, they may not be
> > considered sorted when you (or another machine) loads them. Sorting
> > involves normalisation, since equivalent strings sort identically.
>
> Other than code point order, two systems cannot apply any common sort on
> a list of strings where some of the strings contain unassigned
> characters for at least one system.
>
> That restriction also applies to any linguistic sorting.
>
> The (overwhelming) majority of data will be from the subset that both
> systems know about. You might think of ways to mediate the interaction
> by putting a Unicode version number on your list.

Right, but I'm talking about how to implement Unicode standards with
the kind of robustness you would expect in a major production system.
I provided an example where using plain normalisation (as opposed to
the "normalisation process for stabilised strings" [NPSS]) would
result in invalid content making it through the UTS46 processing
pipeline and appearing as valid. Given that UTS46 is supposed to
validate internet domains, such errors could plausibly lead to
security vulnerabilities (two different ASCII domains with
canonically-equivalent Unicode representations should not exist), so I
would consider this an unacceptable defect.

Stabilised strings prevent such errors, which is why I am surprised to
see that no major libraries thought it was worth exposing an API for
them. I don't think this problem is unique to Swift. If I were
implementing UTS46 in JavaScript and relying on the built-in
`String.prototype.normalize()` function in a similar way, I would have
to take similar precautions to check that I was getting a result based
on actual data tables vs. the graceful fallback.
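
One possible shape for that precaution, sketched in Python rather than
Swift or JavaScript purely for illustration: an equivalence check that
returns "don't know" instead of guessing. The names `has_unassigned`
and `are_equivalent` are hypothetical, and general category `Cn` is
again my stand-in for "unassigned".

```python
import unicodedata
from typing import Optional

def has_unassigned(text: str) -> bool:
    """True if any code point is unassigned in the local Unicode tables."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)

def are_equivalent(x: str, y: str) -> Optional[bool]:
    """Canonical equivalence, or None when the local tables cannot decide."""
    if x == y:
        # Identical code point sequences are equivalent in every version.
        return True
    if has_unassigned(x) or has_unassigned(y):
        # A comparison here would rest on the pass-through fallback, and
        # a newer Unicode version might give a different answer.
        return None
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)
```

A caller that validates domains could then treat `None` as a hard
failure, while a less sensitive caller might fall back to comparing
code points.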

>
> If you are trying to process a list that requires knowing
> as-yet-undefined equivalences you would be able to flag that as an error.
>
> A./
>
> PS: there are lots of things you shouldn't do with unassigned code
> points, as you cannot produce results that are "correct".