New Swift API for Unicode Normalisation - feedback wanted about stabilised strings

Karl Wagner razielim at gmail.com
Sat Jan 25 11:25:55 CST 2025


Hi!

I am proposing a set of new Unicode Normalization APIs
<https://github.com/swiftlang/swift-evolution/pull/2512/files?short_path=029373a>
for the Swift programming language, and my research has raised some
concerns related to normalisation, versioning, and software
distribution. I've spent some time thinking about them and believe I
have a good design (both in terms of the API I want to expose to users
and the documentation/advice that would accompany it), but it seems
quite novel compared to other languages and libraries, and that means
it's probably worth asking some Unicode experts whether the reasoning
makes sense.

> Note: I am not affiliated with Apple. Swift is an open source project and anybody is welcome to contribute (the affiliation is sometimes assumed).

## Background

An interesting feature of the Swift language is that its default
`String` type is designed for correct Unicode processing - for
instance, canonically-equivalent Strings compare as being equal to
each other and produce the same hash value, so you can do things like
insert a `String` into a `Set` (a hash table) and retrieve it using
any canonically-equivalent string.

```swift
var strings: Set<String> = []

strings.insert("\u{00E9}") // precomposed "é" (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
assert(strings.contains("e\u{0301}")) // decomposed: "e" + U+0301 COMBINING ACUTE ACCENT
```

The Swift standard library contains its own implementations of several
Unicode standards such as UAX15 for normalisation and UAX29 for
grapheme breaking, and there is a desire to expose these so that more
interesting text-related structures and algorithms can be built
out-of-the-box. The built-in support already covers a lot, and it will
probably grow, but I don't believe there is a desire to implement
_every_ Unicode standard (certainly not in the near term, at least), so
it isn't like having
ICU built-in. Instead, if a developer needs something very specialised
such as UTS46 (IDNA) or UAX39 (spoof checking), they can create a
third-party library and make use of the built-in functionality
together with their own data tables and algorithms. And this is what
we want - we want the Unicode support in the standard library to be
useful when implementing text algorithms, and ideally I suppose that
would scale all the way up to something like rolling your own UTS46 or
UAX39.

But this compositional approach creates some interesting challenges if
the libraries contain embedded data tables and are not distributed
together - their tables may each contain data for different Unicode
versions, and correctly handling those mismatches requires a great
deal of care. The API that I am proposing for Swift would be part of
the language's standard library, and on Apple platforms this library
is distributed as part of the operating system. That means its version
(and the version of any Unicode tables it contains) depends on the
user's operating system version, while any data tables embedded in
your already-built application are static.


## Normalisation and versioning


While this topic of version mismatches is quite interesting, I have
some specific concerns about normalisation that I'd like to run by you
all:


### "Is `x` Normalized?"


It's helpful to start by considering what it means when we say a
string "is normalised". It's very simple; literally all it means is
that normalising the string returns the same string.

```
isNormalized(x):
  normalize(x) == x
```

For me, it was a bit of a revelation to grasp that in general, the
result of `isNormalized` is _only locally meaningful_. Asking the same
question, at another point in space or in time, may yield a different
result:

- Two machines communicating over a network may disagree about whether
x is normalised.
- The same machine may think x is normalised one day, then after an OS
update, suddenly think the same x is not normalised.


### "Are `x` and `y` Equivalent?"


Normalisation is how we define equivalence. Two strings, x and y, are
equivalent if normalising each of them produces the same result:

```
areEquivalent(x, y):
  normalize(x) == normalize(y)
```
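In Swift terms, a minimal sketch of these two definitions might look like
the following, assuming Foundation's `precomposedStringWithCanonicalMapping`
(NFC) as the `normalize` step - this is not the proposed API, just an
illustration. Note the scalar-wise comparison: Swift's `==` on `String`
already applies canonical equivalence, so it can't be used to detect
whether normalisation changed anything.

```swift
import Foundation

// Sketch only: Foundation's NFC normaliser stands in for `normalize`.
// We compare Unicode scalars element-by-element because String's own
// `==` already treats canonically-equivalent strings as equal, which
// would make `normalize(x) == x` trivially true.
func isNormalized(_ x: String) -> Bool {
    x.precomposedStringWithCanonicalMapping.unicodeScalars
        .elementsEqual(x.unicodeScalars)
}

func areEquivalent(_ x: String, _ y: String) -> Bool {
    x.precomposedStringWithCanonicalMapping.unicodeScalars
        .elementsEqual(y.precomposedStringWithCanonicalMapping.unicodeScalars)
}
```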

Following on from the previous section, when we deal with pairs (or
larger collections) of strings:

- Two machines communicating over a network may disagree about whether
x and y are equivalent or distinct.
- The same machine may think x and y are distinct one day, then after
an OS update, suddenly think that the same x and y are equivalent.

This has some interesting implications. For instance:

- If you encode a `Set<String>` in a JSON file, the Set decoded later (by
you, or by another machine) may have a smaller `count` than the one that
was encoded - see the sketch after this list.
- And if you associate values with those strings, such as in a
`Dictionary<String, SomeValue>`, some values may be discarded because the
keys now appear to be duplicates.
- If you serialise a sorted list of strings, they may no longer be
considered sorted when you (or another machine) load them. Sorting
involves normalisation, since equivalent strings sort identically.
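
Here is a minimal sketch of the first point, using Foundation's
`JSONEncoder`/`JSONDecoder`. The counts only diverge when the encoding
and decoding happen on systems with different Unicode versions:

```swift
import Foundation

// Sketch of the round-trip hazard: encode a Set<String> on one system,
// decode it on another with different Unicode tables. On a
// pre-Unicode-15.0 system these two strings are both considered
// normalised and therefore distinct, so the Set holds 2 elements.
let strings: Set<String> = ["e\u{1E08F}\u{031F}", "e\u{031F}\u{1E08F}"]
let data = try! JSONEncoder().encode(strings) // a JSON array of 2 strings

// A machine with Unicode 15.0+ tables decoding that same JSON considers
// the two strings canonically equivalent, so its decoded Set may contain
// only 1 element - fewer than were encoded.
let decoded = try! JSONDecoder().decode(Set<String>.self, from: data)
print(decoded.count)
```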


### Demo: Normalization depending on system version


I created a simple demo to test my understanding so far.

```swift
let strings = [
  "e\u{1E08F}\u{031F}",
  "e\u{031F}\u{1E08F}",
]

print(strings)
print(Set(strings).count)
```

Each of these strings contains an "e" and the same two combining
marks. One of them, U+1E08F, is `COMBINING CYRILLIC SMALL LETTER
BYELORUSSIAN-UKRAINIAN I`, which was added in Unicode 15.0 (September 2022). If
we were to run the above code snippet on iOS 14 (from 2020), we would
find that the Set has 2 strings. But if we took that very same binary
and ran it on iOS 18 (from 2024), it would only contain 1 string.
Here's how I explain this:

Everything (all of our definitions) is built upon the result of
`normalize(x)`. The way that we test if something is normalised (and
therefore whether two things are equivalent) is all based on what
`normalize(x)` does to it. Without explaining the entire process, one
of the things it does is sort the two combining characters.

```swift
let strings = [
  "e\u{1E08F}\u{031F}",
  "e\u{031F}\u{1E08F}",
]
```

The second string is in the correct canonical order - `\u{031F}` before
`\u{1E08F}` - and if the library in the OS supports at least Unicode
15.0, it knows it is safe to rearrange the first string's marks into
that order. That means:

```swift
// On recent systems:

isNormalized(strings[0]) // false
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // true
```

And that is why recent systems only have 1 string in their Set.
The older system, on the other hand, doesn't know that it's safe to
rearrange those characters (one of them is completely unknown to it!),
so `normalize(x)` is conservative and leaves the string as it is. That
means:

```swift
// On a system from 2020:

isNormalized(strings[0]) // true <-----
isNormalized(strings[1]) // true
areEquivalent(strings[0], strings[1]) // false <-----
```

This is quite an important result - the older system considers _both_
strings normalised, and therefore not equivalent! (This is what I meant
when I said `isNormalized` is only locally meaningful.)
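
For reference, the canonical combining classes involved can be inspected
with the standard library's `Unicode.Scalar.Properties`; a small sketch
follows (the values shown assume the running system has Unicode 15.0 or
later tables):

```swift
// Canonical combining classes determine how a run of combining marks is
// reordered during normalisation; lower values sort first.
let belowMark = Unicode.Scalar(0x031F)!  // a mark that attaches below (ccc 220)
let aboveMark = Unicode.Scalar(0x1E08F)! // an "above" mark added in Unicode 15.0

print(belowMark.properties.canonicalCombiningClass.rawValue) // 220
print(aboveMark.properties.canonicalCombiningClass.rawValue) // 230 here; 0 on a
                                                             // system that doesn't
                                                             // know the character
```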


### Example: UTS46


As an example of how this could affect somebody implementing a Unicode
standard, consider UTS46 (IDNA compatibility processing)
<https://www.unicode.org/reports/tr46/>. It requires both a mapping
table, and normalisation to NFC. From the standard:

> **Processing**
>
> 1. Map. For each code point in the *domain_name* string, look up the Status value in *Section 5, IDNA Mapping Table*, and take the following actions: [snip]
> 2. Normalize. Normalize the *domain_name* string to Unicode Normalization Form C.
> 3. Break. Break the string into labels at U+002E ( . ) FULL STOP.
> 4. Convert/Validate. For each label in the *domain_name* string: [snip]

If a developer were implementing this as a third-party library, they
would have to supply their own mapping table, but they would
presumably be interested in using the standard library's built-in
normaliser. That could lead to an issue where the mapping table is
built for Unicode 15.0, but the user is running on an older system
that only has a Unicode 14.0 normaliser.

Consider U+1E08F `COMBINING CYRILLIC SMALL LETTER
BYELORUSSIAN-UKRAINIAN I` from the example above: it is considered
valid by the UTS46 mapping table, so both of our previous example
strings would pass this step. They will then meet the (older)
normaliser, which, as we have seen, may not return the expected result.
Furthermore, a later step in the validation process - the check that "the
label must be in Unicode Normalization Form NFC" - would confusingly
pass.
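
To make the shape of the problem concrete, here is a hypothetical sketch
of steps 2 and 3 of that processing, assuming the library's own
(statically embedded) mapping table has already been applied and that
Foundation's `precomposedStringWithCanonicalMapping` is the
system-provided NFC normaliser:

```swift
import Foundation

// Hypothetical UTS46 fragment: the mapping step (step 1) uses the
// library's embedded table, built against one fixed Unicode version;
// the NFC step below uses whichever normaliser ships with the user's
// operating system.
func normalizeAndBreak(afterMapping mapped: String) -> [Substring] {
    // 2. Normalize the domain_name string to Unicode Normalization Form C.
    //    On an older OS, unknown combining marks are left un-reordered here.
    let normalized = mapped.precomposedStringWithCanonicalMapping

    // 3. Break the string into labels at U+002E ( . ) FULL STOP.
    return normalized.split(separator: ".", omittingEmptySubsequences: false)
}
```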

I worry that these kinds of bugs could be very difficult to spot, even
for experts. Standards documents like UTS46 generally assume that you
bring your own normaliser with you. Identifying this issue requires
users to have some serious expertise regarding how Unicode
normalisation works _and_ about the nuances of how fundamental
software like the language's standard library gets distributed on
their platform(s), and even if you know all of these things it would
still be very easy to miss.


## The Solution - Stabilised Strings


It turns out there is already a solution for this - stabilised strings
(the Normalization Process for Stabilized Strings, NPSS, from Section
12.1 of UAX #15).

Basically it's just normalisation, but it can fail, and it does fail if
the string contains any unassigned code points. Thanks to the
normalisation stability policy, any string which passes this check gets
some very attractive guarantees:

> Once a string has been normalized by the NPSS for a particular normalization form, it will never change if renormalized for that same normalization form by an implementation that supports any version of Unicode, past or future.
>
> For example, if an implementation normalizes a string to NFC, following the constraints of NPSS (aborting with an error if it encounters any unassigned code point for the version of Unicode it supports), the resulting normalized string would be stable: it would remain completely unchanged if renormalized to NFC by any conformant Unicode normalization implementation supporting a prior or a future version of the standard.

Since normalisation defines equivalence, it also follows that two
distinct stable normalisations will _never_ be considered equivalent.
From a developer's perspective, if I store N stable normalisations
into my `Set<String>` I know for a fact that _any_ client that decodes
that data will see a collection of N distinct strings. If I have a
sorted list of stably-normalised strings, everybody will agree that
there are N sorted strings, etc. And if our UTS46 library were to
check for a stable normalisation, it would catch problematic version
mismatches. This sounds great!
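
A minimal sketch of what such a check could look like, assuming
Foundation's NFC normaliser and using `Unicode.Scalar.Properties` to
detect code points that are unassigned in this system's Unicode version
(the names here are illustrative, not the proposed API):

```swift
import Foundation

// Sketch of the Normalization Process for Stabilized Strings (NPSS):
// fail (return nil) if the string contains any code point that is
// unassigned in the Unicode version this system's tables were built
// for; otherwise return the NFC normalisation, which is then stable
// across all past and future Unicode versions.
func stableNFC(_ s: String) -> String? {
    guard s.unicodeScalars.allSatisfy({ $0.properties.generalCategory != .unassigned }) else {
        return nil
    }
    return s.precomposedStringWithCanonicalMapping
}

// On a pre-Unicode-15.0 system, U+1E08F is unassigned, so
// stableNFC("e\u{1E08F}\u{031F}") returns nil instead of silently
// producing a result that newer systems would disagree with.
```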

Given the concerns I've outlined above, and how subtly these issues
can emerge, I think this is a really important feature to expose
prominently in the API. I want to expose regular and stabilised
normalisation as sibling operations, each producing values with a
different "scope", which developers reaching for normalisation should
consider as a pair. The thing is, that seems to be basically without
precedent in other languages and Unicode libraries:

- ICU's `unorm2`
<https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/unorm2_8h.html#aee41897bd6c4ff7dc7a8db6d37efdcbd>
includes `normalize`, `is_normalized`, and `compare`, but no
interfaces for stabilised strings. I wondered if there might be flags
that would make these functions return an error for unstable
normalisations/comparisons, but I don't think there are (are there?).

- ICU4X's `icu_normalizer`
<https://unicode-org.github.io/icu4x/rustdoc/icu_normalizer/struct.ComposingNormalizerBorrowed.html>
interfaces also include `normalize` and `is_normalized`, but no
interfaces for stabilised strings.

- JavaScript has `String.prototype.normalize`
<https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize>,
but no interfaces for stabilised strings. Given the variety of runtime
environments for JavaScript, surely they would see an even wider
spread of Unicode versions than Swift?

- Python's `unicodedata`
<https://docs.python.org/3/library/unicodedata.html> has `normalize`
and `is_normalized`, but no interfaces for stabilised strings.

- Java's `java.text.Normalizer`
<https://docs.oracle.com/javase/8/docs/api/java/text/Normalizer.html>
has `normalize` and `isNormalized`, but no interfaces for stabilised
strings.

I can barely find any references to them elsewhere, either. I found
them mentioned in the NIST Digital Identity Guidelines for normalising
Unicode passwords, but that's about it:

> If Unicode characters are accepted in memorized secrets, the verifier SHOULD apply the Normalization Process for Stabilized Strings using either the NFKC or NFKD normalization defined in Section 12.1 of Unicode Standard Annex 15. This process is applied before hashing the byte string representing the memorized secret.
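
For illustration, a rough sketch of that recommendation using NFKD
(Foundation's `decomposedStringWithCompatibilityMapping`) combined with
the same unassigned-code-point check as above; the actual hashing or
key-derivation step is elided:

```swift
import Foundation

// Apply the NPSS with NFKD to a memorized secret before hashing.
// Returns nil if the secret cannot be stabilised on this system
// (i.e. it contains code points unassigned in this system's tables).
func stableSecretBytes(_ secret: String) -> [UInt8]? {
    guard secret.unicodeScalars.allSatisfy({ $0.properties.generalCategory != .unassigned }) else {
        return nil
    }
    return Array(secret.decomposedStringWithCompatibilityMapping.utf8)
}
```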

So naturally I'm left wondering: why not? I think stabilised strings
are extremely important, but really nobody else seems to agree. Have I
misunderstood something about Unicode versioning and normalisation?
I've tried to document my reasoning and verify it with tests. I think
it checks out.

Or is this just an aspect of designing Unicode libraries that has been
left underexplored until now?


## The End


Thank you very much for reading and I look forward to your thoughts. I
also welcome any other feedback about the API I am proposing for
Swift. Many of us in the language community are excited about making
it a great language for text processing.

Thanks!

Karl


