String Ranges in Unicode Sets

Richard Wordingham richard.wordingham at ntlworld.com
Tue Sep 8 16:41:08 CDT 2015


On Tue, 08 Sep 2015 08:19:03 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> Mark Davis ☕️ <mark at macchiato dot com> wrote:
> 
> >> TUS 8.0 Chapter 3 C6: "A process shall not assume that the
> >> interpretations of two canonical-equivalent character sequences are
> >> distinct."
> >
> > A compiler will take source code containing String x="á"; and
> > compile it to a certain binary. If that same source code is NFD'd,
> > the compiler will produce a different result.
> >
> > Do you really think that such compiler is not compliant to Unicode??
> > If so, then we should add some more clarifications around C6.

It's not me who put mens rea into the conformance requirements. If
a compiler does no more than check strings for validity, then it may
simply naively copy the sequence of scalar values without being
non-compliant, so long as the *intent* is not to preserve differences.

For example, if a process changes strings to preferred canonically
equivalent strings, but treats characters with ccc=9 as though they had
ccc=0, it probably is in breach.  On the other hand, if it treated
characters with ccc=9 as though they had ccc=300 (not a possible value
of ccc), it is compliant.
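
A minimal Python sketch of the ccc example, for illustration only.  The
helper names canonical_reorder and ccc_with_class_9_as are hypothetical,
and the code performs only the canonical reordering step (no
decomposition or composition), so it is not a full normaliser.  It uses
DEVANAGARI KA with NUKTA (ccc=7) and VIRAMA (ccc=9) in both orders.

```python
import unicodedata

def canonical_reorder(s, ccc=unicodedata.combining):
    """Stable-sort each run of non-starters (ccc != 0) by combining class."""
    out, run = [], []
    for ch in s:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

def ccc_with_class_9_as(value):
    """Hypothetical override: report `value` for characters whose ccc is 9."""
    def ccc(ch):
        c = unicodedata.combining(ch)
        return value if c == 9 else c
    return ccc

a = "\u0915\u093C\u094D"  # KA, NUKTA (ccc=7), VIRAMA (ccc=9)
b = "\u0915\u094D\u093C"  # KA, VIRAMA, NUKTA: canonically equivalent to a

# Real classes: both orders collapse to one preferred form.
print(canonical_reorder(a) == canonical_reorder(b))
# ccc=9 treated as 0: the virama blocks reordering, so the difference survives.
print(canonical_reorder(a, ccc_with_class_9_as(0)) ==
      canonical_reorder(b, ccc_with_class_9_as(0)))
# ccc=9 treated as 300: equivalent strings still collapse to one form.
print(canonical_reorder(a, ccc_with_class_9_as(300)) ==
      canonical_reorder(b, ccc_with_class_9_as(300)))
```

The second comparison is the case I would call a probable breach: two
canonically equivalent strings come out distinct.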

I think it is quite possible to have two identical pieces of code of
which one is compliant and the other is non-compliant.  It all depends
on the code's motive, which I can only think refers to the motives of
the intelligent entity that caused the code to be as it is.

> I agree. The word "interpretations" in C6 can't have been intended to
> include the interpretation of code points qua code points. That would
> make a great many internal processes impossible.

I would make it even more extreme by saying that the intent is that the
rule apply to encoded text, as opposed to mere strings of code units.

The problem is that some procedures allow a character to represent
itself even where that is not consistent because the data will be seen
as text.  For example, it is my opinion that combining marks and control
characters only belong in the representation of Unicode sets when they
are part of a non-defective string element.

> I think of C6 as meaning that spell-checkers, for example, should not
> treat José (NFC, four code points) and José (NFD, five code points)
> as separate entries.

C6 does not prohibit spell-checkers from neglecting to normalise.  The
authors of the code of a spell-checker could take the view that the
database writers should have included all canonically equivalent
forms.  Practically, that allows a spell-checker to enforce
normalisation.
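
A rough Python sketch of the two designs, assuming a toy one-word
dictionary stored in NFC.  The function names are hypothetical.  The
first checker does not normalise its input, and so in effect enforces
NFC on the text; the second normalises before lookup.

```python
import unicodedata

jose_nfc = "Jos\u00E9"    # four code points
jose_nfd = "Jose\u0301"   # five code points, canonically equivalent

# Assumption: the database writers supplied only the NFC form.
dictionary = {jose_nfc}

def accepts_without_normalising(word):
    """Raw lookup: only spellings stored verbatim are accepted."""
    return word in dictionary

def accepts_after_normalising(word):
    """Normalise the input to NFC before looking it up."""
    return unicodedata.normalize("NFC", word) in dictionary

print(len(jose_nfc), len(jose_nfd))
print(accepts_without_normalising(jose_nfd))  # NFD spelling is flagged
print(accepts_after_normalising(jose_nfd))    # NFD spelling is accepted
```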

There's another, subtle feature for spell checkers.  By any reading, C6
does not require a spell-checker to realise that 'find' might be spelt
with U+FB01 LATIN SMALL LIGATURE FI.  Applying NFKC or NFKD to the Thai
word for 'water' would be wrong, for that converts <NA, MAI THO, SARA
AM> to <NA, MAI THO, NIKHAHIT, SARA AA>, which is wrong and looks quite
different.  Moreover, U+FB01 is not an acceptable alternative to <f, i>
in Turkish.
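
The difference between canonical and compatibility normalisation can be
checked with Python's standard unicodedata module.  This is only a
sketch; the variable names are mine.

```python
import unicodedata

ligature_find = "\uFB01nd"        # U+FB01 LATIN SMALL LIGATURE FI + "nd"
water = "\u0E19\u0E49\u0E33"      # <NA, MAI THO, SARA AM>

# Canonical normalisation leaves both strings alone.
print(unicodedata.normalize("NFC", ligature_find) == ligature_find)
print(unicodedata.normalize("NFC", water) == water)

# Compatibility normalisation folds the ligature to <f, i> ...
print(unicodedata.normalize("NFKC", ligature_find) == "find")

# ... and splits SARA AM into NIKHAHIT + SARA AA after the tone mark.
nfkc_water = unicodedata.normalize("NFKC", water)
print([unicodedata.name(c) for c in nfkc_water])
```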

Richard.


