Terminology (was: Latin glottal stop in ID in NWT, Canada)
richard.wordingham at ntlworld.com
Sat Oct 24 06:33:32 CDT 2015
On Sat, 24 Oct 2015 08:40:32 +0300
Eli Zaretskii <eliz at gnu.org> wrote:
> > Date: Fri, 23 Oct 2015 23:16:32 +0100
> > From: Richard Wordingham <richard.wordingham at ntlworld.com>
> > "C6: A process shall not assume that the interpretations of two
> > canonical-equivalent character sequences are distinct."
> > Firstly, I have grave difficulties assigning mental activities to
> > processes.
> > Secondly, it may be possible to interpret "A process shall not
> > assume X" as "A process shall function correctly regardless of
> > whether X holds."
> > However, let image(Y) be the bitmap depicting the string Y. Then
> > the following logic would be non-compliant:
> > if A and B are canonically equivalent and image(A) and image(B) are
> > different, then
> > write(A, " and ", B, " are canonically equivalent but have
> > different images ", image(A), " and ", image(B));
> > end if
> > The logic is non-compliant, for if it is invoked then the write
> > statement will only work correctly if image(A) and image(B) are
> > different, i.e. if A and B are interpreted differently. Apparently
> > it is permissible to render canonically equivalent sequences
> > differently, so image(A) and image(B) might be different even
> > though A and B are canonically equivalent.
> > I therefore conclude that C6 is in some language that I do not
> > adequately understand.
> AFAIU, Unicode is about processing text, and only mentions display
> rarely, where it's directly related to the processing part. So the
> above is about _processing_ canonically-equivalent sequences, not
> about their display. When looked at in this way, I see no
> difficulties in understanding the text.
Display is part of interpretation - indeed, it is currently the most
important part. At least, I would interpret displaying U+0041 with a
glyph like 'X' (an example in 'D2 Character identity') as violating:
"C4: A process shall interpret a coded character sequence according
to the character semantics established by this standard, if that process
does interpret that coded character sequence."
I chose the complicated function image() as being less controversial.
However, as you do not think it interprets a string, consider the full,
default toUppercase() instead. The problem lies with troublesome
U+0345 COMBINING GREEK YPOGEGRAMMENI (subscript iota) with ccc=240,
which uppercases to U+0399 GREEK CAPITAL LETTER IOTA with ccc=0. While
U+0345 commutes with Greek accents, U+0399 does not.
Thus U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
uppercases, in full mode, to <U+1F08 GREEK CAPITAL LETTER ALPHA WITH
PSILI, U+0399>, but the canonically equivalent lower case form <U+1FB3
GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA
ABOVE> uppercases, in full mode, to the inequivalent upper case <U+0391
GREEK CAPITAL LETTER ALPHA, U+0399, U+0313>.
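The discrepancy can be reproduced with a minimal Python sketch. It assumes that Python's built-in str.upper() applies the full, unconditional case mappings (as CPython 3.3 and later do) and that comparing NFD forms is an adequate test of canonical equivalence. The helper name canonically_equivalent is made up for this illustration.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their NFD forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# U+0345 has a non-zero combining class; its uppercase U+0399 is a starter.
ccc_ypogegrammeni = unicodedata.combining("\u0345")
ccc_capital_iota = unicodedata.combining("\u0399")

# Two canonically equivalent lower-case spellings.
precomposed = "\u1F80"         # alpha with psili and ypogegrammeni
partly_composed = "\u1FB3\u0313"  # alpha with ypogegrammeni + combining comma above

# Full uppercasing of each spelling (assumes str.upper() uses full mappings).
upper_precomposed = precomposed.upper()
upper_partly_composed = partly_composed.upper()

lower_equivalent = canonically_equivalent(precomposed, partly_composed)
upper_equivalent = canonically_equivalent(upper_precomposed, upper_partly_composed)

print(lower_equivalent, upper_equivalent)
```

Under those assumptions the inputs compare as equivalent while the uppercased outputs do not, because U+0399 blocks the reordering that U+0345 permitted.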
The brute-force solution to this issue, which is minor in practice, is
to convert strings to NFD before upper-casing, but this would fall foul
of one guess of
the meaning of C6, namely "An author shall not assume that the
interpretations of two canonical-equivalent character sequences are
distinct". Of course, if that is the meaning, determining whether
X = toNFC(toUppercase(toNFD(X)))
is compliant depends on answering the question, "Did the author think
he could get a different result if he omitted the conversion to NFD?".
I'm not sure whether the code would be compliant under my
interpretation if the author was unsure as to whether omitting the
conversion would get a different result.
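The expression toNFC(toUppercase(toNFD(X))) might be sketched in Python as follows. The function name upper_via_nfd is invented for this illustration, and the sketch again assumes that str.upper() performs full, unconditional case mapping; it takes no position on whether such code conforms to C6.

```python
import unicodedata


def upper_via_nfd(s: str) -> str:
    """Hypothetical helper: decompose, uppercase, then recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", decomposed.upper())


# The two canonically equivalent spellings of the lower-case form.
result_precomposed = upper_via_nfd("\u1F80")
result_partly_composed = upper_via_nfd("\u1FB3\u0313")

print(result_precomposed == result_partly_composed)
```

Because both inputs share one NFD form, the uppercasing step sees identical input and so the two results cannot differ.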
> The Hebrew script is never an alphabet, AFAIU, it's likely an abugida
> when the vowel marks are used.
No, the definition of an abugida is that there is a default vowel which
is indicated by the absence of any vowel mark. In fully pointed
Hebrew, it's only final, silent and quiescent consonants that lack vowel
marks. I don't like the definitions, because they are extremely
vulnerable to small changes in use. Indeed, having taken the name from
the consonant system underlying the Ethiopic syllabary, the inventors
of the term subsequently concluded that the eponymous abugida was not
actually an abugida!
> The so-called "full spelling", where
> some vowels are indicated by consonants, does not replace all the
> vowels with consonants, so it isn't, strictly speaking, an alphabet in
> the above sense.
Nor would I claim it as such.