From unicode at unicode.org Thu Nov 1 02:20:51 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:20:51 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> Message-ID: <20181101072051.38cc6a8d@JRWUBU2> On Wed, 31 Oct 2018 14:57:37 -0700 Asmus Freytag via Unicode wrote: > On 10/31/2018 10:18 AM, Marcel Schneider via Unicode wrote: >> Sad that Arabic ? and ? are still missing. > How about all the other sets of native digits? They might not be in natural use this way! Also, there is the possibility of non-spacing superscript digits, as in Devanagari, though they are chiefly not used for counting. But why limit consideration to digits? But what about oxidation states, which use spacing superscript Roman numerals - I couldn't find superscript capital 'V'. Richard. From unicode at unicode.org Thu Nov 1 02:33:28 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 01 Nov 2018 08:33:28 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: (Ken Whistler via Unicode's message of "Wed, 31 Oct 2018 12:14:36 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: <86lg6djlpz.fsf_-_@mimuw.edu.pl> On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote: > On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote: >> >> but we don't have an agreement that reproducing all variations in >> manuscripts is in scope. > > In fact, I would say that in the UTC, at least, we have an agreement > that that clearly is out of scope! > > Trying to represent all aspects of text in manuscripts, including > handwriting conventions, as plain text is hopeless. There is no > principled line to draw there before you get into arbitrary > calligraphic conventions. Your statements are perfect examples of "attacking a straw man": Straw Man (Fallacy Of Extension): attacking an exaggerated or caricatured version of your opponent's position. http://www.don-lindsay-archive.org/skeptic/arguments.html https://en.wikipedia.org/wiki/Straw_man https://en.wikipedia.org/wiki/The_Art_of_Being_Right Perhaps you are joking? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Nov 1 02:46:40 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:46:40 +0000 Subject: use vs mention (was: second attempt) In-Reply-To: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> Message-ID: <20181101074640.2866a022@JRWUBU2> On Wed, 31 Oct 2018 23:35:06 +0100 Piotr Karocki via Unicode wrote: > These are only examples of changes in meaning with or , > not all of these examples can really exist - but, then, another > question: can we know what author means? And as carbon and iodine > cannot exist, then of course CI should be interpreted as carbon on > first oxidation? 
Are you sure about the non-existence? Some pretty weird chemical species exist in interstellar space. Richard. From unicode at unicode.org Thu Nov 1 02:52:09 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:52:09 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> Message-ID: <20181101075209.5ffbba7d@JRWUBU2> On Wed, 31 Oct 2018 11:35:19 -0700 Asmus Freytag via Unicode wrote: > On the other hand, I'm a firm believer in applying certain styling > attributes to things like e-mail or discussion papers. Well-placed > emphasis can make such texts more readable (without requiring that > they pay attention to all other facets of "fine typography".) Unfortunately, your emails are extremely hard to read in plain text. It is even difficult to tell who wrote what. Richard. From unicode at unicode.org Thu Nov 1 08:43:08 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 06:43:08 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101075209.5ffbba7d@JRWUBU2> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> <20181101075209.5ffbba7d@JRWUBU2> Message-ID: <97890362-7550-2e43-2266-a41853b89ba7@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 10:43:21 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 08:43:21 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <86lg6djlpz.fsf_-_@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 12:23:05 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 01 Nov 2018 18:23:05 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: (Asmus Freytag via Unicode's message of "Thu, 1 Nov 2018 08:43:21 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> Message-ID: <86d0roiufa.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 12:33 AM, Janusz S. Bie? via Unicode wrote: > > On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote: > > On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote: > > > but we don't have an agreement that reproducing all variations in > manuscripts is in scope. > > > In fact, I would say that in the UTC, at least, we have an agreement > that that clearly is out of scope! > > Trying to represent all aspects of text in manuscripts, including > handwriting conventions, as plain text is hopeless. There is no > principled line to draw there before you get into arbitrary > calligraphic conventions. > > > Your statements are perfect examples of "attacking a straw man": > > > Perhaps you are joking? 
> Not sure which of us you were suggesting as the jokester here.
>
> I don't think it's a joke to recognize that there is a continuum here
> and that there is no line that can be drawn which is based on
> straightforward principles. This is a pattern that keeps surfacing the
> deeper you look at character coding questions.

Looks like you completely missed my point. Nobody ever claimed that
reproducing all variations in manuscripts is in scope for Unicode, so whom
do you want to convince that it is not?

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Thu Nov  1 12:39:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 18:39:16 +0100
Subject: UCA unnecessary collation weight 0000
Message-ID:

I just remarked that there's absolutely NO utility for the collation weight
0000 anywhere in the algorithm.

For example, UTR #10, section 3.3.1 gives a collation element:

  [.0000.0021.0002]

for COMBINING GRAVE ACCENT. However, it can also be simply:

  [.0021.0002]

for a simple reason: secondary and tertiary weights are necessarily LOWER
than any primary weight (for conformance reasons):

  any tertiary weight < any secondary weight < any primary weight

(the set of all weights for all levels is fully partitioned into disjoint
intervals, one per level, each interval containing all the weights of its
level; the intervals are ordered by decreasing level, and weights increase
within each interval).

This also means that we never need to handle 0000 weights when creating
sort keys from multiple collation elements: we can easily detect that
[.0021.0002] given above starts with the secondary weight 0021 and not a
primary weight.

Likewise, we don't need any level separator 0000 in the sort key.

This allows more interesting optimizations, and a reduction of the length
of sort keys. What this means is that we can safely implement the UCA using
basic substitutions (e.g. with a function like "string:gsub(map)" in Lua,
which uses a "map" to map source (binary) strings or regexps into target
(binary) strings).

For a level-3 collation, you then need only 3 calls to "string:gsub()" to
compute any collation:

- the first ":gsub(mapNormalize)" decomposes a source text into collation
  elements and can perform reordering to enforce a normalized order
  (possibly tuned for the tailored locale) using basic regexps;

- the second ":gsub(mapTertiary)" substitutes each collation element by its
  "intermediary" collation element plus its tertiary weight;

- the third ":gsub(mapSecondary)" substitutes each "intermediary" collation
  element by its primary weight plus its secondary weight.

The "intermediary" collation elements are just like source text, except
that the differences at the later (tertiary, then secondary) levels are
eliminated, i.e. each source collation element string is replaced by the
collation element string that has the smallest collation weights at those
levels. They just have to be encoded so that their code units are HIGHER
than any of the weights split off at those later levels.

How to do that:

- reserve the weight range between .0000 (yes! not just .0001) and .001E
  for the last (tertiary) level, and make sure that all intermediary
  collation elements use only code units of .0020 or above (this means
  that they can remain encoded in their existing UTF form!);

- reserve the weight .001F for the case where you don't want to use
  secondary differences (like letter case) and instead demote them to
  tertiary differences.
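A minimal sketch of the weight layout just described, written in Python for
brevity (the message itself frames the implementation in terms of Lua's
gsub); the range bounds and helper names below are illustrative assumptions
taken from the figures above, not part of any existing library:

```python
# Disjoint per-level weight ranges, following the layout sketched above
# (illustrative values, not DUCET values):
#   tertiary  weights : 0x0000 .. 0x000F  (0x001F reserved for demoted case)
#   secondary weights : 0x0010 .. 0x001E
#   primary   weights : 0x0020 and above
SECONDARY_MIN = 0x0010
PRIMARY_MIN = 0x0020

def weight_level(w: int) -> int:
    """Infer the collation level of a weight from its value alone."""
    if w >= PRIMARY_MIN:
        return 1
    if w >= SECONDARY_MIN:
        return 2
    return 3

def is_ignorable(element_weights, level: int) -> bool:
    """True if the element contributes nothing at or above the given level."""
    return weight_level(element_weights[0]) > level

# With disjoint ranges, a purely secondary element (e.g. a remapped
# combining accent) can be stored as [secondary, tertiary] with no
# 0x0000 placeholder: the first weight already reveals its level.
accent = [0x0011, 0x0002]            # hypothetical remapped weights
assert weight_level(accent[0]) == 2  # starts at the secondary level
assert is_ignorable(accent, 1) and not is_ignorable(accent, 2)
```

The point of the disjoint ranges is simply that the level of any weight can
be recovered from its value, so no placeholder is needed to mark levels.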
This will be used in the second mapping to decompose source collation
elements into "intermediary collation elements" plus a tertiary weight. You
may then decide to leave the tertiary weights in the substituted string, or,
because "gsub()" finds matches from left to right, to accumulate the
tertiary weights into a separate buffer, so that the substitution itself
still returns a valid UTF string containing only "intermediary collation
elements" (with all tertiary differences erased).

You can repeat the process with the next gsub() to return the primary
collation elements (still in UTF form), and separately the secondary
weights (also accumulated in a separate buffer).

Now there remain only 3 strings:

- one contains only the primary collation elements (still in UTF form, but
  using code units always higher than or equal to 0020);
- another one contains only secondary weights (between MINSECONDARYWEIGHT
  and 001F);
- another one contains only tertiary weights (between 0000 and
  MINSECONDARYWEIGHT-1).

For the rest I will assume that MINSECONDARYWEIGHT is 0010, so:

* primary weights are encoded with one or more code units in [0020..]
  (multiple code units are possible if you reserve some of these code units
  to be prefixes of longer sequences);
* secondary weights are encoded with one or more code units in [0010..001E]
  (same remark about multiple code units if you need them);
* tertiary weights are encoded with one or more code units in [0000..001F]
  (same remark about multiple code units if you need them).

The last gsub() will only reorder the primary collation elements to remap
them into a suitable binary order (it will be a simple bijective
permutation, except that the target does not have to use multiple code
units, but a single one, when there are contractions). It's always possible
to make this permutation generate integers higher than 0020. The resulting
weights can remain encodable with UTF-8 as if they were source text.

And to return the sort key, all you need is to concatenate:

* the string containing all primary weights encoded with code units in
  [0020..], then
* the string containing secondary weights encoded with code units in
  [0010..001E], then
* the string containing tertiary weights encoded with code units in
  [0000..001F].

You don't need to insert ANY [0000] as a level separator in the final sort
key, because each concatenated part of the final sort key respects the
well-formedness constraint WF2 of the UCA.

You may choose not to use tertiary weights encoded with [0000] code units
if you want the final string containing the sort key to be null-terminated.

In summary:

* There's no longer any special role given in the UCA to [0000]. More
  compaction is possible for storing the mapping of source collation
  element strings (in their original UTF encoding) to strings of collation
  weights (themselves still encoded with a UTF!).
* Any tailored collation (except those requiring preprocessing that applies
  specific reorderings, possibly performed with one or more regexp
  substitutions applied in a defined order) is specified by just one map
  per collation level, containing source UTF strings (or regexps) to
  replace by their mapped strings of collation weights.
* You are free to choose the UTF used for the source string and for the
  collation weights (these UTFs may be different, or may both be UTF-8).
  If you use a conforming UTF, the only code units you cannot use are those
  in [D800..DFFF], reserved for surrogates.
* Normal string library packages can be used to implement the UCA, even
  those that can only work with texts encoded in a valid UTF.
* Given that the resulting sort keys are valid UTF, they are displayable:
  in many circumstances, the initial part of the string (containing primary
  weights only) will display as the normal UTF encoding of readable text;
  if there are additional secondary or tertiary weights after it, because
  they are represented using C0 controls, you may still display them using
  a notation like \xNN (you only need to escape '\' if it is present as a
  literal in the readable part of the sort key containing primary weights).

Note: Isolated surrogates found in a non-conforming source string need to
be preprocessed if you want to accept them in a collator:

- You can do that by preprocessing [0000] or [D800..DFFF] into [0000]
  followed by a single code unit in [0020..], so that they form a single
  collation element [0000][0020..]: use [0000][0020] as the collation
  element representing the source [0000], and just insert a single [0000]
  before any isolated surrogate, which you'll replace by a code unit in
  [0800..0FFF]. The result will be a conforming UTF string on which your
  collator will return valid UTF strings of weights.
- If you don't want to have any [0000] within sort keys, you can also
  preprocess the source string by re-encoding [0000] into [0001][0020],
  [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into
  [0001][0800..0FFF]. Here also the result will be a conforming UTF string
  on which your collator will return valid UTF strings of weights.

From unicode at unicode.org  Thu Nov  1 15:08:05 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 1 Nov 2018 13:08:05 -0700
Subject: UCA unnecessary collation weight 0000
In-Reply-To:
References:
Message-ID:

There are lots of ways to implement the UCA.

When you want fast string comparison, the zero weights are useful for
processing -- and you don't actually assemble a sort key.

People who want sort keys usually want them to be short, so you spend time
on compression. You probably also build sort keys as byte vectors, not
uint16 vectors (because byte vectors fit into more APIs and tend to be
shorter), like ICU does using the CLDR collation data file. The CLDR root
collation data file remunges all weights into fractional byte sequences,
and leaves gaps for tailoring.

markus

From unicode at unicode.org  Thu Nov  1 15:10:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 21:10:16 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To:
References:
Message-ID:

For example, Figure 3 in UTR #10 contains:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0002* 0002 0002
  2 Cab  | *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0008* 0002 0002
  3 cáb  | *0706* 06D9 06EE *0000* 0020 0020 *0021* 0020 *0000* 0002 0002 0002 0002
  4 dab  | *0712* 06D9 06EE *0000* 0020 0020 0020 *0000* 0002 0002 0002

The 0000 weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".

What is really important is to understand where the input text (after
initial transforms like reordering and expansion) is broken at specific
boundaries between collatable elements.
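A small sanity check of that claim on the four strings of the figure; this
is only a sketch in Python over this one example (it does not address the
unequal-length prefix case that the level separator is designed to
guarantee, nor the fast-comparison implementations mentioned above that
never assemble sort keys at all):

```python
# Sort keys from Figure 3 of UTR #10, as lists of 16-bit weights,
# exactly as shown above (with the 0000 level separators).
with_sep = {
    "cab": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002],
    "Cab": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0008, 0x0002, 0x0002],
    "cáb": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0021, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002, 0x0002],
    "dab": [0x0712, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002],
}

def strip_separators(key):
    """Drop the 0000 level separators (the weights under discussion)."""
    return [w for w in key if w != 0x0000]

without_sep = {s: strip_separators(k) for s, k in with_sep.items()}

# Sort keys are compared weight by weight; Python compares lists of
# integers lexicographically, which models exactly that.
order_with = sorted(with_sep, key=with_sep.get)
order_without = sorted(without_sep, key=without_sep.get)

print(order_with)     # ['cab', 'Cab', 'cáb', 'dab']
print(order_without)  # ['cab', 'Cab', 'cáb', 'dab']
assert order_with == order_without
```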
But the boundaries between the parts of the sort key can always be inferred
from the weights themselves, for example between 06EE and 0020, or between
0020 and 0002. So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE 0020 0020 *0020* *0002* 0002 0002
  2 Cab  | *0706* 06D9 06EE 0020 0020 *0020* *0008* 0002 0002
  3 cáb  | *0706* 06D9 06EE 0020 0020 *0021* 0020 0002 0002 0002 0002
  4 dab  | *0712* 06D9 06EE 0020 0020 0020 0002 0002 0002

As well (see the weights emphasized above):

* when the secondary weights at the end of the sort key are a trailing run
  of 0020 (the minimal secondary weight), you can suppress them from the
  collation key;
* when the tertiary weights at the end of the sort key are a trailing run
  of 0002 (the minimal tertiary weight), you can suppress them from the
  collation key.

This gives:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE
  2 Cab  | *0706* 06D9 06EE *0008*
  3 cáb  | *0706* 06D9 06EE 0020 0020 *0021*
  4 dab  | *0712* 06D9 06EE

See the reduction!

On Thu, 1 Nov 2018 at 18:39, Philippe Verdy wrote:

> I just remarked that there's absolutely NO utility of the collation weight
> 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collection element :
> [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
> [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER then any primary weight (for conformance reason):
> any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into disjoint
> intervals in the same order, each interval containing all its weights, so
> weights are sorted by decreasing level, then increasing weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collection elements, as we can easily detect that
> [.0021.0002] given above starts by a secondary weight 0021 and is not a
> primary weight.
>
> As well we don't need to use any level separator 0000 in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitions (e.g. with a function like "string:gsub(map)" in Lua which uses
> a "map" to map source (binary) strings or regexps,into target (binary)
> strings:
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapSecondary)" will substitute any collection
> elements by their "intermediary" collation elements+tertiary weight.
>
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight
>
> The "intermediary" collection elements are just like source text, except
> that higher level differences are eliminated, i.e.all source collation
> element string are replaced by the collection element string that have the
> smallest collation element weights. They must be just encoded so that they
> are HIGHER than any higher level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes!
not just .0001) and .001E > for the last (tertiary) weight, make sure that all other intermediary > collation elements will use only code units higher than .0020 (this means > that they can remain encoded in their existing UTF form!) > - reserve the weight .001F for the case where you don't want to use > secondary differences (like letter case) and them to tertiary differences. > > This will be used in the second mapping to decompose source collection > elements into "intermediary collation elements" + tertiary weight. you may > then decide to leave tertiary weights in the substitute string, or because > the "gsub()" finds match from left to right, to accumulate the tertiary > weights into a separate buffer, so that the subtitution itself will still > return a valid UTF string, containing only "intermediary collation > elements" (with all tertiary differences erased). > > You can repeat the process with the next gsub() to return the primary > collation elements" (still in UTF form), and separately the secondary > weights (also accumulable in a separate buffer). > > Now there remains only 3 strings: > - one contains only the primary collection elements (still in UTF-form, > but using code units always higher than or equal to 0020) > - another one contains only secondary weights (between MINSECONDARYWEIGHT > and 001F) > - another one contains only tertiary weights. (between 0000 and > MINSECONDARYWEIGHT-1) > > For the rest I will assume that MINSECONDARYWEIGHT is 0010, so > * primary weights are encoded with one or more code units in [0020..] > (multiple code units are possible if you reserve some of these code units > to be prefixes or longer sequences) > * secondary weights are encoded with one or more code units in > [0010..001E] (same remark about multiple code units if you need them) > * tertiary weights are encoded with one or more code units > in [0010..001F] (same remark about multiple code units if you need them) > > The last gsub() will only reorder the primary collection elements to remap > them in a suitable binary order (it will be a simple bijective permutation, > except that the target does not have to use multiple code units, but a > single one, when there are contractions). It's always possible to make this > permutation generate integers higher than 0020. The resulting weights can > remain encodable with UTF-8 as if it was source text. > > And to return the sort key, all you need is to concatenate > * the string containing all primary weights encoded with code units in > [0020..], then > * the string containing secondary weights encoded with code units in > [0010..001E], then > * the string containing tertiary weights encoded with code units in > [0000..001F]. > * you don't need to insert ANY [0000] as a level separator in the final > sort key, because each concatenated part in the final sort key respect the > wellformedness constraint WF2 of the UCA algorithm. > > You may choose to not use tertiary weights encoded with [0000] code units, > if you want the final string containing the sort key to be null-terminated. > > In summary: > * there's no longer any special role given in UCA for [0000]. More > compaction possible for storing the mapping of source collation element > strings (in their original UTF encoding) to strings of collation weights > (themselves still encodage with an UTF!). 
> * Any tailored collation (except those requiring preprocessing that may > apply specific reorderings, possibly made by using subtitution with one or > more regexps to apply, repeated in a defined order) is just specified by > one map per collation level, containing source UTF strings (or regexps) to > replace by their mapped string of collation weights. > * You are free to choose the UTF to use for the source string or for the > collation weight (these UTF may be different or may be both UTF-8. If you > use a conforming UTF, the only code units you cannot use are those in > [D800..DFFF], reserved for surrogates. > * Normal string library packages can be used to implement UCA, even those > that can only work with texts encoded with a valid UTF. > * Given that the resulting sort keys are valid UTF, they are displayable: > in many circonstances, the initial part of the string (containing primary > weights only) will display the normal UTF encoding of readable text; if > there are additional secondary or tertiary weights after it, because they > are represented using C0 controls, you may still display them using a > notation like \xNN (you only need to escape '\' if it is present as a > litteral in the readable part of the sort key containing primary weights). > > Note: Isolated surrogates found in a non-conforming source string need to > be preprocessed if you want to accept them in a collator: > - You can do that by preprocessing [0000] or [D800..DFFF], into [0000] > followed by only one codeunits in [0020..], so they form a single collation > element [0000][0020..]; use [0000][0020] as the collation element > representing the source [0000] and just insert a single [0000] before any > isolated surrogate you'll replace by a code unit in [0800..0FFF]. The > result will be a conforming UTF string on which your collator will return > valid UTF strings of weights. > - If you don't want to have any [0000] within sort keys, you can also > preprocess the source string by reencoding [0000] into [0001][0020], and > [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into > [0001][0800..0FFF]. Here also the result will be a conforming UTF string on > which your collator will return valid UTF strings of weights. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:13:46 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:13:46 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: I'm not speaking just about how collation keys will finally be stored (as uint16 or bytes, or sequences of bits with variable length); I'm just refering to the sequence of weights you generate. You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight, not even during processing, or un the DUCET table. Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > There are lots of ways to implement the UCA. > > When you want fast string comparison, the zero weights are useful for > processing -- and you don't actually assemble a sort key. > > People who want sort keys usually want them to be short, so you spend time > on compression. You probably also build sort keys as byte vectors not > uint16 vectors (because byte vectors fit into more APIs and tend to be > shorter), like ICU does using the CLDR collation data file. The CLDR root > collation data file remunges all weights into fractional byte sequences, > and leaves gaps for tailoring. 
> > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:31:15 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:31:15 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > When you want fast string comparison, the zero weights are useful for >> processing -- and you don't actually assemble a sort key. >> > And no, I absolutely no case where any 0000 weight is useful during processing, it does not distinguish any case, even for "fast" string comparison. Even if you don't build any sort key, may be you'll want to return 0000 it you query the weight for a specific collatable element, but this would be the same as querying if the collatable element is ignorable or not for a given specific level; this query just returns a false or true boolean, like this method of a Collator object: bool isIgnorable(int level, string collatable element) and you can also make this reliable for any collector: int getLevel(int weight); int getMinWeight(int level); int getWeightAt(string element, int level, int position); so you can use these two last functions to write the first one: bool isIgnorable(int level, string element) { return getLevel(getWeightAt(element, 0)) > getMinWeight(level); } That's enough you can write the fast comparison... What I said is not a complicate "compression" this is done on the fly, without any complex transform. All that counts is that any primary weight value is higher than any secondary weight, and any secondary weight is higher than a tertiary weight. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:34:05 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 13:34:05 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <86d0roiufa.fsf@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> Message-ID: <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:35:29 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:35:29 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: Le jeu. 1 nov. 2018 ? 21:31, Philippe Verdy a ?crit : > so you can use these two last functions to write the first one: > > bool isIgnorable(int level, string element) { > return getLevel(getWeightAt(element, 0)) > getMinWeight(level); > } > correction: return getWeightAt(element, 0) > getMinWeight(level); -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:42:02 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 1 Nov 2018 21:42:02 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209> Message-ID: <1bd96f61-d33a-258f-cd8e-9ab29db2bd92@orange.fr> On 01/11/2018 01:21, Asmus Freytag via Unicode wrote: > On 10/31/2018 3:37 PM, Marcel Schneider via Unicode wrote: >> On 31/10/2018 19:42, Asmus Freytag via Unicode wrote: [?] 
>>> It is a fallacy that all text output on a computer should match the convention >>> of "fine typography". >>> >>> Much that is written on computers represents an (unedited) first draft. Giving >>> such texts the appearance of texts, which in the day of hot metal typography, >>> was reserved for texts that were fully edited and in many cases intended for >>> posterity is doing a disservice to the reader. >>> >> The disconnect is in many people believing the user should be disabled to write >> [prevented from writing] Thank you for correcting. >> his or her language without disfiguring it by lack of decent keyboarding, and >> that such input should be considered standard for user input. Making such text >> usable for publishing needs extra work, that today many users cannot afford, >> while the mass of publishing has increased exponentially over the past decades. >> The result is garbage, following the rule of ?garbage in, garbage out.? > > No argument that there are some things that users cannot key in easily and that the common > fallbacks from the days of typewritten drafts are not really appropriate in many texts that > otherwise fall short of being "fine typography". The goal I wanted to reach by discussing and invalidating the biased and misused concept of ?fine typography? is that this thread could get rid of it, but I?m definitely unfortunate. It?s hard for you to understand that relegating abbreviation indicators into the realm of ?fine typography? recalls me what I got to hear (undisclosed for privacy) when asking that the French standard keyboard layouts (plural) support punctuation spacing with NARROW NO-BREAK SPACE, and that is closely related to the issue about social media that you pointed below. Don?t worry about users not being able to ?key in easily? what is needed for the digital representation of their language, as long as: 1. Unicode has encoded what is needed; 2. Unicode does not prohibit the use of the needed characters. The rest is up to keylayout designers. Keying in anything else is not an issue so far. > >> The real >> disservice to the reader is not to enable the inputting user to write his or her >> language correctly. A draft whose backbone is a string usable as-is for publishing >> is not a disservice, but a service to the reader, paying the reader due respect. >> Such a draft is also a service to the user, enabling him or her to streamline the >> workflow. Such streamlining brings monetary and reputational benefit to the user. > > I see a huge disconnect between "writing correctly" and "usable as-is for publishing". These > two things are not at all the same. > > Publishing involves making many choices that simply aren't necessary for more "rough & ready" > types of texts. Not every twitter or e-mail message needs to be "usable as-is for publishing", but > should allow "correctly written" text as far as possible. Not every message, especially not those whose readers expect a quick response. The reverse is true with new messages (tweets, thread lauchers, requests, invitations). As already discussed, there are several levels of correctness. We?re talking only about the accurate digital representation of human languages, which includes correct punctuation. E.g. in languages using letter apostrophe, hashtags made of a word including an apostrophe are broken when ASCII or punctuation apostrophe (close quote) is used, as we?ve been told. 
Supposedly part of this discussion would be streamlined if one could experience how easy it can be to type in one?s language?s accurate digital representation. But it?s better to be told what goes on, and what ?strawmen? we?re confused with, since, again, informed discussion brings advancement. > > When "desktop publishing" as it was called then, became available, too many people started to > obsess with form over content. You would get these beautifully laid out documents, the contents > of which barely warranted calling them a first draft. Typing in one?s language?s accurate digital representation is not being obsessed with form over content, provided that appropriate keyboarding is available. E.g. the punctuation apostrophe is on level 1 where the ASCII apostrophe is when digits are locked on level 1 on the French keyboard I?ve in use; else, digits are on level 3 where is also superscript e for ready input of most of the ordinals (except 1??/1??, 2?? for ranges, and plural with ?): 2??3??4??5??6??7??8??9??10??11??12?. Hopefully that demo makes clear what is intended. Users not needing accurate repsesentation in a given string are free to type in otherwise. The goal of this discussion is that Unicode allow accurate representation, not impose it. Actually Unicode is still imposing inaccurate representation to some languages due to TUS prohibiting the use of precomposed superscript letters in text representing human languages with standard orthography, which is what ?ordinary text? seems to boil down to. > >> That disconnect seems to originate from the time where the computer became a tool >> empowering the user to write in all of the world?s languages thanks to Unicode. > > No, this has nothing to do with Unicode / multi-script support. Why not? Accurate interoperable digital representation of French was totally impossible before version 3.0 of Unicode (bringing the *new* NARROW NO-BREAK SPACE), while before, the Standard was prevented to have such a character by misdefining the line-break property of U+2008 PUCTUATION SPACE, that has the right width and serves no purpose only because unlike related U+2007 FIGURE SPACE (but not U+2012 FIGURE DASH, mistakenly added to the list in my previous e-mail), it is not non-breakable. Useful punctuation spacing was dismissed as being too ?fine? a typography for being universally available and interoperable, while the opposite is true: It?s the only way of writing French without being at risk of conveying the impression of poor craftmanship (see below). >> The concept of ?fine typography? was then used to draw a borderline between what >> the user is supposed to input, and what he or she needs to get for publication. > > This same dividing line applies in English (or any of the other individual languages). Yes of course. The four lines above only intended to set the scene. AFAICS, the disconnect of an encoding standard designed for accuracy and interoperability, the use and the usefulness of which is intentionally throttled down in order to get non-accurate and non-interoperable digital representations of some languages, is unprecedented, and it originates from the time the Unicode Standard was set up. Spacing has been fixed, ordinal indicators are being fixed, and now, other abbreviation indicators still need fixing. 
>> In the same move, that concept was extended in a way that it should include the >> quality of the string, additionally to what _fine typography_ really is: fine >> tuning of the page layout, such as vertical justification, slight variations in >> the width of non-breakable spaces, and of course, discretionary ligatures. > > Certain elements of styling are also part of fine typography. In some cases, readying a "string" > for publication also means applying spelling conventions or grammatical conventions (for those > cases where there are ambiguities in the common language, or applying preferred word choices > or ways of formulating things that may be particular to individual publishers or types of publications. None of these is a reason not to be able to input abbreviation indicators in plain text. But for the rest, I cannot see that applying style guides? orthographies is part of fine typography, just of publishing. These parameters are at the discretion of the management. That does not preclude the input of superscript on a keyboard, and as a side note, the intake of publishers is mainly at least rich text or another markup convention, most currently TeX (for scientific publications). But Unicode promises accurate interoperable representation of all of the world?s languages in plain text. Hence, authors are advised that a good way to make TeX more human-readable is to use more Unicode. > > Using HYPHEN-MINUS instead of "EN DASH" or "HYPHEN" is perfectly OK for early stages of > drafting a text. Attempting to follow those and similar conventions during that phase forces > the author to pay attention to the wrong thing - his or her focus should be on the ideas and > the content, not the form of the document. There is some good point in that. But a close look at just these two conventions leads to significantly lessen the advantage of not using accurate punctuation in one?s drafts. 1. HYPHEN-MINUS vs EN DASH or, should be added, EM DASH: That is not possible in locales using no spacing around EM DASH. Right, SPACE, HYPHEN-MINUS, SPACE is easily replaced with SPACE, EN DASH, SPACE or any other dashing convention at a later stage. But not using a correct dash out of U+2013, U+2014 and U+2015 is not nearly useful if all these are on level 2 of three digit keys (1, 2, 3 or another range). Additionally that brings the advantage of being able to differenciate while thinking at the content. Nobody else can do that job later with a comparable efficiency. 2. HYPHEN-MINUS vs HYPHEN: That has much of a non-starter. As already discussed in detail on this List, HYPHEN is a useless duplicate encoding of HYPHEN-MINUS, which in almost all fonts has the glyph of HYPHEN and is used for the system hyphen from the automated hyphenation when a .docx is exported as a .pdf file. Using fonts designed otherwise requires either a special keyboard layout or weird replacements because the HYPHEN-MINUS in URLs and e-mail addresses must not be replaced. So using HYPHEN-MINUS everywhere a HYPHEN is intended is OK even in publishing. Only some fonts may need fixing (I don?t know more than a single one). > >> Producing a plain text string usable for publishing was then put out of reach >> of most common mortals, by using the lever of deficient keyboarding, but also >> supposedly by an ?encoding error? 
(scare quotes) in the line break property of >> U+2008 PUNCTUATION SPACE, that should be non-breakable like its siblings >> U+2007 FIGURE SPACE (still?as per UAX #14?recommended for use in numbers) and >> U+2012 FIGURE DASH to gain the narrow non-breaking space needed to space the [corrected, see above] >> triads in numbers using space as a group separator, and to space big punctuation >> in a Latin script using locale, where JTC1/SC2/WG2 had some meetings for the UCS: >> French. > > Those details should be handled in a post-processing phase for documents that are intended > for publication. Not at all, as already stated above. Making a mess of any text file that is not print-ready, is an insult to the reader. And any *French* text not spacing punctuations with NNBSP is at risk of ending up as a mess. > One of the big problem in current architectures is that things like "autocorrect" > which attempt to overcome the limitations of the current keyboards, That is another disconnect, already pointed out repeatedly. Current keyboards have no intrinsic ?limitations?, and referring to outdated keyboard layouts as a fatality is in disconnect with the reality, since all OS vendors offer facilities to complete, enhance or change the keyboard layout. > are applied at input time > only; and authors need to constantly interact with these helpers to make sure they don't mis- > fire. Correct; that is also where originated what was called ?the apostrophe catastrophe.? > Much text that is laboriously prepared this way, will not survive future revisions during > the editing process needed to get the *content* to publication quality. That only applies to files fed in an editing process. Many people are directly publishing out-of-the-keyboard, and that is where complete and readily available Unicode support matters most. Anything else can be made up by the rendering engine, as you already noted. The force of Unicode being interoperability and data exchange, I can see no technical reason not to type in Unicode on one?s keyboard, including abbrevation indicators of any kind. > > All because users have no convenient tool to "touch-up" these dashes, quotes, and spaces > in a later phase; at the same time they apply copy-editing, for example. Because once you are in a WYSIWYG environment, you cannot simply transfer the text to your text editor to apply regexes, and people need to write macros in VBA to get things done I figure out. Autocorrect is consistent with WYSIWYG. People not interested in seeing what they?re typing may wish to use LaTeX, where they can see it in another window. What I cannot see is why these important issues should preclude users from typing preformatted superscripts on their keyboard, be it via a ?superscript? dead key. Such a dead key is already standardized, but again, Karl Pentzlin?s proposal to encode the missing characters has been rejected, while in this thread we could see there is an interest for what could be called a UnicodeChem notation, a nearly plain text encoding of chemical elements, compounds and processes. > >> For everybody having beneath his or her hands a keyboard whose layout driver is >> programmed in a fully usable way, the disconnect implodes. 
At encoding and input >> levels (the only ones that are really on-topic in this thread) the sorcery called >> fine typography sums then up to nothing else than having the keyboard inserting >> fully diacriticized letters, right punctuation, accurate space characters, and >> superscript letters as ordinal indicators and abbreviation endings, depending >> on the requirements. > > In the days of typewritten manuscripts you had to follow certain conventions that allowed the > typesetter to select the intended symbols and styled letters. I'm not arguing that we should > return to where such fallbacks are used. And certainly not arguing that we should be using > ASCII fallbacks for letters with diacritics, such as "oe" for "?". > > But many issues around selecting the precise type of space or dash are not so much issues > of correct content but precisely issues of typography. That is right so far as the French national printing office recommends to use NBSP with the colon, while the industry widely uses NNBSP for colon, too, Philippe Verdy reported on this List. It also states that the same should be done for angle quotation marks, but does not so. Here is indeed matter for fine-tuning, but as stated above and below, NBSP does not work in every environment, even not in most of the most common ones where users are typing text. I still call a string publication ready where big punctuations are spaced with NNBSP uniformely. > > Some occupy an intermediate level, where it would be quite appropriate to apply them to > many automatically generated texts. (I am aware of your efforts in CLDR to that effect). Thank you for the occasion to invite everyone to join in and contribute to the oncoming surveys of Unicode?s Common Locale Data Repository. Much needs to be done in French and in many locales already present, even if the stress should naturally be on adding *new* locales still not in CLDR. > But I still believe that they have no place in content focused writing. That is only the effect of an error of perception, that is widely fueled by the deficient keyboard design not supporting automated punctuation spacing for French. See ticket in Trac. > >> Now was I talking about ?all text output on a computer?? No, I wasn?t. >> >> The computer is able to accept input of publishing-ready strings, since we have >> Unicode. Precluding the user from using the needed characters by setting up >> caveats and prohibitions in the Unicode Standard seems to me nothing else than >> an outdated operating mode. U+202F NARROW NO-BREAK SPACE, encoded in 1999 for >> Mongolian [1][2], has been readily ripped off by the French graphic industry. >> In 2014, TUS started mentioning its use in French [3]; in 2018, it put it on >> top [4]. >> That seems to me a striking example of how things encoded for other purposes >> are reused (or following a certain usage, ?abused?, ?hacked?, ?hijacked?) in >> locales like French. If it wasn?t an insult to minority languages, that >> language could be called, too, ?digitally disfavored? in a certain sense. >> >>> On the other hand, I'm a firm believer in applying certain styling attributes >>> to things like e-mail or discussion papers. Well-placed emphasis can make such >>> texts more readable (without requiring that they pay attention to all other >>> facets of "fine typography".) >> The parenthesized sidenote (that is probably the intended main content?) makes >> this paragraph wrong. I?d buy it if either the parenthesis is removed or if it >> comes after the following. 
> > Now you are copy-editing my e-mails. :) :) > > I don't read or write French on the level that I can evaluate your contention that the language > is digitally disadvantaged. It was heavily disadvantaged until U+202F?NARROW NO-BREAK SPACE was encoded and widely implemented. Implementation would have been speedy and straightforward if only it had been present from the beginning on, as U+2008 PUNCTUATION SPACE. Even the character name would have matched the purpose. Perhaps the Frenchmen implied were hindered in fixing that bug while being aware of its gravity. Then it was still disadvantaged by lack of ordinal indicators, but that is now fixed thanks to CLDR Technical Committee, past summer. Many thanks. Ultimately it is part of the languages using superscript as the abbreviation indicator, and not allowed by Unicode to use even the already encoded superscript letters. That was not fixed in CLDR for v34 because the browsers used to display the data, notably in the SurveyTool implemented as a web interface, still are not using decent fonts having Unicode conformant glyphs for all superscript letters and even digits as seen in some webmail interfaces. The resulting ransome note effect made it impossible to responsively back the use of those letters in natural languages as abbreviation indicators, because unlike phonetics using these letters in isolation, natural languages may have abbreviation endings encompassing more than the final letter. For the abbreviation of Magister like on the Polish postcard, that is not a problem. > > To some extent, software will always reflect the biases of its creators, and in some subtle ways > these will end up in conflict with conventions in other languages. In some cases, conventions > applied by human typesetters cannot easily be duplicated by software that cannot recognize > the meaning of the text, Very good point. That is exactly the reason why the author should be enabled to take full control over his or her text, and that is best and most universally done by correctly programming the layout driver of the keyboard used. > and in some cases we have seen languages abandoning these > conventions in recent reforms in favor of a set of rules that are a bit more "mechanistic" > if you will. > > In German, it used to be necessary to understand the word division to know whether or not > to apply a ligature. Some of the rules for combining words into compounds were changed > and that may have made that process more regular as well. That is a fine step forward for good typography. > > But still, forcing all users to become typesetters was one of the wrong turns taken during the > early development of publishing on computers. I don?t think so at all. Users were not ?forced? to do anything. If the autocorrect facilities helping over the deficient keyboarding were not welcome, they could easily be turned off. And professional typesetters always remained active, turning to the computer in the wake. I?ve experienced myself being able thanks to Microsoft?s word processor to do professionally looking typesetting. (As I was responsible for the content anyway, it didn?t make a difference.) But first I had to add some entries to Word?s autocorrect for tweaking the keyboard. > You seem to revel in knowing all the little > details in French usage, Not at all. That knowledge is a sheer necessity, and fortunately it is so narrow that you don?t need to know that much to digitally typeset French. But you need to know the relevant points. 
The fact that NARROW NO-BREAK SPACE is narrow doesn?t make it little, but it misleads people to classify it under ?fine typography?, even more in French where (as found in TUS, in French in the text) it?s called an ?espace fine ins?cable?. > but I bet not even all educated French people reach your level. Precisely on this point, perhaps not but that point is relevant mainly to those programming and documenting keyboard layouts. After that, punctuation spacing is automated on level 2 (just press Shift) and easily turned off by several means. I hope that will be welcome, as almost everyone in France is very careful to always space the big punctuation marks by the means available so far. And to always superscript the ordinal indicators and other abbreviation indicators, at least while handwriting. > > The best keyboard drivers won't help. Why do you see that they won?t help? > So the idea that every string is supposed to be > "publication-ready" remains a fallacy. However, there shouldn't be encoding obstacles > to creating publication-ready strings. (Whether created by copy-editors, typesetters, or > advanced tools that post-process draft texts). What I?d mainly like to see is that Unicode (supposing that you are writing on behalf of the Consortium) do not impose a division of the workflow. Everybody should be able to apply to any task the most appropriate process, no matter of how many parts it will consist. If a subset of end-users wish to input strings that won?t need to be modified in detail for publishing (except headings), Unicode is here to empower them to do so. Can that be taken for granted? > > If an Twitter message uses spaces around punctuation that are not the right width, who > cares; As pointed out in the paragraph of my previous e-mail just below, the main issue around punctuation spacing in French in non-justifying layout is not the width of the space characters, but their line-breaking property. Believe it or not, U+00A0 NO-BREAK SPACE is breakable in those environments, that are therefore messing around with spaced punctuation unless the space used is U+202F NARROW NO-BREAK SPACE. Or U+2007 FIGURE SPACE, but if we?re having to use an extra space character, we may as well pick the right one, given FIGURE SPACE is not fit for publishing, while NNBSP is. > but if your copy-editor can't prepare a manuscript for publication because of software > limitations, that's a different can of worms. My copy-editor is me. I wrote in my previous (perhaps too long, but couldn?t help) e-mail: ?Making such text usable for publishing needs extra work, that today many users cannot afford?, and: ?Such a draft is also a service to the user, enabling him or her to streamline the workflow. Such streamlining brings monetary and reputational benefit to the user.? The working scheme used with TeX or regexes is not interoperable, and the drafts are not all-purpose. A publishing-ready draft is in my opinion a plain text string that can be copy-pasted as-is ? or typed directly ? in a blog post composer form while being sure that all punctuation and punctuation spacing is fully operational. I don?t currently do this, but many people do, and are doing word processing where the same applies, given the autocorrect doesn?t use the up-to-date space and can hardly guess in every case what the user intends to type, you pointed out. > > A./ > >> With due respect, I need to add that the disconnect in that is visible only to >> French readers. Without NNBSP, punctuation ? 
la fran?aise in e-mails is messed >> up because even NBSP is ignored (I don?t know what exactly happens at backend; >> anyway at frontend it?s like a normal space in at least one e-mail client and >> in several if not all browsers, and if pasted in plain text from MS Word, it?s >> truly replaced with SP. All that makes e-mails harder to read. Correct spacing >> with punctuation in French is often considered ?fine-tuning?, but only if that >> punctuation spacing is not supported by the keyboard driver, and that?s still >> almost always the case, except on the updated version 1.1 of the b?po layout >> (and some personal prototypes not yet released). >> Best regards, Marcel From unicode at unicode.org Thu Nov 1 15:42:05 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:42:05 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: The 0000 is there in the UCA only because the DUCET is published in a format that uses it, but here also this format is useless: you never need any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET just needs to indicate what is the minimum weight assigned for every level (except the highest level where it is "implicitly" 0001, and not 0000). Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > There are lots of ways to implement the UCA. > > When you want fast string comparison, the zero weights are useful for > processing -- and you don't actually assemble a sort key. > > People who want sort keys usually want them to be short, so you spend time > on compression. You probably also build sort keys as byte vectors not > uint16 vectors (because byte vectors fit into more APIs and tend to be > shorter), like ICU does using the CLDR collation data file. The CLDR root > collation data file remunges all weights into fractional byte sequences, > and leaves gaps for tailoring. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:57:02 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:57:02 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: In summary, this step given in the algorithm is completely unneeded and can be dropped completely: *S3.2 *If L is not 1, append a *level separator* *Note:*The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter?in the absence of special features like contractions. For example: "abc" < "abcX" where "X" can be any character(s). Remove any reference to the "level separator" from the UCA. You never need it. As well this paragraph 7.3 Form Sort Keys *Step 3.* Construct a sort key for each collation element array by successively appending all non-zero weights from the collation element array. Figure 2 gives an example of the application of this step to one collation element array. Figure 2. Collation Element Array to Sort Key Collation Element ArraySort Key [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 can be written with this figure: Figure 2. 
Figure 2. Collation Element Array to Sort Key

  Collation Element Array:
    [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
  Sort Key:
    0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)

The parentheses mark the collation weights 0020 and 0002 that can be safely removed, provided they are respectively the minimum secondary weight and the minimum tertiary weight. Note that 0020 is still kept in two places, because those occurrences are followed by the higher weight 0021. This holds for any tailored collation, not just the DUCET.

Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy a écrit :

> The 0000 is there in the UCA only because the DUCET is published in a
> format that uses it, but there too the format is needless: you never need
> any [.0000] or [.0000.0000] in the DUCET table either. Instead, the DUCET
> just needs to indicate the minimum weight assigned for every level
> (except the highest level, where it is "implicitly" 0001, not 0000).
>
> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer a écrit :
>
>> There are lots of ways to implement the UCA.
>>
>> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>> People who want sort keys usually want them to be short, so you spend time
>> on compression. You probably also build sort keys as byte vectors, not
>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>> collation data file remunges all weights into fractional byte sequences,
>> and leaves gaps for tailoring.
>>
>> markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Thu Nov 1 16:04:40 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 22:04:40 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To: 
References: 
Message-ID: 

So it should be clear in the UCA algorithm and in the DUCET data table that "0000" is NOT a valid weight. It is just a notational placeholder, written ".0000", indicating in the DUCET format that NO weight is assigned at that level, because the collation element is ALWAYS ignorable at this level. The DUCET could just as well have used the notation ".none", or simply dropped every ".0000" from the file (provided it contains a data entry specifying the minimum weight used for each level). The notation is only intended for humans editing the file, so that they don't need to wonder which level the first indicated weight belongs to, or remember the minimum weight for that level. But the DUCET table is actually generated by a machine and processed by machines.

Le jeu. 1 nov. 2018 à 21:57, Philippe Verdy a écrit :

> In summary, this step given in the algorithm is completely unneeded and
> can be dropped:
>
> *S3.2* If L is not 1, append a *level separator*.
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that when
> two strings of unequal length are compared, where the shorter string is a
> prefix of the longer string, the longer string is always sorted after the
> shorter, in the absence of special features like contractions. For example:
> "abc" < "abcX" where "X" can be any character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need it.
> > As well this paragraph > > 7.3 Form Sort Keys > > *Step 3.* Construct a sort key for each collation element array by > successively appending all non-zero weights from the collation element > array. Figure 2 gives an example of the application of this step to one > collation element array. > > Figure 2. Collation Element Array to Sort Key > > Collation Element ArraySort Key > [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 > 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 > > can be written with this figure: > > Figure 2. Collation Element Array to Sort Key > > Collation Element ArraySort Key > [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 > 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) > > The parentheses mark the collation weights 0020 and 0002 that can be > safely removed if they are respectively the minimum secondary weight and > minimum tertiary weight. > But note that 0020 is kept in two places as they are followed by a higher > weight 0021. This is general for any tailored collation (not just the > DUCET). > > Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a ?crit : > >> The 0000 is there in the UCA only because the DUCET is published in a >> format that uses it, but here also this format is useless: you never need >> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >> just needs to indicate what is the minimum weight assigned for every level >> (except the highest level where it is "implicitly" 0001, and not 0000). >> >> >> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >> ?crit : >> >>> There are lots of ways to implement the UCA. >>> >>> When you want fast string comparison, the zero weights are useful for >>> processing -- and you don't actually assemble a sort key. >>> >>> People who want sort keys usually want them to be short, so you spend >>> time on compression. You probably also build sort keys as byte vectors not >>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>> collation data file remunges all weights into fractional byte sequences, >>> and leaves gaps for tailoring. >>> >>> markus >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 16:30:23 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:30:23 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101213023.51380fa7@JRWUBU2> On Thu, 1 Nov 2018 22:04:40 +0100 Philippe Verdy via Unicode wrote: > The DUCET could have as well used the notation ".none", or > just dropped every ".0000" in its file (provided it contains a data > entry specifying what is the minimum weight used for each level). > This notation is only intended to be read by humans editing the file, > so they don't need to wonder what is the level of the first indicated > weight or remember what is the minimum weight for that level. > But the DUCET table is actually generated by a machine and processed > by machines. A fair few humans have tailored it by hand. Richard. 
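To make the two figures above concrete, here is a minimal Python sketch (not the normative UTS #10 algorithm, and not using real DUCET data) that builds the sort key of Figure 2 both with the 0000 level separators of step S3.2 and without them, trimming trailing runs of level-minimum weights as proposed above. The per-level minima (0001, 0020, 0002) are assumptions of the sketch, not values quoted from the DUCET.

# Collation element array of Figure 2, as (primary, secondary, tertiary) tuples;
# a 0 component means "ignorable at this level".
ELEMENTS = [(0x0706, 0x0020, 0x0002),
            (0x06D9, 0x0020, 0x0002),
            (0x0000, 0x0021, 0x0002),
            (0x06EE, 0x0020, 0x0002)]

def key_with_separators(elements):
    # UTS #10 step 3: append non-zero weights level by level,
    # with a 0000 separator before each level after the first.
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key

def key_without_separators(elements, minima=(0x0001, 0x0020, 0x0002)):
    # Variant discussed above: no separator; instead drop any trailing run of
    # the minimum weight of each level, so that a prefix simply yields a
    # shorter (hence smaller) key.
    key = []
    for level in range(3):
        weights = [ce[level] for ce in elements if ce[level] != 0]
        while weights and weights[-1] == minima[level]:
            weights.pop()
        key.extend(weights)
    return key

print(" ".join("%04X" % w for w in key_with_separators(ELEMENTS)))
# -> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
print(" ".join("%04X" % w for w in key_without_separators(ELEMENTS)))
# -> 0706 06D9 06EE 0020 0020 0021

This only illustrates the two key forms for one collation element array; whether dropping the separators and the trailing minima preserves the UCA order in every case is exactly what the rest of this thread debates.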
From unicode at unicode.org Thu Nov 1 16:32:01 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:32:01 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101213201.2a9a986d@JRWUBU2> On Thu, 1 Nov 2018 21:13:46 +0100 Philippe Verdy via Unicode wrote: > I'm not speaking just about how collation keys will finally be stored > (as uint16 or bytes, or sequences of bits with variable length); I'm > just refering to the sequence of weights you generate. > You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 > weight, not even during processing, or un the DUCET table. If you take the zero weights out, you have a different table structure to store, e.g. the CLDR fractional weight tables. Richard. From unicode at unicode.org Thu Nov 1 16:47:40 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:47:40 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101214740.57853cc1@JRWUBU2> On Thu, 1 Nov 2018 18:39:16 +0100 Philippe Verdy via Unicode wrote: > What this means is that we can safely implement UCA using basic > substitions (e.g. with a function like "string:gsub(map)" in Lua > which uses a "map" to map source (binary) strings or regexps,into > target (binary) strings: > > For a level-3 collation, you just then need only 3 calls to > "string:gsub()" to compute any collation: > > - the first ":gsub(mapNormalize)" can decompose a source text into > collation elements and can perform reordering to enforce a normalized > order (possibly tuned for the tailored locale) using basic regexps. Are you sure of this? Will you publish the algorithm? Have you passed the official conformance tests? (Mind you, DUCET is a relatively easy UCA collation to implement successfully.) > - the second ":gsub(mapSecondary)" will substitute any collection > elements by their "intermediary" collation elements+tertiary weight. > > - the third ":gsub(mapSecondary)" will substitute any "intermediary" > collation element by their primary weight + secondary weight Richard. From unicode at unicode.org Thu Nov 1 16:56:06 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:56:06 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <86d0roiufa.fsf@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> Message-ID: <20181101215606.30dd6ced@JRWUBU2> On Thu, 01 Nov 2018 18:23:05 +0100 "Janusz S. Bie? via Unicode" wrote: > On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > > I don't think it's a joke to recognize that there is a continuum > > here and that there is no line that can be drawn which is based on > > straightforward principles. This is a pattern that keeps surfacing > > the deeper you look at character coding questions. > > Looks like you completely missed my point. Nobody ever claimed that > reproducing all variations in manuscripts is in scope of Unicode, so > whom do you want to convince that it is not? I think the counter-claim is that one will never be able to encode all the meaning-conveying distinctions of text in Unicode. Richard. 
From unicode at unicode.org Thu Nov 1 18:38:08 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 00:38:08 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: As well the step 2 of the algorithm speaks about a single "array" of collation elements. Actually it's best to create one separate array per level, and append weights for each level in the relevant array for that level. The steps S2.2 to S2.4 can do this, including for derived collation elements in section 10.1, or variable weighting in section 4. This also means that for fast string compares, the primary weights can be processed on the fly (without needing any buffering) is the primary weights are different between the two strings (including when one or both of the two strings ends, and the secondary weights or tertiary weights detected until then have not found any weight higher than the minimum weight value for each level). Otherwise: - the first secondary weight higher that the minimum secondary weght value, and all subsequent secondary weights must be buffered in a secondary buffer . - the first tertiary weight higher that the minimum secondary weght value, and all subsequent secondary weights must be buffered in a tertiary buffer. - and so on for higher levels (each buffer just needs to keep a counter, when it's first used, indicating how many weights were not buffered while processing and counting the primary weights, because all these weights were all equal to the minimum value for the relevant level) - these secondary/tertiary/etc. buffers will only be used once you reach the end of the two strings when processing the primary level and no difference was found: you'll start by comparing the initial counters in these buffers and the buffer that has the largest counter value is necessarily for the smaller compared string. If both counters are equal, then you start comparing the weights stored in each buffer, until one of the buffers ends before another (the shorter buffer is for the smaller compared string). If both weight buffers reach the end, you use the next pair of buffers built for the next level and process them with the same algorithm. Nowhere you'll ever need to consider any [.0000] weight which is just a notation in the format of the DUCET intended only to be readable by humans but never needed in any machine implementation. Now if you want to create sort keys this is similar except that you don"t have two strings to process and compare, all you want is to create separate arrays of weights for each level: each level can be encoded separately, the encoding must be made so that when you'll concatenate the encoded arrays, the first few encoded *bits* in the secondary or tertiary encodings cannot be larger or equal to the bits used by the encoding of the primary weights (this only limits how you'll encode the 1st weight in each array as its first encoding *bits* must be lower than the first bits used to encode any weight in previous levels). Nowhere you are required to encode weights exactly like their logical weight, this encoding is fully reversible and can use any suitable compression technics if needed. As long as you can safely detect when an encoding ends, because it encounters some bits (with lower values) used to start the encoding of one of the higher levels, the compression is safe. 
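As a concrete (and much simplified) sketch of that idea, assume single-byte weights no lower than 0x02 and a one-byte "mark" 0x01 that introduces each following level; the weight values below are made up for illustration. Because every weight byte compares above the mark, the concatenated key can be compared bytewise, a prefix still sorts first, and no 0000 weight ever has to exist. Real weights would of course need multi-byte codes and the compression described below.

def byte_key(per_level_weights):
    out = bytearray()
    for level, weights in enumerate(per_level_weights):
        if level:
            out.append(0x01)   # start-of-next-level mark, below any weight byte
        for w in weights:
            if not 0x02 <= w <= 0xFF:
                raise ValueError("this sketch assumes one-byte weights >= 0x02")
            out.append(w)
    return bytes(out)

# "abc" vs "abcX" with made-up weights: the prefix gets the smaller key,
# because the mark byte 0x01 compares below the longer key's next weight byte.
abc  = byte_key([[0x30, 0x32, 0x34],
                 [0x20, 0x20, 0x20],
                 [0x02, 0x02, 0x02]])
abcx = byte_key([[0x30, 0x32, 0x34, 0x36],
                 [0x20, 0x20, 0x20, 0x20],
                 [0x02, 0x02, 0x02, 0x02]])
assert abc < abcx   # bytewise comparison, no zero weight anywhere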
For each level, you can reserve only a single code used to "mark" the start of another higher level followed by some bits to indicate which level it is, then followed by the compressed code for the level made so that each weight is encoded by a code not starting by the reserved mark. That encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' (if the encoding must be readable as ASCII or UTF-8-based, and must not use any control or SPACE or isolated surrogate) and codes used to encode each weight must not start by a byte lower or equal to this mark. The binary or ASCII code units used to encode each weight must just be comparable, so that comparing codes is equivalent to compare weights represented by each code. As well, you are not required to store multiple "marks". This is just one of the possibilities to encode in the sort key which level is encoded after each "mark", and the marks are not necessarily the same before each level (their length may also vary depending on the level they are starting): these marks may be completely removed from the final encoding if the encoding/compression used allows discriminating the level used by all weights, encoded in separate sets of values. Typical compression technics are for example differencial, notably in secondary or higher levels, and run-legth encoded to skip sequences of weights all equal to the minimum weight. The code units used by the weigh encoding for each level may also need to avoid some forbidden values if needed (e.g. when encoding the weights to UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units reserved for or representing an isolate surrogate in U+D800..U+DFFF as this would create a string not conforming to any standard UTF). Once again this means that the sequence of logical weight will can sefely become a readable string, even suitable to be transmitted as plain-text using any UTF, and that compression is also possible in that case: you can create and store lot of sort keys even for very long texts However it is generally better to just encode sort keys only for a reasonnably discriminant part of the text, e.g. no sort key longer than 255 bytes (created from the start of the original texts): if you compare two sort keys and find that they are equal, and if both sort keys have this length of 255 bytes, then you'll compare the full original texts using the fast-compare algorithm: you don't need to store full sort keys in addition to the original texts. This can save lot of storage, provided that original texts are sufficiently discriminated by their start, and that cases where the sort keys were truncated to the limit of 255 bytes are exceptionnal. For short texts however, truncated sortkeys may save time at the price of a reasonnable storage cost (but sortkeys can be also encoded with roughly the same size as the original text: compression is modest for the encoded primary level. But compression is frequently very effective for higher levels where their smaller weight also have less possible variations of value, in a smaller set. Notably for the secondary level used to encode case differences, only 3 bits are enough per weight, and you just need to reserve the 3-bit value "000" as the "mark" for indicating the start of another higher level, while encoding secondary weights as "001" to "111". (This means that primary levels have to be encoded so that none of their encoded primary weights are starting with "000" marking the start of the secondary level. 
So primary weights can be encoded in patterns starting by "0001", "001", "01", or "1" and followed by other bits: this allows encoding them as readable UTF-8 if these characters are all different at primary level, excluding only the 16 first C0 controls which need to be preprocessed into escape sequences using the first permitted C0 control as an escape, and escaping that C0 control itself). The third level, started by the mark "00" and followed by the encoded weights indicating this is a tertiary level and not an higher level, will also be used to encode a small set of weights (in most locales, this is not more than 8 or 16, so you need only 3 or 4 bits to encode weights (using differential coding on 3-bits, you reserve "000" as the "mark" for the next higher level, then use "001" to "111" to encode differencial weights, the differencial weights being initially based on the minimum tertiary weight, you'll use the bit pattern "001" to encode the most frequent minimum tertiary weight, and patterns "01" to "11" plus additional bits to encode other positive or negative differences of tertiary weights, or to use run-length compression). Here also it is possible to map the patterns so that the encoded secondary weight will be readable valid UTF-8. The fourth level, started by the mark "000" can use the pattern "001" to encode the most frequent minimum quaternary weight, and patterns "010" to "011" followed by other bits to differentially encode the quaternary weights. Here again it is possible to create an encoding for quaternary weights that can use some run-length compression and can also be readable valid UTF-8! And so on. Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a ?crit : > So it should be clear in the UCA algorithm and in the DUCET datatable that > "0000" is NOT a valid weight > It is just a notational placeholder used as ".0000", only indicating in > the DUCET format that there's NO weight assigned at the indicated level, > because the collation element is ALWAYS ignorable at this level. > The DUCET could have as well used the notation ".none", or just dropped > every ".0000" in its file (provided it contains a data entry specifying > what is the minimum weight used for each level). This notation is only > intended to be read by humans editing the file, so they don't need to > wonder what is the level of the first indicated weight or remember what is > the minimum weight for that level. > But the DUCET table is actually generated by a machine and processed by > machines. > > > > Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a ?crit : > >> In summary, this step given in the algorithm is completely unneeded and >> can be dropped completely: >> >> *S3.2 *If L is not 1, append a *level >> separator* >> >> *Note:*The level separator is zero (0000), which is guaranteed to be >> lower than any weight in the resulting sort key. This guarantees that when >> two strings of unequal length are compared, where the shorter string is a >> prefix of the longer string, the longer string is always sorted after the >> shorter?in the absence of special features like contractions. For example: >> "abc" < "abcX" where "X" can be any character(s). >> >> Remove any reference to the "level separator" from the UCA. You never >> need it. >> >> As well this paragraph >> >> 7.3 Form Sort Keys >> >> *Step 3.* Construct a sort key for each collation element array by >> successively appending all non-zero weights from the collation element >> array. 
Figure 2 gives an example of the application of this step to one >> collation element array. >> >> Figure 2. Collation Element Array to Sort Key >> >> Collation Element ArraySort Key >> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 >> 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 >> >> can be written with this figure: >> >> Figure 2. Collation Element Array to Sort Key >> >> Collation Element ArraySort Key >> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >> >> The parentheses mark the collation weights 0020 and 0002 that can be >> safely removed if they are respectively the minimum secondary weight and >> minimum tertiary weight. >> But note that 0020 is kept in two places as they are followed by a higher >> weight 0021. This is general for any tailored collation (not just the >> DUCET). >> >> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >> ?crit : >> >>> The 0000 is there in the UCA only because the DUCET is published in a >>> format that uses it, but here also this format is useless: you never need >>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>> just needs to indicate what is the minimum weight assigned for every level >>> (except the highest level where it is "implicitly" 0001, and not 0000). >>> >>> >>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>> ?crit : >>> >>>> There are lots of ways to implement the UCA. >>>> >>>> When you want fast string comparison, the zero weights are useful for >>>> processing -- and you don't actually assemble a sort key. >>>> >>>> People who want sort keys usually want them to be short, so you spend >>>> time on compression. You probably also build sort keys as byte vectors not >>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>> collation data file remunges all weights into fractional byte sequences, >>>> and leaves gaps for tailoring. >>>> >>>> markus >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 21:45:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 02:45:27 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101215606.30dd6ced@JRWUBU2> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: Richard Wordingham responded to Janusz S. Bie?, >> ... Nobody ever claimed that reproducing all variations >> in manuscripts is in scope of Unicode, so whom do you want >> to convince that it is not? > > I think the counter-claim is that one will never be able > to encode all the meaning-conveying distinctions of text > in Unicode. I think that the general agreement is that Unicode plain text isn't intended for preserving stylistic differences.? The dilemma is that opinions differ as to what constitutes a stylistic difference. 
If there had been an "International Typewriter Usage Consortium" a hundred years ago which had issued an edict like "the underscore is placed on the keyboard for the explicit purpose of typing empty lines for 'fill-in-the-blank' forms, and must never be used by the typist to underline any other element of type", then that consortium would have been dictating how users perceive their own written symbols along with preventing users from establishing new conventions using existing symbols, experimenting, or innovating. Some people consider that Unicode is essentially doing the same kind of thing.? It's *that* perception which needs to be addressed, perhaps with FAQs and education, or with some kind of revisiting and rethinking.? Or both. From unicode at unicode.org Thu Nov 1 21:59:46 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 02:59:46 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Alphabetic script users write things the way they are spelled and spell things the way they are written.? The abbreviation in question as written consists of three recognizable symbols.? An "M", a superscript "r", and an equal sign (= two lines).? It can be printed, handwritten, or in fraktur; it will still consist of those same three recognizable symbols. We're supposed to be preserving the past, not editing it or revising it. From unicode at unicode.org Fri Nov 2 00:22:59 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 22:22:59 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 00:44:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 06:44:35 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> (Asmus Freytag via Unicode's message of "Thu, 1 Nov 2018 13:34:05 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> Message-ID: <86r2g4uj7g.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 13:34 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 10:23 AM, Janusz S. Bie? via Unicode wrote: [...] > Looks like you completely missed my point. Nobody ever claimed that > reproducing all variations in manuscripts is in scope of Unicode, so > whom do you want to convince that it is not? 
> > Looks like you are missing my point about there being a continuum with > not clear lines that can be perfectly drawn a-priori. Why do you think so? There is nothing in my posts which can be used to support your claim. Perhaps you confused me with some other poster? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 01:05:06 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 07:05:06 +0100 Subject: mail attribution (was: A sign/abbreviation for "magister") References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> <20181101075209.5ffbba7d@JRWUBU2> <97890362-7550-2e43-2266-a41853b89ba7@ix.netcom.com> Message-ID: <865zxgui99.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 6:43 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote: > > On Wed, 31 Oct 2018 11:35:19 -0700 > Asmus Freytag via Unicode wrote: [...] > Unfortunately, your emails are extremely hard to read in plain text. > It is even difficult to tell who wrote what. My previous mail is unfortunately an example. > > Not sure why that is. After they make the round trip, they look fine > to me. When displaying your HTML mail, Emacs Gnus doesn't show correctly the attributions. If I forget to edit it by hand when replying, we get the confusion like in my previous mail. I guess I should submit this as a bug or feature request to Emacs developers. Perhaps Richard Wordingham should do the same for the mail agent he uses. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 02:16:35 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 07:16:35 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: Asmus Freytag wrote, > Alphabetic script users' handwriting does not match > print in all features. Traditional German handwriting > used a line like a macron over the letter 'u' to > distinguish it from 'n'. Rendering this with a > u-macron in print would be the height of absurdity. If German text were displayed with a traditional German handwriting (cursive) font, then every "u" would display with a macron.? (Except the ones with umlauts.)? That's because the macron is part and parcel of the identity of the stylistic variant (cursive) of the letter, not because the addition of the macron makes a stylistic variation.? It would indeed be silly to encode such macrons in data derived from a traditional German handwriting specimen.? Hopefully most everyone here agrees with that. We all seem to accept that, for example, d = d = d = d. We all don't seem to agree that d # d?. Or that "Mr." # "Mr" # "M?" # "M??" # "M:r". 
From unicode at unicode.org Fri Nov 2 03:54:36 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Fri, 2 Nov 2018 08:54:36 +0000 (GMT) Subject: A sign/abbreviation for "magister" References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: On 2018-11-02, James Kass via Unicode wrote: > Alphabetic script users write things the way they are spelled and spell > things the way they are written.? The abbreviation in question as > written consists of three recognizable symbols.? An "M", a superscript > "r", and an equal sign (= two lines).? It can be printed, handwritten, That's not true. The squiggle under the r is a squiggle - it is a matter of interpretation (on which there was some discussion a hundred messages up-thread or so :) whether it was intended to be = . Just as it is a matter of interpretation whether the superscript and squiggle were deeply meaningful to the writer, or whether they were just a stylistic flourish for Mr. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Fri Nov 2 04:48:01 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 09:48:01 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: Julian Bradfield wrote, >> consists of three recognizable symbols.? An "M", a superscript >> "r", and an equal sign (= two lines).? It can be printed, handwritten, > > That's not true. The squiggle under the r is a squiggle - it is a > matter of interpretation (on which there was some discussion a hundred > messages up-thread or so :) whether it was intended to be = . I recall Asmus pointing out that the Z-like squiggle was likely a handwritten "=" and that there was some agreement to this, but didn't realize that it was in dispute.? FWIW, I agree that the squiggle which looks kind of like "?" is simply the cursive, stylistic variant of "=", especially when written quickly. > Just as it is a matter of interpretation whether the superscript and > squiggle were deeply meaningful to the writer, or whether they were > just a stylistic flourish for Mr. A third possibility is that the double-underlined superscript was a writing/spelling convention of the time for writing/spelling abbreviations. Even if someone produced contemporary Polish manuscripts abbreviating magister as "Mr", it could be argued that the two writers were simply using different conventions. 
From unicode at unicode.org Fri Nov 2 06:31:06 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 11:31:06 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> Suppose someone found a hundred year old form from Poland which included a section for "sign your name" and "print your name" which had been filled out by a man with the typically Polish name of Bogus McCoy?? And he was a Magister, to boot!? And proud of it. If he signed the magister abbreviation using double-underlined superscript and likewise his surname *and* printed it the same way -- it might still be arguable as to whether it was a writing/spelling or a stylish distinction, I suppose. But if he signed using double-underlined superscripts and printed using baseline lower case Latin letters, *that* might be persuasive. Doesn't seem likely, though, does it? (Bogus?aw is a legitimate Polish masculine given name.? Its nickname is Bogus.? McCoy is not, however, a typical Polish surname.? The snarky combination of "Bogus McCoy" was irresistible to someone of my character and temperament.? "Bogus" is American slang for fake and "McCoy" connotes being genuine, as in "the real McCoy".) From unicode at unicode.org Fri Nov 2 07:09:51 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 2 Nov 2018 05:09:51 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> Message-ID: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 08:03:37 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 2 Nov 2018 14:03:37 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: You may not like the format of the data, but you are not bound to it. If you don't like the data format (eg you want [.0021.0002] instead of [.0000.0021.0002]), you can transform it however you want as long as you get the same answer, as it says here: http://unicode.org/reports/tr10/#Conformance ?The Unicode Collation Algorithm is a logical specification. Implementations are free to change any part of the algorithm as long as any two strings compared by the implementation are ordered the same as they would be by the algorithm as specified. Implementations may also use a different format for the data in the Default Unicode Collation Element Table. The sort key is a logical intermediate object: if an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified in this document. (See Section 9, Implementation Notes.)? 
That is what is done, for example, in ICU's implementation. See http://demo.icu-project.org/icu-bin/collation.html and turn on "raw collation elements" and "sort keys" to see the transformed collation elements (from the DUCET + CLDR) and the resulting sort keys. a =>[29,05,_05] => 29 , 05 , 05 . a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . ? => A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . ? => Mark On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > As well the step 2 of the algorithm speaks about a single "array" of > collation elements. Actually it's best to create one separate array per > level, and append weights for each level in the relevant array for that > level. > The steps S2.2 to S2.4 can do this, including for derived collation > elements in section 10.1, or variable weighting in section 4. > > This also means that for fast string compares, the primary weights can be > processed on the fly (without needing any buffering) is the primary weights > are different between the two strings (including when one or both of the > two strings ends, and the secondary weights or tertiary weights detected > until then have not found any weight higher than the minimum weight value > for each level). > Otherwise: > - the first secondary weight higher that the minimum secondary weght > value, and all subsequent secondary weights must be buffered in a > secondary buffer . > - the first tertiary weight higher that the minimum secondary weght value, > and all subsequent secondary weights must be buffered in a tertiary buffer. > - and so on for higher levels (each buffer just needs to keep a counter, > when it's first used, indicating how many weights were not buffered while > processing and counting the primary weights, because all these weights were > all equal to the minimum value for the relevant level) > - these secondary/tertiary/etc. buffers will only be used once you reach > the end of the two strings when processing the primary level and no > difference was found: you'll start by comparing the initial counters in > these buffers and the buffer that has the largest counter value is > necessarily for the smaller compared string. If both counters are equal, > then you start comparing the weights stored in each buffer, until one of > the buffers ends before another (the shorter buffer is for the smaller > compared string). If both weight buffers reach the end, you use the next > pair of buffers built for the next level and process them with the same > algorithm. > > Nowhere you'll ever need to consider any [.0000] weight which is just a > notation in the format of the DUCET intended only to be readable by humans > but never needed in any machine implementation. > > Now if you want to create sort keys this is similar except that you don"t > have two strings to process and compare, all you want is to create separate > arrays of weights for each level: each level can be encoded separately, the > encoding must be made so that when you'll concatenate the encoded arrays, > the first few encoded *bits* in the secondary or tertiary encodings cannot > be larger or equal to the bits used by the encoding of the primary weights > (this only limits how you'll encode the 1st weight in each array as its > first encoding *bits* must be lower than the first bits used to encode any > weight in previous levels). 
> > Nowhere you are required to encode weights exactly like their logical > weight, this encoding is fully reversible and can use any suitable > compression technics if needed. As long as you can safely detect when an > encoding ends, because it encounters some bits (with lower values) used to > start the encoding of one of the higher levels, the compression is safe. > > For each level, you can reserve only a single code used to "mark" the > start of another higher level followed by some bits to indicate which level > it is, then followed by the compressed code for the level made so that each > weight is encoded by a code not starting by the reserved mark. That > encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' > (if the encoding must be readable as ASCII or UTF-8-based, and must not use > any control or SPACE or isolated surrogate) and codes used to encode each > weight must not start by a byte lower or equal to this mark. The binary or > ASCII code units used to encode each weight must just be comparable, so > that comparing codes is equivalent to compare weights represented by each > code. > > As well, you are not required to store multiple "marks". This is just one > of the possibilities to encode in the sort key which level is encoded after > each "mark", and the marks are not necessarily the same before each level > (their length may also vary depending on the level they are starting): > these marks may be completely removed from the final encoding if the > encoding/compression used allows discriminating the level used by all > weights, encoded in separate sets of values. > > Typical compression technics are for example differencial, notably in > secondary or higher levels, and run-legth encoded to skip sequences of > weights all equal to the minimum weight. > > The code units used by the weigh encoding for each level may also need to > avoid some forbidden values if needed (e.g. when encoding the weights to > UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units > reserved for or representing an isolate surrogate in U+D800..U+DFFF as this > would create a string not conforming to any standard UTF). > > Once again this means that the sequence of logical weight will can sefely > become a readable string, even suitable to be transmitted as plain-text > using any UTF, and that compression is also possible in that case: you can > create and store lot of sort keys even for very long texts > > However it is generally better to just encode sort keys only for a > reasonnably discriminant part of the text, e.g. no sort key longer than 255 > bytes (created from the start of the original texts): if you compare two > sort keys and find that they are equal, and if both sort keys have this > length of 255 bytes, then you'll compare the full original texts using the > fast-compare algorithm: you don't need to store full sort keys in addition > to the original texts. This can save lot of storage, provided that original > texts are sufficiently discriminated by their start, and that cases where > the sort keys were truncated to the limit of 255 bytes are exceptionnal. > > For short texts however, truncated sortkeys may save time at the price of > a reasonnable storage cost (but sortkeys can be also encoded with roughly > the same size as the original text: compression is modest for the encoded > primary level. 
But compression is frequently very effective for higher > levels where their smaller weight also have less possible variations of > value, in a smaller set. > > Notably for the secondary level used to encode case differences, only 3 > bits are enough per weight, and you just need to reserve the 3-bit value > "000" as the "mark" for indicating the start of another higher level, while > encoding secondary weights as "001" to "111". > > (This means that primary levels have to be encoded so that none of their > encoded primary weights are starting with "000" marking the start of the > secondary level. So primary weights can be encoded in patterns starting by > "0001", "001", "01", or "1" and followed by other bits: this allows > encoding them as readable UTF-8 if these characters are all different at > primary level, excluding only the 16 first C0 controls which need to be > preprocessed into escape sequences using the first permitted C0 control as > an escape, and escaping that C0 control itself). > > The third level, started by the mark "00" and followed by the encoded > weights indicating this is a tertiary level and not an higher level, will > also be used to encode a small set of weights (in most locales, this is not > more than 8 or 16, so you need only 3 or 4 bits to encode weights (using > differential coding on 3-bits, you reserve "000" as the "mark" for the next > higher level, then use "001" to "111" to encode differencial weights, the > differencial weights being initially based on the minimum tertiary weight, > you'll use the bit pattern "001" to encode the most frequent minimum > tertiary weight, and patterns "01" to "11" plus additional bits to encode > other positive or negative differences of tertiary weights, or to use > run-length compression). Here also it is possible to map the patterns so > that the encoded secondary weight will be readable valid UTF-8. > > The fourth level, started by the mark "000" can use the pattern "001" to > encode the most frequent minimum quaternary weight, and patterns "010" to > "011" followed by other bits to differentially encode the quaternary > weights. Here again it is possible to create an encoding for quaternary > weights that can use some run-length compression and can also be readable > valid UTF-8! > > And so on. > > > > > > > > > Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a ?crit : > >> So it should be clear in the UCA algorithm and in the DUCET datatable >> that "0000" is NOT a valid weight >> It is just a notational placeholder used as ".0000", only indicating in >> the DUCET format that there's NO weight assigned at the indicated level, >> because the collation element is ALWAYS ignorable at this level. >> The DUCET could have as well used the notation ".none", or just dropped >> every ".0000" in its file (provided it contains a data entry specifying >> what is the minimum weight used for each level). This notation is only >> intended to be read by humans editing the file, so they don't need to >> wonder what is the level of the first indicated weight or remember what is >> the minimum weight for that level. >> But the DUCET table is actually generated by a machine and processed by >> machines. >> >> >> >> Le jeu. 1 nov. 2018 ? 
21:57, Philippe Verdy a >> ?crit : >> >>> In summary, this step given in the algorithm is completely unneeded and >>> can be dropped completely: >>> >>> *S3.2 *If L is not 1, append a *level >>> separator* >>> >>> *Note:*The level separator is zero (0000), which is guaranteed to be >>> lower than any weight in the resulting sort key. This guarantees that when >>> two strings of unequal length are compared, where the shorter string is a >>> prefix of the longer string, the longer string is always sorted after the >>> shorter?in the absence of special features like contractions. For example: >>> "abc" < "abcX" where "X" can be any character(s). >>> >>> Remove any reference to the "level separator" from the UCA. You never >>> need it. >>> >>> As well this paragraph >>> >>> 7.3 Form Sort Keys >>> >>> *Step 3.* Construct a sort key for each collation element array by >>> successively appending all non-zero weights from the collation element >>> array. Figure 2 gives an example of the application of this step to one >>> collation element array. >>> >>> Figure 2. Collation Element Array to Sort Key >>> >>> Collation Element ArraySort Key >>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>> 0002 0002 0002 >>> >>> can be written with this figure: >>> >>> Figure 2. Collation Element Array to Sort Key >>> >>> Collation Element ArraySort Key >>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>> >>> The parentheses mark the collation weights 0020 and 0002 that can be >>> safely removed if they are respectively the minimum secondary weight and >>> minimum tertiary weight. >>> But note that 0020 is kept in two places as they are followed by a >>> higher weight 0021. This is general for any tailored collation (not just >>> the DUCET). >>> >>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>> ?crit : >>> >>>> The 0000 is there in the UCA only because the DUCET is published in a >>>> format that uses it, but here also this format is useless: you never need >>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>> just needs to indicate what is the minimum weight assigned for every level >>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>> ?crit : >>>> >>>>> There are lots of ways to implement the UCA. >>>>> >>>>> When you want fast string comparison, the zero weights are useful for >>>>> processing -- and you don't actually assemble a sort key. >>>>> >>>>> People who want sort keys usually want them to be short, so you spend >>>>> time on compression. You probably also build sort keys as byte vectors not >>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>>> collation data file remunges all weights into fractional byte sequences, >>>>> and leaves gaps for tailoring. >>>>> >>>>> markus >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Nov 2 08:44:25 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 2 Nov 2018 13:44:25 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: I write my 7?s and Z?s with a horizontal line through them. ? is encoded not for this purpose, but because Z and ? are distinct in orthographies for varieties of Tatar, Chechen, Karelian, and Mongolian. This is a contemporary writing convention but it does not argue for a new SEVEN WITH STROKE character or that I should use ? rather than Z when I write *?an?ibar. Michael Everson > On 2 Nov 2018, at 09:48, James Kass via Unicode wrote: > > A third possibility is that the double-underlined superscript was a writing/spelling convention of the time for writing/spelling abbreviations. From unicode at unicode.org Fri Nov 2 08:47:24 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 14:47:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101215606.30dd6ced@JRWUBU2> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: <94d21f1f-7adc-c433-38ce-465383daca01@orange.fr> On 01/11/2018 22:56, Richard Wordingham via Unicode wrote: > On Thu, 01 Nov 2018 18:23:05 +0100 > "Janusz S. Bie? via Unicode" wrote: > >> On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > >>> I don't think it's a joke to recognize that there is a continuum As a sidenote: I remember something called the "continuum bias" but turn out unable to retrieve a relevant page on the internet. >>> here and that there is no line that can be drawn which is based on >>> straightforward principles. This is a pattern that keeps surfacing >>> the deeper you look at character coding questions. >> >> Looks like you completely missed my point. Nobody ever claimed that >> reproducing all variations in manuscripts is in scope of Unicode, so >> whom do you want to convince that it is not? > > I think the counter-claim is that one will never be able to encode all > the meaning-conveying distinctions of text in Unicode. Much is already done using variation selectors, so I can easily figure out that UTC will allow one of the 200+ already encoded variation selectors to be defined as directing the rendering engine to add a double line below a superscript abbreviation indicator, and another one to add a single line, according to mainstream ordinal indicators having one or zero underlines depending on the typeface, and NUMERO SIGN showing currently two lines like the "Magister" abbreviation on the Polish postcard. Another option would be using the variation selector scheme to make any letter an abbreviation indicator needing appropriate display in superscript plus zero through two underlines. Personally I wouldn?t favor this scheme for Latin abbreviations, given using preformatted superscripts is most straightforward. 
Best regards, Marcel From unicode at unicode.org Fri Nov 2 08:54:19 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 14:54:19 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: It's not just a question of "I like it or not". But the fact that the standard makes the presence of 0000 required in some steps, and the requirement is in fact wrong: this is in fact NEVER required to create an equivalent collation order. these steps are completely unnecessary and should be removed. Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a ?crit : > You may not like the format of the data, but you are not bound to it. If > you don't like the data format (eg you want [.0021.0002] instead of > [.0000.0021.0002]), you can transform it however you want as long as you > get the same answer, as it says here: > > http://unicode.org/reports/tr10/#Conformance > ?The Unicode Collation Algorithm is a logical specification. > Implementations are free to change any part of the algorithm as long as any > two strings compared by the implementation are ordered the same as they > would be by the algorithm as specified. Implementations may also use a > different format for the data in the Default Unicode Collation Element > Table. The sort key is a logical intermediate object: if an implementation > produces the same results in comparison of strings, the sort keys can > differ in format from what is specified in this document. (See Section 9, > Implementation Notes.)? > > > That is what is done, for example, in ICU's implementation. See > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw > collation elements" and "sort keys" to see the transformed collation > elements (from the DUCET + CLDR) and the resulting sort keys. > > a =>[29,05,_05] => 29 , 05 , 05 . > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . > ? => > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . > ? => > > Mark > > > On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> As well the step 2 of the algorithm speaks about a single "array" of >> collation elements. Actually it's best to create one separate array per >> level, and append weights for each level in the relevant array for that >> level. >> The steps S2.2 to S2.4 can do this, including for derived collation >> elements in section 10.1, or variable weighting in section 4. >> >> This also means that for fast string compares, the primary weights can be >> processed on the fly (without needing any buffering) is the primary weights >> are different between the two strings (including when one or both of the >> two strings ends, and the secondary weights or tertiary weights detected >> until then have not found any weight higher than the minimum weight value >> for each level). >> Otherwise: >> - the first secondary weight higher that the minimum secondary weght >> value, and all subsequent secondary weights must be buffered in a >> secondary buffer . >> - the first tertiary weight higher that the minimum secondary weght >> value, and all subsequent secondary weights must be buffered in a tertiary >> buffer. >> - and so on for higher levels (each buffer just needs to keep a counter, >> when it's first used, indicating how many weights were not buffered while >> processing and counting the primary weights, because all these weights were >> all equal to the minimum value for the relevant level) >> - these secondary/tertiary/etc. 
buffers will only be used once you reach >> the end of the two strings when processing the primary level and no >> difference was found: you'll start by comparing the initial counters in >> these buffers and the buffer that has the largest counter value is >> necessarily for the smaller compared string. If both counters are equal, >> then you start comparing the weights stored in each buffer, until one of >> the buffers ends before another (the shorter buffer is for the smaller >> compared string). If both weight buffers reach the end, you use the next >> pair of buffers built for the next level and process them with the same >> algorithm. >> >> Nowhere you'll ever need to consider any [.0000] weight which is just a >> notation in the format of the DUCET intended only to be readable by humans >> but never needed in any machine implementation. >> >> Now if you want to create sort keys this is similar except that you don"t >> have two strings to process and compare, all you want is to create separate >> arrays of weights for each level: each level can be encoded separately, the >> encoding must be made so that when you'll concatenate the encoded arrays, >> the first few encoded *bits* in the secondary or tertiary encodings cannot >> be larger or equal to the bits used by the encoding of the primary weights >> (this only limits how you'll encode the 1st weight in each array as its >> first encoding *bits* must be lower than the first bits used to encode any >> weight in previous levels). >> >> Nowhere you are required to encode weights exactly like their logical >> weight, this encoding is fully reversible and can use any suitable >> compression technics if needed. As long as you can safely detect when an >> encoding ends, because it encounters some bits (with lower values) used to >> start the encoding of one of the higher levels, the compression is safe. >> >> For each level, you can reserve only a single code used to "mark" the >> start of another higher level followed by some bits to indicate which level >> it is, then followed by the compressed code for the level made so that each >> weight is encoded by a code not starting by the reserved mark. That >> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >> any control or SPACE or isolated surrogate) and codes used to encode each >> weight must not start by a byte lower or equal to this mark. The binary or >> ASCII code units used to encode each weight must just be comparable, so >> that comparing codes is equivalent to compare weights represented by each >> code. >> >> As well, you are not required to store multiple "marks". This is just one >> of the possibilities to encode in the sort key which level is encoded after >> each "mark", and the marks are not necessarily the same before each level >> (their length may also vary depending on the level they are starting): >> these marks may be completely removed from the final encoding if the >> encoding/compression used allows discriminating the level used by all >> weights, encoded in separate sets of values. >> >> Typical compression technics are for example differencial, notably in >> secondary or higher levels, and run-legth encoded to skip sequences of >> weights all equal to the minimum weight. >> >> The code units used by the weigh encoding for each level may also need to >> avoid some forbidden values if needed (e.g. 
when encoding the weights to >> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >> would create a string not conforming to any standard UTF). >> >> Once again this means that the sequence of logical weight will can sefely >> become a readable string, even suitable to be transmitted as plain-text >> using any UTF, and that compression is also possible in that case: you can >> create and store lot of sort keys even for very long texts >> >> However it is generally better to just encode sort keys only for a >> reasonnably discriminant part of the text, e.g. no sort key longer than 255 >> bytes (created from the start of the original texts): if you compare two >> sort keys and find that they are equal, and if both sort keys have this >> length of 255 bytes, then you'll compare the full original texts using the >> fast-compare algorithm: you don't need to store full sort keys in addition >> to the original texts. This can save lot of storage, provided that original >> texts are sufficiently discriminated by their start, and that cases where >> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >> >> For short texts however, truncated sortkeys may save time at the price of >> a reasonnable storage cost (but sortkeys can be also encoded with roughly >> the same size as the original text: compression is modest for the encoded >> primary level. But compression is frequently very effective for higher >> levels where their smaller weight also have less possible variations of >> value, in a smaller set. >> >> Notably for the secondary level used to encode case differences, only 3 >> bits are enough per weight, and you just need to reserve the 3-bit value >> "000" as the "mark" for indicating the start of another higher level, while >> encoding secondary weights as "001" to "111". >> >> (This means that primary levels have to be encoded so that none of their >> encoded primary weights are starting with "000" marking the start of the >> secondary level. So primary weights can be encoded in patterns starting by >> "0001", "001", "01", or "1" and followed by other bits: this allows >> encoding them as readable UTF-8 if these characters are all different at >> primary level, excluding only the 16 first C0 controls which need to be >> preprocessed into escape sequences using the first permitted C0 control as >> an escape, and escaping that C0 control itself). >> >> The third level, started by the mark "00" and followed by the encoded >> weights indicating this is a tertiary level and not an higher level, will >> also be used to encode a small set of weights (in most locales, this is not >> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >> differential coding on 3-bits, you reserve "000" as the "mark" for the next >> higher level, then use "001" to "111" to encode differencial weights, the >> differencial weights being initially based on the minimum tertiary weight, >> you'll use the bit pattern "001" to encode the most frequent minimum >> tertiary weight, and patterns "01" to "11" plus additional bits to encode >> other positive or negative differences of tertiary weights, or to use >> run-length compression). Here also it is possible to map the patterns so >> that the encoded secondary weight will be readable valid UTF-8. 
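To make the scheme sketched in the last few paragraphs concrete, here is a toy Python version: one weight array per level, a single low "mark" value instead of a 16-bit 0000 separator, and the trailing run of minimum weights trimmed from each level. The weights and minimums below are invented for the example (they are not DUCET values), and no claim is made that this matches ICU's actual fractional-weight format; it only illustrates that a key built this way still compares correctly with plain bytewise comparison. Whether trimming trailing minimum weights is order-preserving for a full tailored table depends on the well-formedness conditions in UTS #10, so treat it as an illustration of the argument, not a proven optimization.

    MIN_WEIGHT = {1: 0x02, 2: 0x20, 3: 0x02}  # assumed per-level minimum weights
    LEVEL_MARK = 0x01                         # one byte, lower than every real weight

    def sort_key(levels):
        """levels: {level: [weights]} -> bytes; no 0000 separator anywhere."""
        key = bytearray()
        for level in sorted(levels):
            weights = list(levels[level])
            # Trim only the *trailing* run of minimum weights; a minimum weight
            # followed by a higher one (the 0020 before 0021 case) is kept.
            while weights and weights[-1] == MIN_WEIGHT[level]:
                weights.pop()
            if level > 1:
                key.append(LEVEL_MARK)
            key.extend(weights)
        return bytes(key)

    # "ab" is a primary-level prefix of "abX"; the level mark (0x01) is lower
    # than any primary weight, so the shorter string still sorts first.
    k_ab  = sort_key({1: [0x50, 0x52],       2: [0x20, 0x20],       3: [0x02, 0x02]})
    k_abX = sort_key({1: [0x50, 0x52, 0x54], 2: [0x20, 0x20, 0x20], 3: [0x02, 0x02, 0x02]})
    assert k_ab < k_abX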
>> >> The fourth level, started by the mark "000" can use the pattern "001" to >> encode the most frequent minimum quaternary weight, and patterns "010" to >> "011" followed by other bits to differentially encode the quaternary >> weights. Here again it is possible to create an encoding for quaternary >> weights that can use some run-length compression and can also be readable >> valid UTF-8! >> >> And so on. >> >> >> >> >> >> >> >> >> Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a >> ?crit : >> >>> So it should be clear in the UCA algorithm and in the DUCET datatable >>> that "0000" is NOT a valid weight >>> It is just a notational placeholder used as ".0000", only indicating in >>> the DUCET format that there's NO weight assigned at the indicated level, >>> because the collation element is ALWAYS ignorable at this level. >>> The DUCET could have as well used the notation ".none", or just dropped >>> every ".0000" in its file (provided it contains a data entry specifying >>> what is the minimum weight used for each level). This notation is only >>> intended to be read by humans editing the file, so they don't need to >>> wonder what is the level of the first indicated weight or remember what is >>> the minimum weight for that level. >>> But the DUCET table is actually generated by a machine and processed by >>> machines. >>> >>> >>> >>> Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a >>> ?crit : >>> >>>> In summary, this step given in the algorithm is completely unneeded and >>>> can be dropped completely: >>>> >>>> *S3.2 *If L is not 1, append a *level >>>> separator* >>>> >>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>> lower than any weight in the resulting sort key. This guarantees that when >>>> two strings of unequal length are compared, where the shorter string is a >>>> prefix of the longer string, the longer string is always sorted after the >>>> shorter?in the absence of special features like contractions. For example: >>>> "abc" < "abcX" where "X" can be any character(s). >>>> >>>> Remove any reference to the "level separator" from the UCA. You never >>>> need it. >>>> >>>> As well this paragraph >>>> >>>> 7.3 Form Sort Keys >>>> >>>> *Step 3.* Construct a sort key for each collation element array by >>>> successively appending all non-zero weights from the collation element >>>> array. Figure 2 gives an example of the application of this step to one >>>> collation element array. >>>> >>>> Figure 2. Collation Element Array to Sort Key >>>> >>>> Collation Element ArraySort Key >>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>> 0002 0002 0002 >>>> >>>> can be written with this figure: >>>> >>>> Figure 2. Collation Element Array to Sort Key >>>> >>>> Collation Element ArraySort Key >>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>> >>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>> safely removed if they are respectively the minimum secondary weight and >>>> minimum tertiary weight. >>>> But note that 0020 is kept in two places as they are followed by a >>>> higher weight 0021. This is general for any tailored collation (not just >>>> the DUCET). >>>> >>>> Le jeu. 1 nov. 2018 ? 
21:42, Philippe Verdy a >>>> ?crit : >>>> >>>>> The 0000 is there in the UCA only because the DUCET is published in a >>>>> format that uses it, but here also this format is useless: you never need >>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>> just needs to indicate what is the minimum weight assigned for every level >>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>> >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>>> ?crit : >>>>> >>>>>> There are lots of ways to implement the UCA. >>>>>> >>>>>> When you want fast string comparison, the zero weights are useful for >>>>>> processing -- and you don't actually assemble a sort key. >>>>>> >>>>>> People who want sort keys usually want them to be short, so you spend >>>>>> time on compression. You probably also build sort keys as byte vectors not >>>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>>>> collation data file remunges all weights into fractional byte sequences, >>>>>> and leaves gaps for tailoring. >>>>>> >>>>>> markus >>>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 09:23:39 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 2 Nov 2018 15:23:39 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: The table is the way it is because it is easier to process (and comprehend) when the first field is always the primary weight, second is always the secondary, etc. Go ahead and transform the input DUCET files as you see fit. The "should be removed" is your personal preference. Unless we hear strong demand otherwise from major implementers, people have better things to do than change their parsers to suit your preference. Mark On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote: > It's not just a question of "I like it or not". But the fact that the > standard makes the presence of 0000 required in some steps, and the > requirement is in fact wrong: this is in fact NEVER required to create an > equivalent collation order. these steps are completely unnecessary and > should be removed. > > Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a ?crit : > >> You may not like the format of the data, but you are not bound to it. If >> you don't like the data format (eg you want [.0021.0002] instead of >> [.0000.0021.0002]), you can transform it however you want as long as you >> get the same answer, as it says here: >> >> http://unicode.org/reports/tr10/#Conformance >> ?The Unicode Collation Algorithm is a logical specification. >> Implementations are free to change any part of the algorithm as long as any >> two strings compared by the implementation are ordered the same as they >> would be by the algorithm as specified. Implementations may also use a >> different format for the data in the Default Unicode Collation Element >> Table. The sort key is a logical intermediate object: if an implementation >> produces the same results in comparison of strings, the sort keys can >> differ in format from what is specified in this document. (See Section 9, >> Implementation Notes.)? >> >> >> That is what is done, for example, in ICU's implementation. 
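The conformance paragraph quoted just above lends itself to a small self-check. The Python sketch below (collation elements borrowed from the Figure 2 example cited elsewhere in this thread) builds keys in three formats: with the 0000 level separator exactly as in UTS #10, with an arbitrary low mark value instead, and with no separator at all. All three give the same order for these samples; the last variant works here only because every secondary weight in the table is smaller than every primary weight. It is a toy, not ICU's implementation.

    # Collation elements as (primary, secondary, tertiary); 0 = ignorable at that level.
    samples = {
        "s1": [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002)],
        "s2": [(0x0706, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002),
               (0x06D9, 0x0020, 0x0002)],
        "s3": [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002),
               (0x06EE, 0x0020, 0x0002)],
    }

    def key(ces, separator):
        """Append the non-zero weights level by level, putting `separator`
        between levels; pass None to omit the separator entirely."""
        out = []
        for level in range(3):
            if level and separator is not None:
                out.append(separator)
            out.extend(ce[level] for ce in ces if ce[level])
        return tuple(out)

    orders = [sorted(samples, key=lambda s: key(samples[s], sep))
              for sep in (0x0000, 0x0001, None)]
    assert orders[0] == orders[1] == orders[2]  # same order, as the text requires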
See >> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw >> collation elements" and "sort keys" to see the transformed collation >> elements (from the DUCET + CLDR) and the resulting sort keys. >> >> a =>[29,05,_05] => 29 , 05 , 05 . >> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . >> ? => >> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . >> ? => >> >> Mark >> >> >> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < >> unicode at unicode.org> wrote: >> >>> As well the step 2 of the algorithm speaks about a single "array" of >>> collation elements. Actually it's best to create one separate array per >>> level, and append weights for each level in the relevant array for that >>> level. >>> The steps S2.2 to S2.4 can do this, including for derived collation >>> elements in section 10.1, or variable weighting in section 4. >>> >>> This also means that for fast string compares, the primary weights can >>> be processed on the fly (without needing any buffering) is the primary >>> weights are different between the two strings (including when one or both >>> of the two strings ends, and the secondary weights or tertiary weights >>> detected until then have not found any weight higher than the minimum >>> weight value for each level). >>> Otherwise: >>> - the first secondary weight higher that the minimum secondary weght >>> value, and all subsequent secondary weights must be buffered in a >>> secondary buffer . >>> - the first tertiary weight higher that the minimum secondary weght >>> value, and all subsequent secondary weights must be buffered in a tertiary >>> buffer. >>> - and so on for higher levels (each buffer just needs to keep a counter, >>> when it's first used, indicating how many weights were not buffered while >>> processing and counting the primary weights, because all these weights were >>> all equal to the minimum value for the relevant level) >>> - these secondary/tertiary/etc. buffers will only be used once you reach >>> the end of the two strings when processing the primary level and no >>> difference was found: you'll start by comparing the initial counters in >>> these buffers and the buffer that has the largest counter value is >>> necessarily for the smaller compared string. If both counters are equal, >>> then you start comparing the weights stored in each buffer, until one of >>> the buffers ends before another (the shorter buffer is for the smaller >>> compared string). If both weight buffers reach the end, you use the next >>> pair of buffers built for the next level and process them with the same >>> algorithm. >>> >>> Nowhere you'll ever need to consider any [.0000] weight which is just a >>> notation in the format of the DUCET intended only to be readable by humans >>> but never needed in any machine implementation. >>> >>> Now if you want to create sort keys this is similar except that you >>> don"t have two strings to process and compare, all you want is to create >>> separate arrays of weights for each level: each level can be encoded >>> separately, the encoding must be made so that when you'll concatenate the >>> encoded arrays, the first few encoded *bits* in the secondary or tertiary >>> encodings cannot be larger or equal to the bits used by the encoding of the >>> primary weights (this only limits how you'll encode the 1st weight in each >>> array as its first encoding *bits* must be lower than the first bits used >>> to encode any weight in previous levels). 
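Here is a compressed Python sketch of the fast-compare idea described a little above: weights are consumed level by level straight from the collation element arrays, no sort key is materialized, and running out of weights is handled directly instead of through a 0000 separator. It deliberately simplifies the buffering scheme (it walks the arrays once per level rather than buffering lower levels during a single primary pass), so it shows the ordering logic only, not the single-pass optimization.

    from itertools import zip_longest

    def compare(ces1, ces2):
        """Compare two collation element arrays level by level.
        Each element is (primary, secondary, tertiary); 0 = ignorable."""
        for level in range(3):
            w1 = (ce[level] for ce in ces1 if ce[level])
            w2 = (ce[level] for ce in ces2 if ce[level])
            for a, b in zip_longest(w1, w2, fillvalue=-1):
                # fillvalue -1 plays the role the level separator plays in a
                # sort key: the side that ran out of weights sorts first.
                if a != b:
                    return -1 if a < b else 1
        return 0

    A = [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002)]
    B = A + [(0x06EE, 0x0020, 0x0002)]
    assert compare(A, B) == -1 and compare(B, A) == 1 and compare(A, A) == 0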
>>> >>> Nowhere you are required to encode weights exactly like their logical >>> weight, this encoding is fully reversible and can use any suitable >>> compression technics if needed. As long as you can safely detect when an >>> encoding ends, because it encounters some bits (with lower values) used to >>> start the encoding of one of the higher levels, the compression is safe. >>> >>> For each level, you can reserve only a single code used to "mark" the >>> start of another higher level followed by some bits to indicate which level >>> it is, then followed by the compressed code for the level made so that each >>> weight is encoded by a code not starting by the reserved mark. That >>> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >>> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >>> any control or SPACE or isolated surrogate) and codes used to encode each >>> weight must not start by a byte lower or equal to this mark. The binary or >>> ASCII code units used to encode each weight must just be comparable, so >>> that comparing codes is equivalent to compare weights represented by each >>> code. >>> >>> As well, you are not required to store multiple "marks". This is just >>> one of the possibilities to encode in the sort key which level is encoded >>> after each "mark", and the marks are not necessarily the same before each >>> level (their length may also vary depending on the level they are >>> starting): these marks may be completely removed from the final encoding if >>> the encoding/compression used allows discriminating the level used by all >>> weights, encoded in separate sets of values. >>> >>> Typical compression technics are for example differencial, notably in >>> secondary or higher levels, and run-legth encoded to skip sequences of >>> weights all equal to the minimum weight. >>> >>> The code units used by the weigh encoding for each level may also need >>> to avoid some forbidden values if needed (e.g. when encoding the weights to >>> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >>> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >>> would create a string not conforming to any standard UTF). >>> >>> Once again this means that the sequence of logical weight will can >>> sefely become a readable string, even suitable to be transmitted as >>> plain-text using any UTF, and that compression is also possible in that >>> case: you can create and store lot of sort keys even for very long texts >>> >>> However it is generally better to just encode sort keys only for a >>> reasonnably discriminant part of the text, e.g. no sort key longer than 255 >>> bytes (created from the start of the original texts): if you compare two >>> sort keys and find that they are equal, and if both sort keys have this >>> length of 255 bytes, then you'll compare the full original texts using the >>> fast-compare algorithm: you don't need to store full sort keys in addition >>> to the original texts. This can save lot of storage, provided that original >>> texts are sufficiently discriminated by their start, and that cases where >>> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >>> >>> For short texts however, truncated sortkeys may save time at the price >>> of a reasonnable storage cost (but sortkeys can be also encoded with >>> roughly the same size as the original text: compression is modest for the >>> encoded primary level. 
But compression is frequently very effective for >>> higher levels where their smaller weight also have less possible variations >>> of value, in a smaller set. >>> >>> Notably for the secondary level used to encode case differences, only 3 >>> bits are enough per weight, and you just need to reserve the 3-bit value >>> "000" as the "mark" for indicating the start of another higher level, while >>> encoding secondary weights as "001" to "111". >>> >>> (This means that primary levels have to be encoded so that none of their >>> encoded primary weights are starting with "000" marking the start of the >>> secondary level. So primary weights can be encoded in patterns starting by >>> "0001", "001", "01", or "1" and followed by other bits: this allows >>> encoding them as readable UTF-8 if these characters are all different at >>> primary level, excluding only the 16 first C0 controls which need to be >>> preprocessed into escape sequences using the first permitted C0 control as >>> an escape, and escaping that C0 control itself). >>> >>> The third level, started by the mark "00" and followed by the encoded >>> weights indicating this is a tertiary level and not an higher level, will >>> also be used to encode a small set of weights (in most locales, this is not >>> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >>> differential coding on 3-bits, you reserve "000" as the "mark" for the next >>> higher level, then use "001" to "111" to encode differencial weights, the >>> differencial weights being initially based on the minimum tertiary weight, >>> you'll use the bit pattern "001" to encode the most frequent minimum >>> tertiary weight, and patterns "01" to "11" plus additional bits to encode >>> other positive or negative differences of tertiary weights, or to use >>> run-length compression). Here also it is possible to map the patterns so >>> that the encoded secondary weight will be readable valid UTF-8. >>> >>> The fourth level, started by the mark "000" can use the pattern "001" to >>> encode the most frequent minimum quaternary weight, and patterns "010" to >>> "011" followed by other bits to differentially encode the quaternary >>> weights. Here again it is possible to create an encoding for quaternary >>> weights that can use some run-length compression and can also be readable >>> valid UTF-8! >>> >>> And so on. >>> >>> >>> >>> >>> >>> >>> >>> >>> Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a >>> ?crit : >>> >>>> So it should be clear in the UCA algorithm and in the DUCET datatable >>>> that "0000" is NOT a valid weight >>>> It is just a notational placeholder used as ".0000", only indicating in >>>> the DUCET format that there's NO weight assigned at the indicated level, >>>> because the collation element is ALWAYS ignorable at this level. >>>> The DUCET could have as well used the notation ".none", or just dropped >>>> every ".0000" in its file (provided it contains a data entry specifying >>>> what is the minimum weight used for each level). This notation is only >>>> intended to be read by humans editing the file, so they don't need to >>>> wonder what is the level of the first indicated weight or remember what is >>>> the minimum weight for that level. >>>> But the DUCET table is actually generated by a machine and processed by >>>> machines. >>>> >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 
21:57, Philippe Verdy a >>>> ?crit : >>>> >>>>> In summary, this step given in the algorithm is completely unneeded >>>>> and can be dropped completely: >>>>> >>>>> *S3.2 *If L is not 1, append >>>>> a *level separator* >>>>> >>>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>>> lower than any weight in the resulting sort key. This guarantees that when >>>>> two strings of unequal length are compared, where the shorter string is a >>>>> prefix of the longer string, the longer string is always sorted after the >>>>> shorter?in the absence of special features like contractions. For example: >>>>> "abc" < "abcX" where "X" can be any character(s). >>>>> >>>>> Remove any reference to the "level separator" from the UCA. You never >>>>> need it. >>>>> >>>>> As well this paragraph >>>>> >>>>> 7.3 Form Sort Keys >>>>> >>>>> *Step 3.* Construct a sort key for each collation element array by >>>>> successively appending all non-zero weights from the collation element >>>>> array. Figure 2 gives an example of the application of this step to one >>>>> collation element array. >>>>> >>>>> Figure 2. Collation Element Array to Sort Key >>>>> >>>>> Collation Element ArraySort Key >>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>>> 0002 0002 0002 >>>>> >>>>> can be written with this figure: >>>>> >>>>> Figure 2. Collation Element Array to Sort Key >>>>> >>>>> Collation Element ArraySort Key >>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>>> >>>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>>> safely removed if they are respectively the minimum secondary weight and >>>>> minimum tertiary weight. >>>>> But note that 0020 is kept in two places as they are followed by a >>>>> higher weight 0021. This is general for any tailored collation (not just >>>>> the DUCET). >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>>>> ?crit : >>>>> >>>>>> The 0000 is there in the UCA only because the DUCET is published in a >>>>>> format that uses it, but here also this format is useless: you never need >>>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>>> just needs to indicate what is the minimum weight assigned for every level >>>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>>> >>>>>> >>>>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>>>> ?crit : >>>>>> >>>>>>> There are lots of ways to implement the UCA. >>>>>>> >>>>>>> When you want fast string comparison, the zero weights are useful >>>>>>> for processing -- and you don't actually assemble a sort key. >>>>>>> >>>>>>> People who want sort keys usually want them to be short, so you >>>>>>> spend time on compression. You probably also build sort keys as byte >>>>>>> vectors not uint16 vectors (because byte vectors fit into more APIs and >>>>>>> tend to be shorter), like ICU does using the CLDR collation data file. The >>>>>>> CLDR root collation data file remunges all weights into fractional byte >>>>>>> sequences, and leaves gaps for tailoring. >>>>>>> >>>>>>> markus >>>>>>> >>>>>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Nov 2 09:39:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 14:39:49 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181102143949.4165d666@JRWUBU2> On Fri, 2 Nov 2018 14:54:19 +0100 Philippe Verdy via Unicode wrote: > It's not just a question of "I like it or not". But the fact that the > standard makes the presence of 0000 required in some steps, and the > requirement is in fact wrong: this is in fact NEVER required to > create an equivalent collation order. these steps are completely > unnecessary and should be removed. > > Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a > ?crit : > > > You may not like the format of the data, but you are not bound to > > it. If you don't like the data format (eg you want [.0021.0002] > > instead of [.0000.0021.0002]), you can transform it however you > > want as long as you get the same answer, as it says here: > > > > http://unicode.org/reports/tr10/#Conformance > > ?The Unicode Collation Algorithm is a logical specification. > > Implementations are free to change any part of the algorithm as > > long as any two strings compared by the implementation are ordered > > the same as they would be by the algorithm as specified. > > Implementations may also use a different format for the data in the > > Default Unicode Collation Element Table. The sort key is a logical > > intermediate object: if an implementation produces the same results > > in comparison of strings, the sort keys can differ in format from > > what is specified in this document. (See Section 9, Implementation > > Notes.)? Given the above paragraph, how does the standard force you to use a special 0000? Perhaps the wording of the standard can be changed to prevent your unhappy interpretation. > > That is what is done, for example, in ICU's implementation. See > > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw > > collation elements" and "sort keys" to see the transformed collation > > elements (from the DUCET + CLDR) and the resulting sort keys. > > > > a =>[29,05,_05] => 29 , 05 , 05 . > > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . > > ? => > > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . > > ? => As you can see, Mark does not come to the same conclusion as you, and nor do I. Richard. From unicode at unicode.org Fri Nov 2 10:04:13 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Fri, 2 Nov 2018 16:04:13 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: <20181102150413.r2mdgoulkoe46trq@angband.pl> On Fri, Nov 02, 2018 at 01:44:25PM +0000, Michael Everson via Unicode wrote: > I write my 7?s and Z?s with a horizontal line through them. ? is encoded > not for this purpose, but because Z and ? are distinct in orthographies > for varieties of Tatar, Chechen, Karelian, and Mongolian. This is a > contemporary writing convention but it does not argue for a new SEVEN WITH > STROKE character or that I should use ? rather than Z when I write > *?an?ibar. And that use conflicts with ? ? being an allograph of Polish ? ?, used especially when marks above cap height are unwanted or when readability is important (?? is too similar to ??). It also happened to be nicely renderable with Z^H- z^H- vs Z^H' z^H' on printers which had backspace. 
I unsuccessfully argued for such a variant on a "historical terminals" font: https://github.com/rbanffy/3270font/issues/19 But in either case the difference is purely visual rather than semantic. The latter still applies to _some_ uses of superscript, but not to the mgr. Meow! -- ??????? Have you heard of the Amber Road? For thousands of years, the ??????? Romans and co valued amber, hauled through the Europe over the ??????? mountains and along the Vistula, from Gda?sk. To where it came ??????? together with silk (judging by today's amber stalls). From unicode at unicode.org Fri Nov 2 10:10:21 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 16:10:21 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> Message-ID: <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> On 01/11/2018 16:43, Asmus Freytag via Unicode wrote: [quoted mail] > I don't think it's a joke to recognize that there is a continuum here and that > there is no line that can be drawn which is based on straightforward principles. [?] > In this case, there is no such framework that could help establish pragmatic > boundaries dividing the truly useful from the merely fanciful. I think the red line was always between the positive and the negative answer to the question whether a given graphic is relevant for legibility/readability of the plain text backbone. But humans can be trained to mentally disambiguate a mass of confusables, so the line vanishes and the continuum remains intact. On 02/11/2018 06:22, Asmus Freytag via Unicode wrote: > On 11/1/2018 7:59 PM, James Kass via Unicode wrote: >> >> Alphabetic script users write things the way they are spelled and spell things >> the way they are written. The abbreviation in question as written consists of >> three recognizable symbols. An "M", a superscript "r", and an equal sign >> (= two lines). It can be printed, handwritten, or in fraktur; it will still >> consist of those same three recognizable symbols. >> >> We're supposed to be preserving the past, not editing it or revising it. >> > Alphabetic script users' handwriting does not match print in all features. > Traditional German handwriting used a line like a macron over the letter 'u' > to distinguish it from 'n'. Rendering this with a u-macron in print would be > the height of absurdity. > > I feel similarly about the assertion that the "two lines" are something that > needs to be encoded, but only an expert would know for sure. Indeed it would be relevant to know whether it is mandatory in Polish, and I?m not an expert. But looking at several scripts using abbreviation indicators as superscript, i.e. Latin and Cyrillic (when using the Latin-script-written abbreviation of "Numero", given Cyrillic for "N" is "?", so it?s strictly speaking one single script, and two scripts using it), then we can easily see how single and double underlines are added or not depending on font design and on customary writing and display. E.g. 
the Romance feminine and masculine ordinal indicators have one or zero underlines, to such an extent that French typography specifies that the masculine ordinal indicator, despite being a superscript small o, is unfit to compose the French "numéro" abbreviation, which must not have an underline. Hence DEGREE SIGN is less bad than U+00BA. If applying the same to Polish, "Magister" is "Mʳ" and is straightforward to input when using a new French keyboard layout or an enhanced variant of any national Latin one having small superscripts on the Shift+Num level, or via a “superscript” dead key, mapped e.g. on Shift + AltGr/Option + E or any of the 26 letter keys as mnemonically convenient ("superscript" translates to French "exposant"); or “Compose” “^” [e] (where the ASCII circumflex or caret is repurposed for superscript compose sequences, while “circumflex accent” is active *after* LESS-THAN SIGN, consistently with the *new* convention for “inverted breve” using LEFT PARENTHESIS rather than "g)". These details are posted in this thread on this List rather than CLDR-USERS in order to make clear that typing superscript letters directly via the keyboard is easy, and therefore to propose it is not to harass the end-user. On 02/11/2018 13:09, Asmus Freytag via Unicode wrote: [quoted mail] […] > To transcribe the postcard would mean selecting the characters appropriate > for the printed equivalent of the text. As already suggested, selecting the variants can be done using variation selectors, provided the Standard has defined the intended use case. > > If the printed form had a standard way of superscripting letters with a > decoration below when used for abbreviations, As already pointed out, Latin script does not benefit from a consensus to use underline for superscript. E.g. Italian, Portuguese and Spanish do use underline for superscript, English and French do not. > then, and only then would we start discussing whether this decoration > needs to be encoded, or whether it is something a font can supply as part > of rendering the (sequence of) superscripted letters. I think the problem is not completely outlined, as long as the use of variation sequences is not mentioned. There is no "all" or "nothing" dilemma, given Unicode has the means of providing a standard way of representing calligraphic variations using variation selectors. E.g. the letter ENG is preferred in big lowercase form when writing Bambara, while other locales may like it in hooked uppercase. The Bambara Arial font allows to make sure it is the right glyph, and Arial in general follows the Bambara preference, but other fonts do not, while some of them have the Bambara-fit glyph inside but don't display it unless urged by an OpenType-supporting renderer, and appropriate settings turned on, e.g. on a locale identifier basis. > (Perhaps with the aid of markup identifying the sequence as abbreviation). That seems to me a regression, after the front has moved in favor of recognizing Latin script needs preformatted superscript. The use case is clear, as we have ª, º, and n° with degree sign, and so on as already detailed in long e-mails in this thread and elsewhere. There is no point in setting up or maintaining a Unicode policy stating otherwise, as such a policy would be inconsistent with long-lasting and extremely widespread practice. The main thing to fix is the font stack of user agents, that is finally everyone's computer. Alternatively web sites may wish to use web fonts.
In order to have superscripts displayed in a professional and civilized way, with no ransome note effect. In aUnicode conformant way, to say it shortly. > > All else is just applying visual hacks to simulate a specific appearance, > at the possible cost of obscuring the contents. As already pointed out, the hack here is to use a higher level protocol to simulate the effect of abbreviation indicator superscript. Using the latter is not ?obscuring?, but _clarifying_ ?the contents.? But I agree that adding combining diacritics to get the related underlines may obscure the content if unsupported (displaying as .notdef box). The concern about machine readability of the content is addressed by setting up equivalence classes and using DUCET discussed in the parallel thread. Best regards, Marcel From unicode at unicode.org Fri Nov 2 10:38:45 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 02 Nov 2018 08:38:45 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Do we have any other evidence of this usage, besides a single handwritten postcard? -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Nov 2 10:42:52 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 02 Nov 2018 08:42:52 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181102084252.665a7a7059d7ee80bb4d670165c8327d.5aa2c4d5b0.wbe@email03.godaddy.com> Michael Everson wrote: > I write my 7?s and Z?s with a horizontal line through them. ? is > encoded not for this purpose, but because Z and ? are distinct in > orthographies for varieties of Tatar, Chechen, Karelian, and > Mongolian. This is a contemporary writing convention but it does not > argue for a new SEVEN WITH STROKE character or that I should use ? > rather than Z when I write *?an?ibar. http://www.unicode.org/L2/L2018/18323-open-four.pdf -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Nov 2 11:20:00 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 17:20:00 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> (Asmus Freytag via Unicode's message of "Fri, 2 Nov 2018 05:09:51 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> Message-ID: <86ftwjpi33.fsf@mimuw.edu.pl> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: [...] > To transcribe the postcard would mean selecting the characters > appropriate for the printed equivalent of the text. You seem to make implicit assumptions which are not necessarily true. For me to transcribe the postcard would mean to answer the needs of the intended transcription users. > If the printed form had a standard way of superscripting letters with > a decoration below when used for abbreviations, then, and only then > would we start discussing whether this decoration needs to be encoded, > or whether it is something a font can supply as part of rendering the > (sequence of) superscripted letters. 
(Perhaps with the aid of markup > identifying the sequence as abbreviation). As I wrote already some time ago on the list, the alternative "encoding or using a specialized font" is wrong. These days texts are encoding for processing (in particular searching), rendering is just a kind of side-effect. On the other hand, whom do you mean by "we" and what do you mean by "encoding"? If I guess correctly what do you mean by these words then you are discussing an issue which was never raised by anybody (if I'm wrong, please quote the relevant post). Again is not clear for me whom you want to convince or inform. > All else is just applying visual hacks I don't mind hacks if they are useful and serve the intended purpose, even if they are visual :-) > to simulate a specific appearance, As I said above, the appearance is not necessarily of primary importance. > at the possible cost of obscuring the contents. It's for the users of the transcription to decide what is obscuring the text and what, to the contrary, makes the transcription more readable and useful. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 11:37:21 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 17:37:21 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: <55047cad-d1de-707a-70b7-fdf8fb17bbc3@orange.fr> On 31/10/2018 at 19:34, Asmus Freytag via Unicode wrote: > > On 10/31/2018 10:32 AM, Janusz S. Bie? via Unicode wrote: > > > > Let me remind what plain text is according to the Unicode glossary: > > > > Computer-encoded text that consists only of a sequence of code > > points from a given standard, with no other formatting or structural > > information. > > > > If you try to use this definition to decide what is and what is not a > > character, you get vicious circle. > > > > As mentioned already by others, there is no other generally accepted > > definition of plain text. Being among those who argued that the ?plain text? concept cannot?and therefore mustn?t?be used per se to disallow the use of a more or less restricted or extended set of characters in what is called ?ordinary text?, I?m ending up adding the following in case it might be of interest: > > This definition becomes tautological only when you try to invoke it in making > encoding decisions, that is, if you couple it with the statement that only > "elements of plain text" are ever encoded. I don?t think that Janusz S. Bie??s concern is about this definition being ?tautological?. AFAICS the Unicode definition of ?plain text? is quoted to back the assumption that it?s hard to use that concept to argue against the use of a given Unicode character in a given context, or to use it to kill a proposal for characters significant in natural languages. The reasoning is that the call not to use character X in plain text, while X is a legal Unicode character whose use is not discouraged for technical reasons, is like if ?ordinary people? (scarequoted derivative from ?ordinary text?) were told that X is not a Unicode character. That discourse is a ?vicious circle? in that there is no limit to it until Latin script is pulled down to plain ASCII. 
As already well known, diacritics are handled by the rendering system and don?t need to be displayed as such in the plain text backbone. I don?t believe that the same applies to other scripts, but these are often not considered when the encoding of Latin preformatted letters is fought, given superscripting seems to be proper to Latin, and originated from longlasting medieval practice and writing conventions. > > For that purpose, you need a number of other definitions of "plain text". > Including the definition that plain text is the "backbone" to which you apply > formatting and layout information. I personally believer that there are more > 2D notations where it's quite obvious to me that what is "placed" is a text > element. More like maps and music and less like a circuit diagram, where the > elements are less text like (I deliberately include symbols in the definition > of text, but not any random graphical line art). All two-dimensional notations here (outside the parenthetical) use higher-level protocols; maps and diagrams are often vector graphics. But Unicode strived to encode all needed plain text elements, such as symbols for maritime and wheather maps. Even arrows of many possible shapes, including 3D-looking ones, have been encoded. While freehand (rather than ?any random?) graphical art is out of scope, we have a lot of box drawing, used with appropriate fonts to draw e.g. layouts of keyboards above the relevant source code in plain text files (examples in XKB). As a sidenote: Box drawing while useful is unduly neglected on font level, even in the Code Charts where the advance width, usually half an em, is inconsistent between different sorts of elements belonging to the same block. > > Another definition of plain text is that which contains the "readable content" > of the text. As already discussed on this List, many documents in PDF have hard-to-read plain text backbones, even misleading Google Search, for the purpose of handling special glyphs (and, in some era, even special characters). > As we've discussed here, this definition has edge cases; some > content is traditionally left to styling. Many pre-Unicode traditions are found out there, that stay in use, partly for technical reasons (mainly by lack of updated keyboard layouts), partly for consistency with accustomed ways of doing. Being traditionally-left-to-styling is the more unconvincing. Even a letter that got to become LATIN SMALL LETTER O E (Unicode 1.0) was composed on typewriters using the half-backspace, and should be _left to styling_ when it was pulled out of the draft ISO/IEC 8859-1 by the fault of a Frenchman (name undisclosed for privacy). And we?ve been told on this List that the tradition using styling (a special font) to display the additional Latin letters used to write Bambara survived. > Example: some of the small words in > some Scandinavian languages are routinely italicized to disambiguate their > reading. Other languages use titlecase to achieve the same disambiguation. E.g. French titlecases the noun "Une" which means the "cover", not the undefined article, and German did the same when "Ein(e)" is a numeral, but today, other means, including italics, are more common. > Other languages use accents for this purpose - sometimes without > recognizing either the accented letter as part of the alphabet, or the accented > form as a dictionary entry. Talking about Dutch stressing acute, discussed earlier on this List. 
> Which nicely shows, that this level disambiguation > is intuitively viewed as less orthographic, something that applies to the cases > where italics are used for the same purpose. Another Unicode-conformant means of noting stress would be adding an emoji. :-| If stress is close to emotion, it could be represented in a similar way. Strictly speaking, that is off-topic in this thread, that is about representing abbreviations in a legible rather than merely decipherable way. In plain text. If stress is not represented, you still can read the sentence without stumbling. That is not always true when abbreviations are not superscripted. I remember an ASCII-only environment localized in French, where "no centre mess" is "num?ro du centre de messagerie", "dial number of the message platform". Being unfamiliar, I did stumble prior to completing and understanding the meaning: "n? centre mess." > > In some contexts (Western Math) the scope of readable content is different than > that of ordinary text. Therefore, this definition of "plain text" isn't universal. > In principle, you could argue that your definition of readable content should apply; > however, as a standard, Unicode will insist on limiting the encoding to text elements > required by some common, widely shared and reasonably agreed-upon definition of > plain text -- corresponding to a particular division between text elements and styling. > So far, we have ordinary text, math and phonetics, Thanks for clarification. Nevertheless, that partition of roles has something arbitrary as long as abbreviation indicators are excluded from the scope of ordinary text. That is, that policy is applied and promoted without being well designed. It implodes from the beginning on, given the feminine and masculine ordinal indicators pre-dated Unicode and are a living proof of the importance of preformatted superscripts. Instead of drawing the borderline between usages only, Unicode draw it between natural languages, stating that Italian, Portuguese and Spanish are entitled to use superscript ordinal indicators, whereas on the other hand, English and French are not. In the same vein, Italian, Portuguese and Spanish are granted the right of composing titles and some other abbreviations using preformatted superscript letters, as long as the set doesn?t exceed a and o, but other languages are not when using other or more letters, or when not being accustomed to underlining as an additional abbreviation indicator. Fortunately that is no longer true, so the point is actually to redact the relevant paragraphs in TUS, already for consistency with CLDR. Contributions are hopefully welcome. Best regards, Marcel From unicode at unicode.org Fri Nov 2 11:45:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 17:45:58 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> Message-ID: Le ven. 2 nov. 2018 ? 
16:20, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > That seems to me a regression, after the front has moved in favor of > recognizing Latin script needs preformatted superscript. The use case is > clear, as we have ?, ?, and n? with degree sign, and so on as already > detailed in long e-mails in this thread and elsewhere. There is no point > in setting up or maintaining a Unicode policy stating otherwise, as such > a policy would be inconsistent with longlasting and extremely widespread > practice. > Using variation selectors is only appropriate for these existing (preencoded) superscript letters ? and ? so that they display the appropriate (underlined or not underlined) glyph. It is not a solution for creating superscripts on any letters and mark that it should be rendered as superscript (notably, the base letter to transform into superscript may also have its own combining diacritics, that must be encoded explicitly, and if you use the varaition selector, it should allow variation on the presence or absence of the underline (which must then be encoded explicitly as a combining character. So finally what we get with variation selectors is: and which is NOT canonically equivalent. Using a combining character avoids this caveat: and which ARE canonically equivalent. And this explicitly states the semantic (something that is lost if we are forced to use presentational superscripts in a higher level protocol like HTML/CSS for rich text format, and one just extracts the plain text; using collation will not help at all, except if collators are built with preprocessing that will first infer the presence of a to insert after each combining sequence of the plain-text enclosed in a italic style). There's little risk: if the is not mapped in fonts (or not recognized by text renderers to create synthetic superscript scripts from existing recognized clusters), it will render as a visible .notdef (tofu). But normally text renderers recognize the basic properties of characters in the UCD and can see that has a combining mark general property (it also knows that it has a 0 combinjing class, so canonical equivalences are not broken) to render a better symbols than the .notdef "tofu": it should better render a dotted circle. Even if this tofu or dotted circle is rendered, it still explicitly marks the presence of the abbreviation mark, so there's less confusion about what is preceding it (the combining sequence that was supposed to be superscripted). The can also have its own to select other styles when they are optional, such as adding underlines to the superscripted letter, or rendering the letter instead as underscript, or as a small baseline letter with a dot after it: this is still an explicit abbreviation mark, and the meaning of the plein text is still preserved: the variation selector is only suitable to alter the rendering of a cluster when it has effectively several variants and the default rendering is not universal, notably across font styles initially designed for specific markets with their own local preferences: the variation selector still allows the same fonts to map all known variants distinctly, independantly of the initial arbitrary choice of the default glyph used when the variation selector is missing). Even if fonts (or text renderers may map the to variable glyphs, this is purely stylictic, the semantic of the plain text is not lost because the is still there. 
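The canonical-equivalence point above can at least be checked with existing combining marks. The proposed abbreviation mark is not an encoded character, so the Python lines below only demonstrate the general mechanism it would rely on: two combining marks with different non-zero combining classes (acute above, ccc 230; double low line below, ccc 220) may be stored in either order and still normalize to the same sequence. A mark with combining class 0 would not reorder this way.

    import unicodedata as ud

    seq1 = "r\u0301\u0333"  # r + COMBINING ACUTE ACCENT + COMBINING DOUBLE LOW LINE
    seq2 = "r\u0333\u0301"  # the same marks in the opposite storage order

    assert ud.normalize("NFD", seq1) == ud.normalize("NFD", seq2)  # canonically equivalent
    print(ud.combining("\u0301"), ud.combining("\u0333"))          # 230 220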
There's no need of any rich-text to encode it (the rich -text styles are not explicitly encoding that a superscript is actually an abbreviation mark, so it cannot also allow variation like rendering an underscript, or a baseline small glyph with an added dot. Typically a used in an English style would render the letter (or cluster) before it as a "small" letter without any added dot. So I really think that is far better than: * using preencoded superscript letters (they don't map all the necessary repertoire of clusters where the abbreviation is needed, it now just covers Basic Latin, ten digits, plus and minus signs, and the dot or comma, plus a few other letters like stops; it's impossible to rencode the full Unicode repertoire and its allowed combining sequences or extended default grapheme clusters!), * or using variation selectors to make them appear as a superscript (does not work with all clusters containing other diacritics like accents), * or using rich-text styling (from which you cannot safely infer any semantic (there no warranty that Mr in HTML is actually an abbreviation of "Mister"; in HTML this is encoded elsewhere as Mr or Mr (the semantic of the abbreviation has to be looked a possible container element and the meaning of the abbreviation is to look inside its title attribute, so obviously this requires complex preprocessing before we can infer a plaintext version (suitable for example in plain-text searches where you don't want to match a mathematical object M, like a matrix, elevated to the power r, or a single plaintext M followed by a footnote call noted by the letter "r"). It solves all practical problems: legacy encoding using the preencoded superscript Latin letters (aka "modifier letters") should have never been used or needed (not even for IPA usage which could have used an explicit for its superscripted symbols, or for its distinctive "a" and "g"). We should not have needed to encode the variants for "a" and "g": these were old hacks that broke the Unicode character encoding model since the beginning. However only roundtrip compatibility with legacy non UCS charsets milited only for keeping the ordinal feminine or ordinal masculine mark, or the "Numero" cluster (actually made of two letters, the second one followed by an implicit abbreviation mark, but transformed in the legacy charset to be treated as a single unbreakable cluster containing only one symbol; even Unicode considers the abbreviated Numero as being only "compatibility equivalent" to the letter N followed by the masculine ordinal symbol, the latter being also only "compatibility equivalent" to a letter o with an implicit superscript, but also with an optional combining underline). All these superscripts in Unicode (as well as Mathematical "styled" letters, which were also completely unnecessary and will necessarily be incomplete for the intended usage) are now to be treated only as legacy practices, they should be deprecated in favor of the more semantic and logical character encoding model, deprecating complelely the legacy visual encoding. Only precombined characters, recognized by canonical equivalences are part of the standard and may be kept as "non"-legacy: they still fit in the logical encoding. 
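On the searchability point: the existing preformatted superscript letter has only a compatibility decomposition, not a canonical one, so whether a search treats "M" + U+02B3 and plain "Mr" as the same string depends on whether it applies compatibility (NFKC/NFKD) folding. The Python lines below show the current character properties; they say nothing about the proposed combining abbreviation mark, which does not exist in the standard today.

    import unicodedata as ud

    abbr = "M\u02B3"                           # M + MODIFIER LETTER SMALL R
    print(ud.normalize("NFD",  abbr) == abbr)  # True: no canonical decomposition
    print(ud.normalize("NFKD", abbr))          # 'Mr': compatibility mapping only
    print(ud.decomposition("\u02B3"))          # '<super> 0072'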
Likewise, the extended default grapheme clusters include the precomposed Hangul LVT and LV syllables, CGJ used before combining marks with a non-zero combining class, and variation selectors used only after base letters with combining class zero that start the extended default grapheme clusters.

Let's return to the root of the far better logical encoding, which remains the recommended practice. All the rest is legacy: some of it came from decisions taken to preserve roundtrip compatibility with legacy charsets (including prepended letters in Thai), so we have a few compatibility characters (which are not the recommended practice), but the rest was bad decisions made by Unicode and the ISO WG that break the logical character encoding model.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Fri Nov  2 11:46:42 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 2 Nov 2018 17:46:42 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86ftwjpi33.fsf@mimuw.edu.pl>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
 <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl>
 <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <86ftwjpi33.fsf@mimuw.edu.pl>
Message-ID: <72e3fabb-6b01-2b77-16c8-56e049ab2707@orange.fr>

On 02/11/2018 17:20, Janusz S. Bień via Unicode wrote:
> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote:
>
> [...]
>
>> To transcribe the postcard would mean selecting the characters
>> appropriate for the printed equivalent of the text.
>
> You seem to make implicit assumptions which are not necessarily
> true. For me to transcribe the postcard would mean to answer the needs
> of the intended transcription users.
>
>> If the printed form had a standard way of superscripting letters with
>> a decoration below when used for abbreviations, then, and only then,
>> would we start discussing whether this decoration needs to be encoded,
>> or whether it is something a font can supply as part of rendering the
>> (sequence of) superscripted letters. (Perhaps with the aid of markup
>> identifying the sequence as an abbreviation.)
>
> As I wrote already some time ago on the list, the alternative "encoding
> or using a specialized font" is wrong. These days texts are encoded for
> processing (in particular searching); rendering is just a kind of
> side-effect.

Indeed, not using MODIFIER LETTER SMALL R to encode the r in "Mʳ" would make it harder to retrieve the "Magister" abbreviation in a database. E.g., Bing Search, having less extended equivalence classes when I tested it on mathematical preformatted letters, was able to retrieve them precisely. Perhaps that is still the case.

Best regards,

Marcel

From unicode at unicode.org  Fri Nov  2 12:02:05 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 2 Nov 2018 18:02:05 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To: 
References: 
Message-ID: 

I was replying not about the notational representation of the DUCET data table (using [.0000...] unnecessarily), but about the text of UTR#10 itself.
That text remains highly confusing, contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000".

UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight"; it is a notation only (but the notation is used for TWO distinct purposes: one is to present the notation format used in the DUCET itself to show how collation elements are structured, the other is to mark the presence of a possible, but not always required, encoding of an explicit level separator when building sort keys). UTR#10 is still needlessly confusing.

Even the example tables can be made without using these "0000" (for example, tables showing how to build sort keys can present the list of weights split into separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a single buffer containing all weight values in a row; separate buffers for each level are far superior (and even more efficient, as this can save space in memory).

The step "S3.2" in the UCA algorithm should not even be there (it is written in favor of a specific implementation which is not even efficient or optimal); it complicates the algorithm with absolutely no benefit at all. You can ALWAYS remove it completely and still generate equivalent results.

On Fri, 2 Nov 2018 at 15:23, Mark Davis wrote:

> The table is the way it is because it is easier to process (and
> comprehend) when the first field is always the primary weight, second is
> always the secondary, etc.
>
> Go ahead and transform the input DUCET files as you see fit. The "should
> be removed" is your personal preference. Unless we hear strong demand
> otherwise from major implementers, people have better things to do than
> change their parsers to suit your preference.
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote:
>
>> It's not just a question of "I like it or not", but the fact that the
>> standard makes the presence of 0000 required in some steps, and the
>> requirement is in fact wrong: this is in fact NEVER required to create an
>> equivalent collation order. These steps are completely unnecessary and
>> should be removed.
>>
>> On Fri, 2 Nov 2018 at 14:03, Mark Davis wrote:
>>
>>> You may not like the format of the data, but you are not bound to it. If
>>> you don't like the data format (e.g. you want [.0021.0002] instead of
>>> [.0000.0021.0002]), you can transform it however you want as long as you
>>> get the same answer, as it says here:
>>>
>>> http://unicode.org/reports/tr10/#Conformance
>>> "The Unicode Collation Algorithm is a logical specification.
>>> Implementations are free to change any part of the algorithm as long as any
>>> two strings compared by the implementation are ordered the same as they
>>> would be by the algorithm as specified. Implementations may also use a
>>> different format for the data in the Default Unicode Collation Element
>>> Table. The sort key is a logical intermediate object: if an implementation
>>> produces the same results in comparison of strings, the sort keys can
>>> differ in format from what is specified in this document. (See Section 9,
>>> Implementation Notes.)"
>>>
>>>
>>> That is what is done, for example, in ICU's implementation.
See >>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw >>> collation elements" and "sort keys" to see the transformed collation >>> elements (from the DUCET + CLDR) and the resulting sort keys. >>> >>> a =>[29,05,_05] => 29 , 05 , 05 . >>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . >>> ? => >>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . >>> ? => >>> >>> Mark >>> >>> >>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> As well the step 2 of the algorithm speaks about a single "array" of >>>> collation elements. Actually it's best to create one separate array per >>>> level, and append weights for each level in the relevant array for that >>>> level. >>>> The steps S2.2 to S2.4 can do this, including for derived collation >>>> elements in section 10.1, or variable weighting in section 4. >>>> >>>> This also means that for fast string compares, the primary weights can >>>> be processed on the fly (without needing any buffering) is the primary >>>> weights are different between the two strings (including when one or both >>>> of the two strings ends, and the secondary weights or tertiary weights >>>> detected until then have not found any weight higher than the minimum >>>> weight value for each level). >>>> Otherwise: >>>> - the first secondary weight higher that the minimum secondary weght >>>> value, and all subsequent secondary weights must be buffered in a >>>> secondary buffer . >>>> - the first tertiary weight higher that the minimum secondary weght >>>> value, and all subsequent secondary weights must be buffered in a tertiary >>>> buffer. >>>> - and so on for higher levels (each buffer just needs to keep a >>>> counter, when it's first used, indicating how many weights were not >>>> buffered while processing and counting the primary weights, because all >>>> these weights were all equal to the minimum value for the relevant level) >>>> - these secondary/tertiary/etc. buffers will only be used once you >>>> reach the end of the two strings when processing the primary level and no >>>> difference was found: you'll start by comparing the initial counters in >>>> these buffers and the buffer that has the largest counter value is >>>> necessarily for the smaller compared string. If both counters are equal, >>>> then you start comparing the weights stored in each buffer, until one of >>>> the buffers ends before another (the shorter buffer is for the smaller >>>> compared string). If both weight buffers reach the end, you use the next >>>> pair of buffers built for the next level and process them with the same >>>> algorithm. >>>> >>>> Nowhere you'll ever need to consider any [.0000] weight which is just a >>>> notation in the format of the DUCET intended only to be readable by humans >>>> but never needed in any machine implementation. 
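A deliberately simplified sketch of the level-by-level comparison idea in the message quoted above. It is not ICU and not the DUCET: collation elements are modelled as already-looked-up (primary, secondary, tertiary) tuples with invented weights, and the incremental buffering and counters described above are omitted; ignorable weights are simply skipped, so no 0000 placeholder is ever needed.

    def level_weights(ces, level):
        # All non-zero weights of one level, in order; ignorable weights are
        # dropped rather than represented by a 0000 placeholder.
        return [ce[level] for ce in ces if ce[level] != 0]

    def compare(ces_a, ces_b):
        # Compare two sequences of (primary, secondary, tertiary) collation
        # elements level by level; lists of ints compare lexicographically.
        for level in (0, 1, 2):
            wa, wb = level_weights(ces_a, level), level_weights(ces_b, level)
            if wa != wb:
                return -1 if wa < wb else 1
        return 0

    # Invented weights: a bare letter and the same letter with an accent tie
    # on the primary level and are decided on the secondary level.
    plain    = [(0x29, 0x05, 0x05)]
    accented = [(0x29, 0x05, 0x05), (0x00, 0x45, 0x05)]  # accent is primary-ignorable
    print(compare(plain, accented))   # -1: the accented form sorts after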
>>>> >>>> Now if you want to create sort keys this is similar except that you >>>> don"t have two strings to process and compare, all you want is to create >>>> separate arrays of weights for each level: each level can be encoded >>>> separately, the encoding must be made so that when you'll concatenate the >>>> encoded arrays, the first few encoded *bits* in the secondary or tertiary >>>> encodings cannot be larger or equal to the bits used by the encoding of the >>>> primary weights (this only limits how you'll encode the 1st weight in each >>>> array as its first encoding *bits* must be lower than the first bits used >>>> to encode any weight in previous levels). >>>> >>>> Nowhere you are required to encode weights exactly like their logical >>>> weight, this encoding is fully reversible and can use any suitable >>>> compression technics if needed. As long as you can safely detect when an >>>> encoding ends, because it encounters some bits (with lower values) used to >>>> start the encoding of one of the higher levels, the compression is safe. >>>> >>>> For each level, you can reserve only a single code used to "mark" the >>>> start of another higher level followed by some bits to indicate which level >>>> it is, then followed by the compressed code for the level made so that each >>>> weight is encoded by a code not starting by the reserved mark. That >>>> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >>>> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >>>> any control or SPACE or isolated surrogate) and codes used to encode each >>>> weight must not start by a byte lower or equal to this mark. The binary or >>>> ASCII code units used to encode each weight must just be comparable, so >>>> that comparing codes is equivalent to compare weights represented by each >>>> code. >>>> >>>> As well, you are not required to store multiple "marks". This is just >>>> one of the possibilities to encode in the sort key which level is encoded >>>> after each "mark", and the marks are not necessarily the same before each >>>> level (their length may also vary depending on the level they are >>>> starting): these marks may be completely removed from the final encoding if >>>> the encoding/compression used allows discriminating the level used by all >>>> weights, encoded in separate sets of values. >>>> >>>> Typical compression technics are for example differencial, notably in >>>> secondary or higher levels, and run-legth encoded to skip sequences of >>>> weights all equal to the minimum weight. >>>> >>>> The code units used by the weigh encoding for each level may also need >>>> to avoid some forbidden values if needed (e.g. when encoding the weights to >>>> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >>>> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >>>> would create a string not conforming to any standard UTF). >>>> >>>> Once again this means that the sequence of logical weight will can >>>> sefely become a readable string, even suitable to be transmitted as >>>> plain-text using any UTF, and that compression is also possible in that >>>> case: you can create and store lot of sort keys even for very long texts >>>> >>>> However it is generally better to just encode sort keys only for a >>>> reasonnably discriminant part of the text, e.g. 
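In the same spirit, a toy sketch of building a sort key as one tuple of weights per level rather than a single flat array. The minimum weights are assumed values chosen only for illustration, and whether trimming trailing minima is safe in general depends on the well-formedness of the table (see the discussion of level separators and compression in UTS #10, Section 9, cited later in this thread).

    MIN_WEIGHT = {1: 0x0020, 2: 0x0002}   # assumed secondary/tertiary minima

    def sort_key(ces):
        # One tuple of weights per level instead of a flat array with 0000
        # separators; trailing minimum weights on the higher levels are
        # trimmed, which is the optimisation argued for in the quoted message.
        key = []
        for level in (0, 1, 2):
            weights = [ce[level] for ce in ces if ce[level] != 0]
            minimum = MIN_WEIGHT.get(level)
            while minimum is not None and weights and weights[-1] == minimum:
                weights.pop()
            key.append(tuple(weights))
        return tuple(key)

    # Tuples compare lexicographically level by level, so a string still
    # sorts before any of its extensions, which is the property the 0000
    # level separator guarantees in the flat form of the sort key.
    abc  = [(0x0706, 0x20, 0x02), (0x06D9, 0x20, 0x02), (0x06EE, 0x20, 0x02)]
    abcX = abc + [(0x0800, 0x20, 0x02)]
    print(sort_key(abc) < sort_key(abcX))   # True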
no sort key longer than 255 >>>> bytes (created from the start of the original texts): if you compare two >>>> sort keys and find that they are equal, and if both sort keys have this >>>> length of 255 bytes, then you'll compare the full original texts using the >>>> fast-compare algorithm: you don't need to store full sort keys in addition >>>> to the original texts. This can save lot of storage, provided that original >>>> texts are sufficiently discriminated by their start, and that cases where >>>> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >>>> >>>> For short texts however, truncated sortkeys may save time at the price >>>> of a reasonnable storage cost (but sortkeys can be also encoded with >>>> roughly the same size as the original text: compression is modest for the >>>> encoded primary level. But compression is frequently very effective for >>>> higher levels where their smaller weight also have less possible variations >>>> of value, in a smaller set. >>>> >>>> Notably for the secondary level used to encode case differences, only 3 >>>> bits are enough per weight, and you just need to reserve the 3-bit value >>>> "000" as the "mark" for indicating the start of another higher level, while >>>> encoding secondary weights as "001" to "111". >>>> >>>> (This means that primary levels have to be encoded so that none of >>>> their encoded primary weights are starting with "000" marking the start of >>>> the secondary level. So primary weights can be encoded in patterns starting >>>> by "0001", "001", "01", or "1" and followed by other bits: this allows >>>> encoding them as readable UTF-8 if these characters are all different at >>>> primary level, excluding only the 16 first C0 controls which need to be >>>> preprocessed into escape sequences using the first permitted C0 control as >>>> an escape, and escaping that C0 control itself). >>>> >>>> The third level, started by the mark "00" and followed by the encoded >>>> weights indicating this is a tertiary level and not an higher level, will >>>> also be used to encode a small set of weights (in most locales, this is not >>>> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >>>> differential coding on 3-bits, you reserve "000" as the "mark" for the next >>>> higher level, then use "001" to "111" to encode differencial weights, the >>>> differencial weights being initially based on the minimum tertiary weight, >>>> you'll use the bit pattern "001" to encode the most frequent minimum >>>> tertiary weight, and patterns "01" to "11" plus additional bits to encode >>>> other positive or negative differences of tertiary weights, or to use >>>> run-length compression). Here also it is possible to map the patterns so >>>> that the encoded secondary weight will be readable valid UTF-8. >>>> >>>> The fourth level, started by the mark "000" can use the pattern "001" >>>> to encode the most frequent minimum quaternary weight, and patterns "010" >>>> to "011" followed by other bits to differentially encode the quaternary >>>> weights. Here again it is possible to create an encoding for quaternary >>>> weights that can use some run-length compression and can also be readable >>>> valid UTF-8! >>>> >>>> And so on. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 
22:04, Philippe Verdy a >>>> ?crit : >>>> >>>>> So it should be clear in the UCA algorithm and in the DUCET datatable >>>>> that "0000" is NOT a valid weight >>>>> It is just a notational placeholder used as ".0000", only indicating >>>>> in the DUCET format that there's NO weight assigned at the indicated level, >>>>> because the collation element is ALWAYS ignorable at this level. >>>>> The DUCET could have as well used the notation ".none", or just >>>>> dropped every ".0000" in its file (provided it contains a data entry >>>>> specifying what is the minimum weight used for each level). This notation >>>>> is only intended to be read by humans editing the file, so they don't need >>>>> to wonder what is the level of the first indicated weight or remember what >>>>> is the minimum weight for that level. >>>>> But the DUCET table is actually generated by a machine and processed >>>>> by machines. >>>>> >>>>> >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a >>>>> ?crit : >>>>> >>>>>> In summary, this step given in the algorithm is completely unneeded >>>>>> and can be dropped completely: >>>>>> >>>>>> *S3.2 *If L is not 1, append >>>>>> a *level separator* >>>>>> >>>>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>>>> lower than any weight in the resulting sort key. This guarantees that when >>>>>> two strings of unequal length are compared, where the shorter string is a >>>>>> prefix of the longer string, the longer string is always sorted after the >>>>>> shorter?in the absence of special features like contractions. For example: >>>>>> "abc" < "abcX" where "X" can be any character(s). >>>>>> >>>>>> Remove any reference to the "level separator" from the UCA. You never >>>>>> need it. >>>>>> >>>>>> As well this paragraph >>>>>> >>>>>> 7.3 Form Sort Keys >>>>>> >>>>>> *Step 3.* Construct a sort key for each collation element array by >>>>>> successively appending all non-zero weights from the collation element >>>>>> array. Figure 2 gives an example of the application of this step to one >>>>>> collation element array. >>>>>> >>>>>> Figure 2. Collation Element Array to Sort Key >>>>>> >>>>>> Collation Element ArraySort Key >>>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>>>> 0002 0002 0002 >>>>>> >>>>>> can be written with this figure: >>>>>> >>>>>> Figure 2. Collation Element Array to Sort Key >>>>>> >>>>>> Collation Element ArraySort Key >>>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>>>> >>>>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>>>> safely removed if they are respectively the minimum secondary weight and >>>>>> minimum tertiary weight. >>>>>> But note that 0020 is kept in two places as they are followed by a >>>>>> higher weight 0021. This is general for any tailored collation (not just >>>>>> the DUCET). >>>>>> >>>>>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>>>>> ?crit : >>>>>> >>>>>>> The 0000 is there in the UCA only because the DUCET is published in >>>>>>> a format that uses it, but here also this format is useless: you never need >>>>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>>>> just needs to indicate what is the minimum weight assigned for every level >>>>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>>>> >>>>>>> >>>>>>> Le jeu. 1 nov. 2018 ? 
21:08, Markus Scherer >>>>>>> a ?crit : >>>>>>> >>>>>>>> There are lots of ways to implement the UCA. >>>>>>>> >>>>>>>> When you want fast string comparison, the zero weights are useful >>>>>>>> for processing -- and you don't actually assemble a sort key. >>>>>>>> >>>>>>>> People who want sort keys usually want them to be short, so you >>>>>>>> spend time on compression. You probably also build sort keys as byte >>>>>>>> vectors not uint16 vectors (because byte vectors fit into more APIs and >>>>>>>> tend to be shorter), like ICU does using the CLDR collation data file. The >>>>>>>> CLDR root collation data file remunges all weights into fractional byte >>>>>>>> sequences, and leaves gaps for tailoring. >>>>>>>> >>>>>>>> markus >>>>>>>> >>>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 12:34:20 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 18:34:20 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> (Doug Ewell via Unicode's message of "Fri, 02 Nov 2018 08:38:45 -0700") References: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Message-ID: <86bm77o02r.fsf@mimuw.edu.pl> I have a feeling this discussion became too chaotic: about 90 posts in October and about 30 in November, all interesting but too many of them only loosely related to my original post. I propose to close the thread. I hope some time in the future to prepare a short summary (but first I would like to check some technical issues, so it will take some time). Thank you very much to all who contributed. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 13:52:05 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 19:52:05 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> Message-ID: <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: [quoted mail] > > Using variation selectors is only appropriate for these existing > (preencoded) superscript letters ? and ? so that they display the > appropriate (underlined or not underlined) glyph. And it is for forcing the display of DIGIT ZERO with a short stroke: 0030 FE00; short diagonal stroke form; # DIGIT ZERO https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt From that it becomes unclear why that isn?t applied to 4, 7, z and Z mentioned in this thread, to be displayed open or with a short bar. 
> It is not a solution for creating superscripts on any letters and > mark that it should be rendered as superscript (notably, the base > letter to transform into superscript may also have its own combining > diacritics, that must be encoded explicitly, and if you use the > varaition selector, it should allow variation on the presence or > absence of the underline (which must then be encoded explicitly as a > combining character. I totally agree that abbreviation indicating superscript should not be encoded using variation selectors, as already stated I don?t prefer it. > > So finally what we get with variation selectors is: variation selector, combining diacritic> and precombined with the diacritic, variation selector> which is NOT > canonically equivalent. That seems to me like a flaw in canonical equivalence. Variations must be canonically equivalent, and the variation selector position should be handled or parsed accordingly. Personally I?m unaware of this rule. > > Using a combining character avoids this caveat: combining diacritic, combining abbreviation mark> and precombined with the diacritic, combining abbreviation mark> which > ARE canonically equivalent. And this explicitly states the semantic > (something that is lost if we are forced to use presentational > superscripts in a higher level protocol like HTML/CSS for rich text > format, and one just extracts the plain text; using collation will > not help at all, except if collators are built with preprocessing > that will first infer the presence of a > to insert after each combining sequence of the plain-text enclosed in > a italic style). That exactly outlines my concern with calls for relegating superscript as an abbreviation indicator to higher level protocols like HTML/CSS. > > There's little risk: if the is not > mapped in fonts (or not recognized by text renderers to create > synthetic superscript scripts from existing recognized clusters), it > will render as a visible .notdef (tofu). But normally text renderers > recognize the basic properties of characters in the UCD and can see > that has a combining mark general > property (it also knows that it has a 0 combinjing class, so > canonical equivalences are not broken) to render a better symbols > than the .notdef "tofu": it should better render a dotted circle. > Even if this tofu or dotted circle is rendered, it still explicitly > marks the presence of the abbreviation mark, so there's less > confusion about what is preceding it (the combining sequence that was > supposed to be superscripted). The problem with the you are proposing is that it contradicts streamlined implementation as well as easy input of current abbreviations like ordinal indicators in French and, optionally, in English. Preformatted superscripts are already widely implemented, and coding of "4?" only needs two characters, input using only three fingers in two times (thumb on AltGr, press key E04 then E12) with an appropriately programmed layout driver. I?m afraid that the solution with would be much less straightforward. 
> > The can also have its own selector> to select other styles when they are optional, such as > adding underlines to the superscripted letter, or rendering the > letter instead as underscript, or as a small baseline letter with a > dot after it: this is still an explicit abbreviation mark, and the > meaning of the plein text is still preserved: the variation selector > is only suitable to alter the rendering of a cluster when it has > effectively several variants and the default rendering is not > universal, notably across font styles initially designed for specific > markets with their own local preferences: the variation selector > still allows the same fonts to map all known variants distinctly, > independantly of the initial arbitrary choice of the default glyph > used when the variation selector is missing). I don?t think German users would welcome being directed to input a plus a instead of a period. > > Even if fonts (or text renderers may map the mark> to variable glyphs, this is purely stylictic, the semantic of > the plain text is not lost because the > is still there. There's no need of any rich-text to encode it (the > rich -text styles are not explicitly encoding that a superscript is > actually an abbreviation mark, so it cannot also allow variation like > rendering an underscript, or a baseline small glyph with an added > dot. Typically a used in an English > style would render the letter (or cluster) before it as a "small" > letter without any added dot. The advantage of preformatted superscripts is that the English user can decide whether he or she wishes the ordinal indicators to be baseline or superscript, while being sure of stable rendering. > > So I really think that is far better > than: > > * using preencoded superscript letters (they don't map all the > necessary repertoire of clusters where the abbreviation is needed, > it now just covers Basic Latin, ten digits, plus and minus signs, and > the dot or comma, plus a few other letters like stops; As seen in this thread, preformatted superscripts are standardized and implemented to get combining diacritics, eg "??", "??". Encoding any more precomposed letters that can be represented as combining sequences is out of scope, and that is the reason why no accented letters will ever be encoded as preformatted superscripts. Correct display of the "S???" abbreviation for French "Soci?t?" ("Company") is already working in browsers, depending on the fonts present on the machine and set in the settings, unless a correct webfont is downloaded and installed ad hoc. > it's impossible to rencode the full Unicode repertoire and its allowed > combining sequences or extended default grapheme clusters!), This persistent and passionate refrain boils down, as already pointed by others and me in this thread, to a continuum bias strawman fight, (ie the refrain is repeated to fight a strawman constructed using the continuum bias, which consists in using a continuum to move someone?s position to an extreme position that is ultimately off-topic). 
> > * or using variation selectors to make them appear as a superscript > (does not work with all clusters containing other diacritics like > accents), > > * or using rich-text styling (from which you cannot safely > infer any semantic (there no warranty that Mr in HTML is > actually an abbreviation of "Mister"; in HTML this is encoded > elsewhere as Mr or > Mr (the semantic of the abbreviation has to > be looked a possible container element and the meaning of the > abbreviation is to look inside its title attribute, so obviously this > requires complex preprocessing before we can infer a plaintext > version (suitable for example in > plain-text searches where you don't want to match a mathematical > object M, like a matrix, elevated to the power r, or a single > plaintext M followed by a footnote call noted by the letter "r"). Indeed HTML is a powerful language to provide rich and meaningful content with many features, so that in comparison, plain text could seem unreadable because it contains all those abbreviations and symbols you need to know. By contrast, plain text in any natural language is to contain just enough information that it is readable for a native reader, and that is the purpose of Unicode. Therefore, dismissing superscript abbreviation indicators to higher level protocols is like looking at a language from outside and telling: ?These are abbreviations anyway, so you probably should also add tooltips for people to learn the meaning.? > > It solves all practical problems: legacy encoding using the > preencoded superscript Latin letters (aka "modifier letters") should > have never been used or needed (not even for IPA usage which could > have used an explicit for its > superscripted symbols, or for its distinctive "a" and "g"). We should > not have needed to encode the variants for "a" and "g": these were > old hacks that broke the Unicode character encoding model since the > beginning. The principle of Unicode is to encode anything that is semantically distinctive in plain text, so encoding IPA letters is totally OK. > However only roundtrip compatibility with legacy non UCS > charsets milited only for keeping the ordinal feminine or ordinal > masculine mark, or the "Numero" cluster (actually made of two > letters, the second one followed by an implicit abbreviation mark, > but transformed in the legacy charset to be treated as a single > unbreakable cluster containing only one symbol; even Unicode > considers the abbreviated Numero as being only "compatibility > equivalent" to the letter N followed by the masculine ordinal symbol, > the latter being also only "compatibility equivalent" to a letter o > with an implicit superscript, but also with an optional combining > underline). These pre-Unicode charsets are a proof that superscripts are required. > > All these superscripts in Unicode (as well as Mathematical "styled" > letters, which were also completely unnecessary and will necessarily > be incomplete for the intended usage) are now to be treated only as > legacy practices, they should be deprecated in favor of the more > semantic and logical character encoding model, deprecating complelely > the legacy visual encoding. Mathematicians like them, and even not being a mathematician, I feel that there are really a lot of styled ?alphabets to choose from?, as Ken Whistler advised on this list in 2015. What uncovered usages are you referring to? 
> > Only precombined characters, recognized by canonical equivalences are > part of the standard and may be kept as "non"-legacy: they still fit > in the logical encoding. As well the extended default grapheme > clusters include the precomposed Hangul LVT and LV syllables, and CGJ > used before combining marks with non-zero combining class, and > variation selectors used only after base letters with the zero > combining class and that start the extended default graphgeme > clusters. > > Let's return to the root of the far better logical encoding which > remains the recommended practice. All the rest is legacy (some of > them came from decision taken to preserve roundtrip compatibility > with legacy charsets, including prepended letters in Thai, and so we > have a few compatibility characters (which are not the recommended > practive), but the rest was bad decisions made by Unicode and ISO WG > to break the logical character encoding model. That criticism only applies to presentation forms, that Unicode was forced to take in at setup, and whose use Unicode ever discouraged, as seen also in this thread. So all languages using superscript to indicate abbreviations are still better served with preformatted superscript letters. The new turn is that many languages, eg Italian, Polish, Portuguese and Spanish, need variation sequences for single or double underscoring, which whill work with OpenType fonts having the appropriate glyph sets, while the variation selector is ignorable for most other machine processing purposes. Best regards, Marcel From unicode at unicode.org Fri Nov 2 15:10:30 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 20:10:30 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> References: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Message-ID: <20181102201030.5d0fa3a6@JRWUBU2> On Fri, 02 Nov 2018 08:38:45 -0700 Doug Ewell via Unicode wrote: > Do we have any other evidence of this usage, besides a single > handwritten postcard? What, beyond some of us actually employing it ourselves? I'm sure I've seen 'William' abbreviated in print to 'W?' with some mark below, but I couldn't lay my hands on an example. Richard. From unicode at unicode.org Fri Nov 2 16:27:37 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 2 Nov 2018 14:27:37 -0700 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > I was replying not about the notational repreentation of the DUCET > data table (using [.0000...] unnecessarily) but about the text of > UTR#10 itself. Which remains highly confusive, and contains completely > unnecesary steps, and just complicates things with absoiluytely no > benefit at all by introducing confusion about these "0000". Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion. > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." 
The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions. > (but the notation is used for TWO distinct purposes: one is for > presenting the notation format used in the DUCET It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm. > itself to present how collation elements are structured, the other one > is for marking the presence of a possible, but not always required, > encoding of an explicit level separator for encoding sort keys). That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way! > > UTR#10 is still needlessly confusive. O.k., if you think so, you then know what to do: https://www.unicode.org/review/pri385/ and https://www.unicode.org/reporting.html > Even the example tables can be made without using these "0000" (for > example in tables showing how to build sort keys, it can present the > list of weights splitted in separate columns, one column per level, > without any "0000". The implementation does not necessarily have to > create a buffer containing all weight values in a row, when separate > buffers for each level is far superior (and even more efficient as it > can save space in memory). The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10. https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance > The step "S3.2" in the UCA algorithm should not even be there (it is > made in favor an specific implementation which is not even efficient > or optimal), That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. You are, however, duly warned in: https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators that "While this technique is relatively easy to implement, it can interfere with other compression methods." > it complicates the algorithm with absoluytely no benefit at all); you > can ALWAYS remove it completely and this still generates equivalent > results. No you cannot ALWAYS remove it completely. Whether or not your implementation can do so, depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization. If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant result for comparison of strings when you are done. 
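For reference, the sort-key step being discussed is short enough to transcribe directly. A sketch in Python of Section 7.3, Step 3 (including step S3.2), applied to the Figure 2 collation element array quoted earlier in the thread:

    def form_sort_key(ces):
        # UTS #10, Section 7.3, Step 3: for each level, append all non-zero
        # weights of that level; between levels, append the 0000 level
        # separator (step S3.2). Because 0000 is lower than every real
        # weight, a string always sorts before its extensions ("abc" < "abcX").
        key = []
        for level in range(3):
            if level != 0:
                key.append(0x0000)
            key.extend(ce[level] for ce in ces if ce[level] != 0)
        return key

    # The collation element array from Figure 2 of UTS #10:
    figure2 = [(0x0706, 0x0020, 0x0002),
               (0x06D9, 0x0020, 0x0002),
               (0x0000, 0x0021, 0x0002),
               (0x06EE, 0x0020, 0x0002)]

    print(" ".join(f"{w:04X}" for w in form_sort_key(figure2)))
    # -> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002

The output matches the sort key shown in Figure 2, with the two 0000 values acting only as level separators, never as weights taken from the table.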
And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusive because the collation weight 0000 is "unnecessary". --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 17:32:29 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 22:32:29 +0000 Subject: use vs mention (was: second attempt) In-Reply-To: <20181101074640.2866a022@JRWUBU2> References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> <20181101074640.2866a022@JRWUBU2> Message-ID: <20181102223229.2b593ffa@JRWUBU2> On Thu, 1 Nov 2018 07:46:40 +0000 Richard Wordingham via Unicode wrote: > On Wed, 31 Oct 2018 23:35:06 +0100 > Piotr Karocki via Unicode wrote: > > > These are only examples of changes in meaning with or , > > not all of these examples can really exist - but, then, another > > question: can we know what author means? And as carbon and iodine > > cannot exist, then of course CI should be interpreted as carbon on > > first oxidation? > > Are you sure about the non-existence? Some pretty weird > chemical species exist in interstellar space. It's not interstellar, but CI is the empirical formula for diiodoethyne and its isomer iodoiodanuidylethyne, and the CI? ion has Pubchem CID 59215341. Richard. From unicode at unicode.org Fri Nov 2 20:34:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 3 Nov 2018 01:34:58 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: <20181103013458.3e0a968d@JRWUBU2> On Fri, 2 Nov 2018 14:27:37 -0700 Ken Whistler via Unicode wrote: > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > UTR#10 still does not explicitly state that its use of "0000" does > > not mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and > normatively referred to in the specification of the algorithm. See > UTS10-D8 (and subsequent definitions), which explicitly depend on a > definition of "A collation weight whose value is zero." The entire > statement of what are primary, secondary, tertiary, etc. collation > elements depends on that definition. And see the tables in Section > 3.2, which also depend on those definitions. The definition is defective in that it doesn't handle 'large weight values' well. There is the anomaly that a mapping of collating element to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the table ill-formed. The fractional weight definitions for UCA eliminate this '0000' notion quite well, and I once expected the UCA to move to the CLDRCA (CLDR Collation Algorithm) fractional weight definition. The definition of the CLDRCA does a much better job of explaining 'large weight values'. It turns them from something exceptional to a normal part of its functioning. 
> > (but the notation is used for TWO distinct purposes: one is for > > presenting the notation format used in the DUCET > > It is *not* just a notation format used in the DUCET -- it is part of > the normative definitional structure of the algorithm, which then > percolates down into further definitions and rules and the steps of > the algorithm. It's not needed for the CLDRCA! The statement of the UCA algorithm does depend on its notation, but it can be recast to avoid these zero weights. Richard. From unicode at unicode.org Sat Nov 3 14:41:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 20:41:54 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: > [quoted mail] > > > > Using variation selectors is only appropriate for these existing > > (preencoded) superscript letters ? and ? so that they display the > > appropriate (underlined or not underlined) glyph. > > And it is for forcing the display of DIGIT ZERO with a short stroke: > 0030 FE00; short diagonal stroke form; # DIGIT ZERO > https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt > > From that it becomes unclear why that isn?t applied to 4, 7, z and Z > mentioned in this thread, to be displayed open or with a short bar. > > > It is not a solution for creating superscripts on any letters and > > mark that it should be rendered as superscript (notably, the base > > letter to transform into superscript may also have its own combining > > diacritics, that must be encoded explicitly, and if you use the > > varaition selector, it should allow variation on the presence or > > absence of the underline (which must then be encoded explicitly as a > > combining character. > > I totally agree that abbreviation indicating superscript should not be > encoded using variation selectors, as already stated I don?t prefer it. > > > > So finally what we get with variation selectors is: > variation selector, combining diacritic> and > precombined with the diacritic, variation selector> which is NOT > > canonically equivalent. > > That seems to me like a flaw in canonical equivalence. Variations must > be canonically equivalent, and the variation selector position should > be handled or parsed accordingly. Personally I?m unaware of this rule. > > > > Using a combining character avoids this caveat: > combining diacritic, combining abbreviation mark> and > precombined with the diacritic, combining abbreviation mark> which > > ARE canonically equivalent. 
And this explicitly states the semantic > > (something that is lost if we are forced to use presentational > > superscripts in a higher level protocol like HTML/CSS for rich text > > format, and one just extracts the plain text; using collation will > > not help at all, except if collators are built with preprocessing > > that will first infer the presence of a > > to insert after each combining sequence of the plain-text enclosed in > > a italic style). > > That exactly outlines my concern with calls for relegating superscript > as an abbreviation indicator to higher level protocols like HTML/CSS. > That's exactlky my concern that this relation to HTML/CSS should NOT occur at all ! It's really not the solution, HTML/CSS styles have NO semantic at all (I demonstrated it in the message you are quoting). > > There's little risk: if the is not > > mapped in fonts (or not recognized by text renderers to create > > synthetic superscript scripts from existing recognized clusters), it > > will render as a visible .notdef (tofu). But normally text renderers > > recognize the basic properties of characters in the UCD and can see > > that has a combining mark general > > property (it also knows that it has a 0 combinjing class, so > > canonical equivalences are not broken) to render a better symbols > > than the .notdef "tofu": it should better render a dotted circle. > > Even if this tofu or dotted circle is rendered, it still explicitly > > marks the presence of the abbreviation mark, so there's less > > confusion about what is preceding it (the combining sequence that was > > supposed to be superscripted). > > The problem with the you are proposing > is that it contradicts streamlined implementation as well as easy > input of current abbreviations like ordinal indicators in French and, > optionally, in English. Preformatted superscripts are already widely > implemented, and coding of "4?" only needs two characters, input > using only three fingers in two times (thumb on AltGr, press key > E04 then E12) with an appropriately programmed layout driver. I?m > afraid that the solution with would be > much less straightforward. > This is not a real concern: this is legacy old practives that should no longer be recommanded as it is ambiguous (nothing says that "4?" is an abbreviated ordinal, it can as well be 4 elevated to the power e, or various other things). Also the keys to press on a keyboard is absolutely not a concern: the same key presses you propose can as well generate the letter followed by the combining abbreviation mark. In fact what you propose is even less practical because it uses complex input for all characters and requires mapping keys on the whole alphabet (so it uses precious space on the key layout). It's just simpler for everyone to press "4", "e", followed by a combination (like AltGr+".") to produce the ! And these legacy superscript characters still are not warrantied to not have any underline (the variation may as well be significant), and there will never be enough superscript characters for the many superscript notations (not just abbreviations) that should still be encoded the normal letters (including in clusters, with diacritics, ligatures and so on): Unicode will never accept to reencode all existing letters (plus all the infinite set of clusters that can be formed with them) just to turn them into superscript/subscript variants. 
These encodings that found their way from the need of roundtrip compatibility of legacy charsets (before the UCS) should have never occured at all: these should have not even been tolerated for IPA symbols, for mathematical symbols (monospace, bold, italic...). The variation selector solution is also not suitable when the intent is only to add semantic to the encoded text and not drive the choice between glyph variants (when the default glyph without the variant selector can FREELY vary into forms that are UNACCEPTABLE in some contexts, then the variation does not really encode the semantic but encodes the visual rendering intent: it is too easily abuse to do something else). But a single *semantic* combining mark does not encode any visual rendering intent like what variation selectors do. They still allow glyphic variations as long as the the semantic is kept, and they have the correct fallbacks (there's no obscuring of the encoding of the clusters to which the semantic combining mark applies: they are still part of the same general encoding as normal letters, and rendering abbreviation mark does not necessarily means that the base cluster MUST be rendered differently than normal letters: it is permitted as well to render the combining mark for example as a dot, or as a true diacritic on top of the letters). And if needed the following can control the visual appearence: > > > > The can also have its own > selector> to select other styles when they are optional, such as > > adding underlines to the superscripted letter, or rendering the > > letter instead as underscript, or as a small baseline letter with a > > dot after it: this is still an explicit abbreviation mark, and the > > meaning of the plein text is still preserved: the variation selector > > is only suitable to alter the rendering of a cluster when it has > > effectively several variants and the default rendering is not > > universal, notably across font styles initially designed for specific > > markets with their own local preferences: the variation selector > > still allows the same fonts to map all known variants distinctly, > > independantly of the initial arbitrary choice of the default glyph > > used when the variation selector is missing). > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 15:02:23 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 21:02:23 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: As well the separate encoding of mathematical variants could have been completely avoided (we know that this encoding is not sufficient, so much that even LaTeX renderers simply don't need it or use it !). We could have just encoded a single to use after any base cluster, and the whole set was covered ! The additional distinction of visual variants (monospace, bold, italic...) 
would have been encoded using variation selectors after the : the semantic as a mathematical symbols was still preserved including the additional semantic for distinguishing some symbols in maths notations like "f(f)=f" where the 3 "f" must be distinguished (between the function in a set of functions, the source belonging to one set of values or being a variable, and the result in another set which may be a value or variable. Once again this covered all the needs without using this duplicate encoding (that was NEVER needed for roundtrip compatibility with legacy non-UCS charsets). All I ask is reasonnable: it's just a SINGLE code point to encode the combining mark itself, semantically, NOT visually. The visual appearance can be controlled by an additional variation selector to cancel the effect of glyph variations allowed for ALL characters in the UCS, where there's just a **non-mandatory** form generally used by default in fonts and matching more or less the "representative glyph" shown in the Unicode and ISO 10646 charts, which cannot show all allowed variations (if there's a need to detail them, Unicode offers the possibility to ask to register known "variation sequences" which can feed a supplementary chart showing more representative glyphs, one for each accepted "variation sequence", but without even needing to modify the "representative glyph" shown in the base chart. Note that even if Unicode requires registration of variation sequences prior to using them, the published charts still omit to add the additional charts (just below the existing base chart) showing representative glyphs for accepted sequences, with one small chart per base character, listing them simply ordered by "VSn" value. All what Unicode publishes is only a mere data list with some names (not enough for most users to be ware that variations can be encoded explicitly and compliantly) Le sam. 3 nov. 2018 ? 20:41, Philippe Verdy a ?crit : > > > Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < > unicode at unicode.org> a ?crit : > >> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: >> [quoted mail] >> > >> > Using variation selectors is only appropriate for these existing >> > (preencoded) superscript letters ? and ? so that they display the >> > appropriate (underlined or not underlined) glyph. >> >> And it is for forcing the display of DIGIT ZERO with a short stroke: >> 0030 FE00; short diagonal stroke form; # DIGIT ZERO >> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt >> >> From that it becomes unclear why that isn?t applied to 4, 7, z and Z >> mentioned in this thread, to be displayed open or with a short bar. >> >> > It is not a solution for creating superscripts on any letters and >> > mark that it should be rendered as superscript (notably, the base >> > letter to transform into superscript may also have its own combining >> > diacritics, that must be encoded explicitly, and if you use the >> > varaition selector, it should allow variation on the presence or >> > absence of the underline (which must then be encoded explicitly as a >> > combining character. >> >> I totally agree that abbreviation indicating superscript should not be >> encoded using variation selectors, as already stated I don?t prefer it. >> > >> > So finally what we get with variation selectors is: > > variation selector, combining diacritic> and > > precombined with the diacritic, variation selector> which is NOT >> > canonically equivalent. >> >> That seems to me like a flaw in canonical equivalence. 
Variations must >> be canonically equivalent, and the variation selector position should >> be handled or parsed accordingly. Personally I?m unaware of this rule. >> > >> > Using a combining character avoids this caveat: > > combining diacritic, combining abbreviation mark> and > > precombined with the diacritic, combining abbreviation mark> which >> > ARE canonically equivalent. And this explicitly states the semantic >> > (something that is lost if we are forced to use presentational >> > superscripts in a higher level protocol like HTML/CSS for rich text >> > format, and one just extracts the plain text; using collation will >> > not help at all, except if collators are built with preprocessing >> > that will first infer the presence of a >> > to insert after each combining sequence of the plain-text enclosed in >> > a italic style). >> >> That exactly outlines my concern with calls for relegating superscript >> as an abbreviation indicator to higher level protocols like HTML/CSS. >> > > That's exactlky my concern that this relation to HTML/CSS should NOT occur > at all ! It's really not the solution, HTML/CSS styles have NO semantic at > all (I demonstrated it in the message you are quoting). > > >> > There's little risk: if the is not >> > mapped in fonts (or not recognized by text renderers to create >> > synthetic superscript scripts from existing recognized clusters), it >> > will render as a visible .notdef (tofu). But normally text renderers >> > recognize the basic properties of characters in the UCD and can see >> > that has a combining mark general >> > property (it also knows that it has a 0 combinjing class, so >> > canonical equivalences are not broken) to render a better symbols >> > than the .notdef "tofu": it should better render a dotted circle. >> > Even if this tofu or dotted circle is rendered, it still explicitly >> > marks the presence of the abbreviation mark, so there's less >> > confusion about what is preceding it (the combining sequence that was >> > supposed to be superscripted). >> >> The problem with the you are proposing >> is that it contradicts streamlined implementation as well as easy >> input of current abbreviations like ordinal indicators in French and, >> optionally, in English. Preformatted superscripts are already widely >> implemented, and coding of "4?" only needs two characters, input >> using only three fingers in two times (thumb on AltGr, press key >> E04 then E12) with an appropriately programmed layout driver. I?m >> afraid that the solution with would be >> much less straightforward. >> > > This is not a real concern: this is legacy old practives that should no > longer be recommanded as it is ambiguous (nothing says that "4?" is an > abbreviated ordinal, it can as well be 4 elevated to the power e, or > various other things). > > Also the keys to press on a keyboard is absolutely not a concern: the same > key presses you propose can as well generate the letter followed by the > combining abbreviation mark. In fact what you propose is even less > practical because it uses complex input for all characters and requires > mapping keys on the whole alphabet (so it uses precious space on the key > layout). It's just simpler for everyone to press "4", "e", followed by a > combination (like AltGr+".") to produce the ! 
> > And these legacy superscript characters still are not warrantied to not > have any underline (the variation may as well be significant), and there > will never be enough superscript characters for the many superscript > notations (not just abbreviations) that should still be encoded the normal > letters (including in clusters, with diacritics, ligatures and so on): > Unicode will never accept to reencode all existing letters (plus all the > infinite set of clusters that can be formed with them) just to turn them > into superscript/subscript variants. These encodings that found their way > from the need of roundtrip compatibility of legacy charsets (before the > UCS) should have never occured at all: these should have not even been > tolerated for IPA symbols, for mathematical symbols (monospace, bold, > italic...). > > The variation selector solution is also not suitable when the intent is > only to add semantic to the encoded text and not drive the choice between > glyph variants (when the default glyph without the variant selector can > FREELY vary into forms that are UNACCEPTABLE in some contexts, then the > variation does not really encode the semantic but encodes the visual > rendering intent: it is too easily abuse to do something else). > But a single *semantic* combining mark does not encode any visual > rendering intent like what variation selectors do. They still allow glyphic > variations as long as the the semantic is kept, and they have the correct > fallbacks (there's no obscuring of the encoding of the clusters to which > the semantic combining mark applies: they are still part of the same > general encoding as normal letters, and rendering abbreviation mark does > not necessarily means that the base cluster MUST be rendered differently > than normal letters: it is permitted as well to render the combining mark > for example as a dot, or as a true diacritic on top of the letters). And if > needed the following can control the visual appearence: > >> > >> > The can also have its own > > selector> to select other styles when they are optional, such as >> > adding underlines to the superscripted letter, or rendering the >> > letter instead as underscript, or as a small baseline letter with a >> > dot after it: this is still an explicit abbreviation mark, and the >> > meaning of the plein text is still preserved: the variation selector >> > is only suitable to alter the rendering of a cluster when it has >> > effectively several variants and the default rendering is not >> > universal, notably across font styles initially designed for specific >> > markets with their own local preferences: the variation selector >> > still allows the same fonts to map all known variants distinctly, >> > independantly of the initial arbitrary choice of the default glyph >> > used when the variation selector is missing). >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From unicode at unicode.org Sat Nov 3 15:45:40 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 3 Nov 2018 21:45:40 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
 <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl>
 <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
Message-ID: 

As an additional remark, I find that Unicode is slowly abandoning its initial goals of encoding texts logically and semantically. This contrasts with the initial ISO 10646, which wanted to produce a giant visual encoding, based only on code charts (without any character properties except glyph names, and an almost mandatory "representative glyph" which in fact allowed no variation at all).

The initial ISO 10646 goal failed to reach global adoption. What proved to be extremely successful (and allowed easier processing of text, without limiting the variation of glyph designs needed and wanted for the orthography of human languages) was the Unicode character encoding model, based on logical, semantic encoding. This drove the worldwide adoption (and now the rapid abandonment of legacy charsets, all based on visual appearance and bare code charts, like the early ISO 10646 and all past 7-bit and 8-bit ISO standards, or other national standards, including in China, Japan and Europe, or those made and promoted by private hardware manufacturers or software providers, frequently with legal restrictions as well, such as MacRoman with its well-known Apple logo).

It is disheartening to see that Unicode does not resist this, and even now refuses the idea of adding just a few simple combining characters (which fit perfectly in its character encoding model, still allow efficient text processing, and render with reasonable fallbacks) that would explicitly encode the semantics. A good example in Latin: look at why the lowercase eth letter seems to have three codes: this is because they have different semantics but also map to different uppercase letters. Being able to transform letter case, and being able to use collation for plain-text search, are extremely useful features possible only because of Unicode character properties, and impossible with just the visual encoding and charts of ISO 10646; the same is true of Latin A versus Cyrillic A and Greek ALPHA: the semantics is the first goal to respect, thanks to Unicode character properties and the Unicode character model, but the visual encoding is definitely not a goal.

So before encoding characters in Unicode, the glyph variation is not enough (this occurs everywhere in human languages): you need proof, with contrasting pairs, showing that the glyph difference makes a semantic difference and requires different processing (different character properties).

Unicode has succeeded everywhere ISO 10646 failed: efficient processing of human languages with their wide variation of orthographies and visual appearance.
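To make the Latin/Cyrillic/Greek example concrete, here is a minimal sketch using Python's standard unicodedata module (the three capital letters are just an illustration; nothing here is specific to the eth case):

    import unicodedata

    # Latin A, Cyrillic A and Greek ALPHA look alike, but each is a distinct
    # character with its own properties (script, case mapping, collation).
    for ch in ("\u0041", "\u0410", "\u0391"):
        low = ch.lower()
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}",
              "-> lowercase", f"U+{ord(low):04X} {unicodedata.name(low)}")

    # A purely visual encoding that unified the three on the strength of the
    # shared glyph would make case mapping and language-aware searching
    # impossible without out-of-band information.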
The other goals (supporting technical notations, like IPA, maths, music, and now emojis!), driven by glyph requirements everywhere (mandated in their own relevant standard) is where Unicode can and even should promote the use of variation sequences, and definitely not dual encoding as this was done (Unicode abandoning its most useful goal, not resisting to the pressure of some industries: this has just created more issues, with more difficulties to correctly and efficiently process texts written in humane languages). The more Unicode evolves, the more I see that it will turn the UCS in what the ISO 10646 attempted to do (and failed): turn the UCS into a visual encoding, refusing to encode **efficiently** any semantic differences. And this will become a severe problems later with the constant evolution of humane languages. I press Unicode to maintain its "character encoding model" as the path to follow, and that it should be driven by semantic goals. It has every features needed for that : combining sequences (including CGJ because of canonical equivalences that were needed due to roundtrip compatibility with legacy non-UCS charsets), variation selectors (ONLY to optionally add some *semantic* restrictions in the largely allowed variation of glyphs and still preserve distinction between contrasting pairs, but NOT as a way to encode non-semantic styles), and character properties to allow efficient processing. Le sam. 3 nov. 2018 ? 21:02, Philippe Verdy a ?crit : > As well the separate encoding of mathematical variants could have been > completely avoided (we know that this encoding is not sufficient, so much > that even LaTeX renderers simply don't need it or use it !). > > We could have just encoded a single to use > after any base cluster, and the whole set was covered ! > > The additional distinction of visual variants (monospace, bold, italic...) > would have been encoded using variation selectors after the mathematical symbol>: the semantic as a mathematical symbols was still > preserved including the additional semantic for distinguishing some symbols > in maths notations like "f(f)=f" where the 3 "f" must be distinguished > (between the function in a set of functions, the source belonging to one > set of values or being a variable, and the result in another set which may > be a value or variable. > > Once again this covered all the needs without using this duplicate > encoding (that was NEVER needed for roundtrip compatibility with legacy > non-UCS charsets). > > All I ask is reasonnable: it's just a SINGLE code point to encode the > combining mark itself, semantically, NOT visually. > > The visual appearance can be controlled by an additional variation > selector to cancel the effect of glyph variations allowed for ALL > characters in the UCS, where there's just a **non-mandatory** form > generally used by default in fonts and matching more or less the > "representative glyph" shown in the Unicode and ISO 10646 charts, which > cannot show all allowed variations (if there's a need to detail them, > Unicode offers the possibility to ask to register known "variation > sequences" which can feed a supplementary chart showing more representative > glyphs, one for each accepted "variation sequence", but without even > needing to modify the "representative glyph" shown in the base chart. 
> > Note that even if Unicode requires registration of variation sequences > prior to using them, the published charts still omit to add the additional > charts (just below the existing base chart) showing representative glyphs > for accepted sequences, with one small chart per base character, listing > them simply ordered by "VSn" value. All what Unicode publishes is only a > mere data list with some names (not enough for most users to be ware that > variations can be encoded explicitly and compliantly) > > > Le sam. 3 nov. 2018 ? 20:41, Philippe Verdy a ?crit : > >> >> >> Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < >> unicode at unicode.org> a ?crit : >> >>> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: >>> [quoted mail] >>> > >>> > Using variation selectors is only appropriate for these existing >>> > (preencoded) superscript letters ? and ? so that they display the >>> > appropriate (underlined or not underlined) glyph. >>> >>> And it is for forcing the display of DIGIT ZERO with a short stroke: >>> 0030 FE00; short diagonal stroke form; # DIGIT ZERO >>> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt >>> >>> From that it becomes unclear why that isn?t applied to 4, 7, z and Z >>> mentioned in this thread, to be displayed open or with a short bar. >>> >>> > It is not a solution for creating superscripts on any letters and >>> > mark that it should be rendered as superscript (notably, the base >>> > letter to transform into superscript may also have its own combining >>> > diacritics, that must be encoded explicitly, and if you use the >>> > varaition selector, it should allow variation on the presence or >>> > absence of the underline (which must then be encoded explicitly as a >>> > combining character. >>> >>> I totally agree that abbreviation indicating superscript should not be >>> encoded using variation selectors, as already stated I don?t prefer it. >>> > >>> > So finally what we get with variation selectors is: >> > variation selector, combining diacritic> and >> > precombined with the diacritic, variation selector> which is NOT >>> > canonically equivalent. >>> >>> That seems to me like a flaw in canonical equivalence. Variations must >>> be canonically equivalent, and the variation selector position should >>> be handled or parsed accordingly. Personally I?m unaware of this rule. >>> > >>> > Using a combining character avoids this caveat: >> > combining diacritic, combining abbreviation mark> and >> > precombined with the diacritic, combining abbreviation mark> which >>> > ARE canonically equivalent. And this explicitly states the semantic >>> > (something that is lost if we are forced to use presentational >>> > superscripts in a higher level protocol like HTML/CSS for rich text >>> > format, and one just extracts the plain text; using collation will >>> > not help at all, except if collators are built with preprocessing >>> > that will first infer the presence of a >>> > to insert after each combining sequence of the plain-text enclosed in >>> > a italic style). >>> >>> That exactly outlines my concern with calls for relegating superscript >>> as an abbreviation indicator to higher level protocols like HTML/CSS. >>> >> >> That's exactlky my concern that this relation to HTML/CSS should NOT >> occur at all ! It's really not the solution, HTML/CSS styles have NO >> semantic at all (I demonstrated it in the message you are quoting). 
>> >> >>> > There's little risk: if the is not >>> > mapped in fonts (or not recognized by text renderers to create >>> > synthetic superscript scripts from existing recognized clusters), it >>> > will render as a visible .notdef (tofu). But normally text renderers >>> > recognize the basic properties of characters in the UCD and can see >>> > that has a combining mark general >>> > property (it also knows that it has a 0 combinjing class, so >>> > canonical equivalences are not broken) to render a better symbols >>> > than the .notdef "tofu": it should better render a dotted circle. >>> > Even if this tofu or dotted circle is rendered, it still explicitly >>> > marks the presence of the abbreviation mark, so there's less >>> > confusion about what is preceding it (the combining sequence that was >>> > supposed to be superscripted). >>> >>> The problem with the you are proposing >>> is that it contradicts streamlined implementation as well as easy >>> input of current abbreviations like ordinal indicators in French and, >>> optionally, in English. Preformatted superscripts are already widely >>> implemented, and coding of "4?" only needs two characters, input >>> using only three fingers in two times (thumb on AltGr, press key >>> E04 then E12) with an appropriately programmed layout driver. I?m >>> afraid that the solution with would be >>> much less straightforward. >>> >> >> This is not a real concern: this is legacy old practives that should no >> longer be recommanded as it is ambiguous (nothing says that "4?" is an >> abbreviated ordinal, it can as well be 4 elevated to the power e, or >> various other things). >> >> Also the keys to press on a keyboard is absolutely not a concern: the >> same key presses you propose can as well generate the letter followed by >> the combining abbreviation mark. In fact what you propose is even less >> practical because it uses complex input for all characters and requires >> mapping keys on the whole alphabet (so it uses precious space on the key >> layout). It's just simpler for everyone to press "4", "e", followed by a >> combination (like AltGr+".") to produce the ! >> >> And these legacy superscript characters still are not warrantied to not >> have any underline (the variation may as well be significant), and there >> will never be enough superscript characters for the many superscript >> notations (not just abbreviations) that should still be encoded the normal >> letters (including in clusters, with diacritics, ligatures and so on): >> Unicode will never accept to reencode all existing letters (plus all the >> infinite set of clusters that can be formed with them) just to turn them >> into superscript/subscript variants. These encodings that found their way >> from the need of roundtrip compatibility of legacy charsets (before the >> UCS) should have never occured at all: these should have not even been >> tolerated for IPA symbols, for mathematical symbols (monospace, bold, >> italic...). >> >> The variation selector solution is also not suitable when the intent is >> only to add semantic to the encoded text and not drive the choice between >> glyph variants (when the default glyph without the variant selector can >> FREELY vary into forms that are UNACCEPTABLE in some contexts, then the >> variation does not really encode the semantic but encodes the visual >> rendering intent: it is too easily abuse to do something else). 
>> But a single *semantic* combining mark does not encode any visual >> rendering intent like what variation selectors do. They still allow glyphic >> variations as long as the the semantic is kept, and they have the correct >> fallbacks (there's no obscuring of the encoding of the clusters to which >> the semantic combining mark applies: they are still part of the same >> general encoding as normal letters, and rendering abbreviation mark does >> not necessarily means that the base cluster MUST be rendered differently >> than normal letters: it is permitted as well to render the combining mark >> for example as a dot, or as a true diacritic on top of the letters). And if >> needed the following can control the visual appearence: >> >>> > >>> > The can also have its own >> > selector> to select other styles when they are optional, such as >>> > adding underlines to the superscripted letter, or rendering the >>> > letter instead as underscript, or as a small baseline letter with a >>> > dot after it: this is still an explicit abbreviation mark, and the >>> > meaning of the plein text is still preserved: the variation selector >>> > is only suitable to alter the rendering of a cluster when it has >>> > effectively several variants and the default rendering is not >>> > universal, notably across font styles initially designed for specific >>> > markets with their own local preferences: the variation selector >>> > still allows the same fonts to map all known variants distinctly, >>> > independantly of the initial arbitrary choice of the default glyph >>> > used when the variation selector is missing). >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 16:55:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 22:55:17 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: I can give other interesting examples about why the Unicode "character encoding model" is the best option Just consider how the Hangul alphabet is (now) encoded: its consonnant letters are encoded "twice" (leading and trailing jamos) because they carry semantic distinctions for efficient processing of Korean text where syllable boundaries are significant to disambiguate text ; this apparent "double encoding" also has a visual model (still currently employed) to *preferably* (not mandatorily) render syllables in a well defined square layout. But the square layout causes significant rendering issues (notably at small font sizes), so it is also possible to render the syllable by aligning letters horizontally. 
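As a minimal sketch of this leading/trailing (choseong/jongseong) distinction in practice, using Python's standard unicodedata module (the syllable 한 is an arbitrary example):

    import unicodedata

    syllable = "\uD55C"                       # 한 HANGUL SYLLABLE HAN
    jamos = unicodedata.normalize("NFD", syllable)

    # The precomposed syllable decomposes canonically into conjoining jamos:
    # a leading consonant (choseong), a vowel (jungseong) and a trailing
    # consonant (jongseong), so syllable boundaries remain unambiguous.
    for ch in jamos:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+1112  HANGUL CHOSEONG HIEUH
    # U+1161  HANGUL JUNGSEONG A
    # U+11AB  HANGUL JONGSEONG NIEUN

    # The recomposition is lossless, so the two spellings are equivalent:
    assert unicodedata.normalize("NFC", jamos) == syllable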
This horizontal layout was used in the "compatibility jamos" of old terminals and printers (but unfortunately without marking the syllable boundaries explicitly before groups of consonants, after them, or in the middle of a group); due to the need to preserve round-trip compatibility with the non-UCS encodings, the "compatibility jamos" had to be encoded separately, even if their use is no longer recommended for normal Korean text, which should explicitly encode syllable boundaries by distinguishing leading and trailing consonants (this is equivalent to the distinction of letter case in Latin: leading jamos in Hangul are exactly like our Latin capital consonants, trailing jamos in Hangul are exactly like our Latin small letters; the vowel jamos in Hangul, however, are unicameral... for now).

But Hangul is still a true alphabet (it is in fact much simpler than Greek or Cyrillic, and Latin is the most complex script in the world!). Thanks to this new (recommended) encoding of Hangul, which adopts a **semantic** and **logical** model, it is possible to process Korean text very efficiently (and in fact very simply). The earlier attempt at encoding Korean was made while the ISO 10646 goals were thought to be enough (so it was a **visual** encoding): it failed even though this earlier encoding entered the first versions of Unicode, and it created a severe precedent in which the stability of Unicode (and upward compatibility) was broken.

I can also cite the case of Egyptian hieroglyphs: there's still no way to render them correctly, because we lack the development of a stable orthography that would drive the encoding of the missing **semantic** characters (for this reason Egyptian hieroglyphs still require a higher-level protocol, as there's still no accepted orthographic norm that successfully represents all possible semantic variations, but also because research on old Egyptian hieroglyphs is still very incomplete). The same can be said about Mayan hieroglyphs. And because there's still no semantic encoding of real texts, it's almost impossible to process text in these scripts: the characters encoded are ONLY basic glyphs (we don't know what their allowed variations can be, so we cannot use them safely to compose combining sequences: they are merely a collection of symbols, not a human script).

In my opinion, there was absolutely no urgency to encode them in the UCS (except not resisting the pressure to allow fonts containing these glyphs to be interchanged; but it remains impossible to encode and compose complete text with only these fonts: you still need an orthographic convention, and there's still no consensus about it; as well, the standard higher-level protocols like HTML/CSS cannot compose them correctly and efficiently). This encoding was not necessary, as these fonts containing collections of glyphs could have remained encoded with a private-use convention, i.e. with PUAs required only by the attempted (but not agreed) protocols.

I think, on the contrary, that Visible Speech or Duployé shorthand will reach a point where they have developed a stable orthographic convention: there will be a standard, and this standard will request that Unicode encode the missing **semantic** characters. This path should also be followed now for encoding emojis (there's an early development of an orthography for them, done by Unicode itself, but I'm not sure this is part of its mission: emoji orthographic conventions should be made by a separate committee).
Unfortunately Unicode is starting to create this orthography without developing what should come with it: its integration in the Unicode "character encoding model" (which should then be reviewed to meet the goals wanted for the composition of emoji sequences). A clear set of character properties for emojis needs to be developed, and then the emoji subcommittee can work with it (like what the IRG does for ideographic scripts). But for now any revision of the emoji set adds new incompatibilities and inefficiencies in processing text correctly (for example, it's nearly impossible to define the boundaries between clusters of emojis). Just consider what is also still missing for Egyptian and Mayan hieroglyphs, or Visible Speech, or Duployé shorthand: please resist the pressure, and stop complicating the rules for emojis. We need rules, and these rules must be integrated in the character encoding model and in the first chapters of the Unicode Standard!

But please don't resist so much the legitimate goal of adding a few simple semantic characters that can greatly increase the usability and "universality" of the UCS: this can be done without continuously adding new duplicate encodings. The duplicate encodings can be kept, but should be considered only as legacy, i.e. like other "compatibility characters", no longer recommended but still usable. This should be just like the Hangul compatibility "half-width" jamos in the last block of the BMP, in which T and L consonants are not distinguished (only L consonants are encoded and are ambiguously reused for T consonants) and only TL clusters are unambiguous (but cannot be safely associated with surrounding T compatibility jamos, so it's impossible to compose them safely into syllabic squares, and impossible to determine some semantic differences if syllable boundaries can only be "guessed" with a heuristic and some dictionary lookup to find only the most probable meaning).

These legacy characters (introduced by Unicode itself, for bad reasons, or because the UTC did not resist some commercial pressure) have just polluted the UCS needlessly and complicated everything (and for a long time): they remain there as apparent duplicates but with no clear semantics, and cause various problems (including security problems): most of these "compatibility characters" are now strongly discouraged, or even forbidden in uses where security is an issue. And this is the case for almost all superscripts/subscripts (those not justified by round-trip compatibility with a legacy standard). But now Unicode must keep these characters in its own standard to preserve round-trip compatibility with its own initial versions!

But this does not mean that these characters cannot be deprecated and treated later as "compatibility characters", even if they are not part of the current standard normalizations NFKD and NFKC (which have limited legacy use). These NFKC and NFKD forms should now be replaced by two more convenient "Legacy Normalization Forms", which I would abbreviate as "NFLC" and "NFLD", very useful for example for the default collations in the DUCET or the CLDR "root" locale, except that they would not be frozen, like the existing NFKC and NFKD are, by the very limited "compatibility mappings" found in the historic main file of the UCD, which cannot follow the evolution of recommended best practices.
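As a rough idea of what such a layer could look like in code, here is a minimal sketch (Python, on top of the standard unicodedata module). Nothing here is an existing Unicode normalization form, and the single mapping entry is invented purely for illustration (it folds U+00AA FEMININE ORDINAL INDICATOR to a plain "a"); the real data would come from a versioned mapping file in the format described below:

    import unicodedata

    # (deprecated sequence, preferred sequence, Unicode version of deprecation)
    # Hypothetical entries only -- this is not UCD data.
    LEGACY_MAPPINGS = [
        ("\u00AA", "a", 10.0),
    ]

    def nfld(text, max_version=float("inf")):
        # Start from canonical decomposition, apply the legacy mappings already
        # in effect for the requested version, then re-apply NFD so that
        # NFD(NFLD(x)) == NFLD(x) holds and canonical equivalence is preserved.
        text = unicodedata.normalize("NFD", text)
        for old, new, since in LEGACY_MAPPINGS:
            if since <= max_version:
                text = text.replace(old, new)
        return unicodedata.normalize("NFD", text)

    def nflc(text, max_version=float("inf")):
        return unicodedata.normalize("NFC", nfld(text, max_version))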
Unlike NFKC and NFKD, the NFLC and NFLD would be an extensible superset based on MUTABLE character properties (this can also be "decompositions mappings" except that once a character is added to the new property file, they won't be removed, and can have some stability as well, where the decision to "deprecate" old encodings can only be done if there's a new recommandation, and that if ever this recommandation changes and is deprecated, the previous "legacy decomposition mappings" can still be decomposed again to the new decompositions recommanded): unlike NFKC, and NFKD, a "legacy decomposition" is not "final" in all future versions, and a future version may remap them by just adding new entries for the new characters considered to be "legacy" and no longer recommended. This new properties file would allow evolution and adaptation to humane languages, and will allow correcting past errors in the standard. This file should have this form: # deprecated codepoint(s) ; new preferred sequence ; Unicode version ins which it was deprecated 101234 ; 101230 0300... ; 10.0 This file can also be used to deprecate some old variation sequences, or some old clusters made of multiple characters that are isolately not deprecated. Thanks. Le sam. 3 nov. 2018 ? 21:45, Philippe Verdy a ?crit : > As an additional remark, I find that Unicode is slowly abandoning its > initial goals of encoding texts logically and semantically. This was > contrasting to the initial ISO 106464 which wanted to produce a giant > visual encoding, based only on code charts (without any character > properties except glyph names and an almost mandatory "representative > glyph" which allowed in fact no variation at all). > > The initial ISO 10646 goal failed to reach a global adoption. What proved > to be extremely successful (and allowed easier processing of text, without > limiting the variation of glyph designs needed and wanted for the > orthography of human languages) was the Unicode character encoding model, > based on logical semantic encoding. This drove the worldwide adoption (and > now the fast abandon of legacy charsets, all based on visual appearance and > basic code charts, like in ISO 10646 and all past 7-bit and 8-bit ISO > standards, or other national standards, including in China, Japan, Europe, > or made and promoted by private hardware manufacturers or software > providers, frequently as well with legal restrictions such as MacRoman with > its well known Apple logo) > > It is desesperating to see that Unicode does not resist to that, and even > now refuses the idea of adding just a few simple combining characters (that > fit perfectly in its character encoding model, and still allows efficient > text processing, and rendering with reasonnable fallbacks) that will > explicitly encode the semantics (a good example in Latin: look at why the > lower case eth letter seems to have three codes: this is because theiy have > different semantics but also map to different uppercase letters, and being > able to transform letter cases, and being able to use collation for > plain-text search is an extremely useful feature possible only because of > Unicode character properties, but impossible to do with just the visual > encoding and charts of ISO 10646; the same is true about Latin A versus > Cyrillic A and Greek ALPHA: the semantics is the first goal to respect, > thanks to Unicode character properties and the Unicode character model, but > the visual encoding is definitely not a goal). 
> > So before encoding characters in Unicode, the glyph variation is not > enough (this occurs everywhere in humane languages): you need a proof with > contrasting pairs, showing that the glyph difference makes a semantic > difference and requires different processing (different character > properties). > > Unicode has succeeded everywhere ISO 10646 has failed: efficient > processing of humane languages with their wide variation of orthographies > and visual appearance. The other goals (supporting technical notations, > like IPA, maths, music, and now emojis!), driven by glyph requirements > everywhere (mandated in their own relevant standard) is where Unicode can > and even should promote the use of variation sequences, and definitely not > dual encoding as this was done (Unicode abandoning its most useful goal, > not resisting to the pressure of some industries: this has just created > more issues, with more difficulties to correctly and efficiently process > texts written in humane languages). > > The more Unicode evolves, the more I see that it will turn the UCS in what > the ISO 10646 attempted to do (and failed): turn the UCS into a visual > encoding, refusing to encode **efficiently** any semantic differences. And > this will become a severe problems later with the constant evolution of > humane languages. > > I press Unicode to maintain its "character encoding model" as the path to > follow, and that it should be driven by semantic goals. It has every > features needed for that : combining sequences (including CGJ because of > canonical equivalences that were needed due to roundtrip compatibility with > legacy non-UCS charsets), variation selectors (ONLY to optionally add some > *semantic* restrictions in the largely allowed variation of glyphs and > still preserve distinction between contrasting pairs, but NOT as a way to > encode non-semantic styles), and character properties to allow efficient > processing. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Nov 3 17:36:36 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 23:36:36 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: > > Unlike NFKC and NFKD, the NFLC and NFLD would be an extensible superset > based on MUTABLE character properties (this can also be "decompositions > mappings" except that once a character is added to the new property file, > they won't be removed, and can have some stability as well, where the > decision to "deprecate" old encodings can only be done if there's a new > recommandation, and that if ever this recommandation changes and is > deprecated, the previous "legacy decomposition mappings" can still be > decomposed again to the new decompositions recommanded): unlike NFKC, and > NFKD, a "legacy decomposition" is not "final" in all future versions, and a > future version may remap them by just adding new entries for the new > characters considered to be "legacy" and no longer recommended. This new > properties file would allow evolution and adaptation to humane languages, > and will allow correcting past errors in the standard. This file should > have this form: > > # deprecated codepoint(s) ; new preferred sequence ; Unicode version in > which it was deprecated > 101234 ; 101230 0300... ; 10.0 > > This file can also be used to deprecate some old variation sequences, or > some old clusters made of multiple characters that are isolately not > deprecated. > Another note: - this new decomposition mapping file for NFLC and NFLD, where NFLC is defined to be NFC(NFLD), has some stability requirements and it must be warrantied that NFD(NFLD) = NFD: the "legacy mapping forms" must be a conforming process respecting the canonical equivalences: - Unlike in the main UCD file for canonical decompositions, the decompositions listed there are not limited to map one character to one or two characters. - The first column should be given in NFC form; the NFD form may also be used, this does not change the result. It is NOT required that the 1st column is in NFKC or NFKD forms (so the decompositions previously recommanded by a "compatibility mapping" in the main UCD can be ignored: it was just a suggestion and a requirement only for NFKC and NFKD). This allows NFLC and NFLD to correct past errors in the frozen permanently NFKC and NFKD decompositions. - the mapping done here is permanent but versioned (by the first version of Unicode deprecating a character or sequence). Being permanent means that the deprecation cannot be removed, but it can still be changed if the target string (preferably listed in NFC form) contains some newly deprecated characters (that will be added separately. - if the target of the mapping contains other deprecated characters or sequences (added to the same file), the decompositions listed there becomes recursive: a derived datafile can be produced listing only the new recommended mappings. 
- if a source string "SATB" is canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will not just decompose "SATB" into NFD("XTB"), but will also decompose "SBTA" into NBT("XBT"). - if a source string "SATB" is NOT canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will not decompose "SATB" into NFD("XTB"), but will not automatically decompose "SBTA" into NBT("XBT") Then the CLDR project can use NFL(C/D) as a better source for deriving collation elements (in the DUCET or root locale) instead of NFK(C/D) which will follow the new recommandations and will correctly adapt the collation orders for legacy encodings. Tailored collations (per-locale) are not required to use compatibility mappings in the main UCD file, or in this file, they'll use it only if they are based on the DUCET or the collation order of the "root" locale. For that purpose, tailored collations may specify an alternate set of "compatibility or legacy mappings" (to apply after NFC or NFD normalization which is still required). May be the CLDR projects would like to have these derived collation elements to be orderable (so that it can infer and order the new relative weights needed for ordering strings containing "legacy characters") but it may require another column in the legacy mappings datafile (in my opinion the "Unicode version" field already offers by default a suitable relative ordering) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 17:38:24 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 23:38:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: Le sam. 3 nov. 2018 ? 23:36, Philippe Verdy a ?crit : > - this new decomposition mapping file for NFLC and NFLD, where NFLC is >> defined to be NFC(NFLD), has some stability requirements and it must be >> warrantied that NFD(NFLD) = NFD >> > Oops! fix my typo: it must be warrantied that NFD(NFLD) = NFLD -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 17:50:52 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 3 Nov 2018 22:50:52 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> When the topic being discussed no longer matches the thread title, somebody should start a new thread with an appropriate thread title. 
From unicode at unicode.org Sat Nov 3 18:05:39 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 00:05:39 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: It should be noted that the algorithmic complexity for this NFLD normalization ("legacy") is exactly the same as for NFKD ("compatibility"). However NFLD is versioned (like also NFLC), so NFLD can take a second parameter: the maximum Unicode version which can be used to filter which decomposition mappings are usable (they indicate the first minimal version where the mapping applies). It is even possible to allow a "legacy" normalization to be changed in a later version for the same source string: # deprecated codepoint(s) ; new preferred sequence ; Unicode version in which it was deprecated 101234 ; 101230 0300... ; 10.0 101234 ; 101240 0301... ; 11.0 It is also possible to add other filters to these recommanded new encodings, for example a language (or a BCP 47 locale identifier): 101234 ; 101230 0300 ; 10.0 ; fr 101234 ; 101240 0301... ; 10.0 (here starting in the same version 10.0, the new recommandation is to replace <101234> by <101240 0301> in all languages except French (BCP47 rules) where <101230 0300> should be used instead). In that case, the NFKD normalization can be viewed as if it was an historic version of NFLD, or a specialisation of NFLD for a "compatibility locale" (using "u-nfk" as a BCP 47 locale identifier???), independant of the unicode version (you can specify any version in the parameters of the NFLD or NFLC functions, and the locale identifier can be set to "u-nkf"). The complete parameters for NFLD (or NFLC) are : NFLD(text, version, locale) -> returns a text in NFD form NFLC(text, version, locale) -> returns a text in NFC form The default version is the latest supported version of Unicode, the default locale is "root" (in CLDR) or the same as the DUCET in Unicode, but should not be "u-nfk". And so: NFKD(text) = NFLD(text, 8.0, "u-nfk") = NFLD(text, 12.0, "u-nfk") = NFLD(text, "u-nfk") = NFD(NFLD(text, "u-nfk")) NFKC(text) = NFLC(text, 8.0, "u-nfk") = NFLC(text, 12.0, "u-nfk") = NFLC(text, "u-nfk") = NFC(NFLC(text, "u-nfk")) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 19:03:30 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 01:03:30 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > I was replying not about the notational repreentation of the DUCET data > table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. 
> Which remains highly confusive, and contains completely unnecesary steps, > and just complicates things with absoiluytely no benefit at all by > introducing confusion about these "0000". > > Sorry, Philippe, but the confusion that I am seeing introduced is what you > are introducing to the unicode list in the course of this discussion. > > > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elements depends on that > definition. And see the tables in Section 3.2, which also depend on those > definitions. > Ok is is a valid "weight" when taken *isolately*, but it is invalid as a weight at any level. This does not change the fact because weights are always relative to a specific level for which they are defined, and 0000 does not belong to any one. This weight is completely artificial and introduced completely needlessly: all levels are completely defined by a closed range of weights, all of them being non-0000, and all ranges being numerically separated (with the primary level using the largest range). I can reread again and again (even the sections you cite), but there's absolutely NO need of this articificial "0000" anywhere (any clause introducing it or using it to define something can be safely removed) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 19:46:37 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 4 Nov 2018 00:46:37 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: <14422e40-f80a-92ee-1ae8-441c98988393@gmail.com> Possible new thread titles include: Re: NFKD vs. NFLD (was Re: ...) Re: Man's inhumanity to humane scripts (was Re: ...) Re: Mayan and Egyptian hieroglyphs prove emoji pollute the character encoding model (was Re: ...) Re: Polynomials and the decline of western civilization (was Re: ...) From unicode at unicode.org Sat Nov 3 20:33:32 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 02:33:32 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > I was replying not about the notational repreentation of the DUCET data > table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. > Which remains highly confusive, and contains completely unnecesary steps, > and just complicates things with absoiluytely no benefit at all by > introducing confusion about these "0000". > > Sorry, Philippe, but the confusion that I am seeing introduced is what you > are introducing to the unicode list in the course of this discussion. 
> > > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elements depends on that > definition. And see the tables in Section 3.2, which also depend on those > definitions. > > (but the notation is used for TWO distinct purposes: one is for presenting > the notation format used in the DUCET > > It is *not* just a notation format used in the DUCET -- it is part of the > normative definitional structure of the algorithm, which then percolates > down into further definitions and rules and the steps of the algorithm. > I insist that this is NOT NEEDED at all for the definition, it is absolutely NOT structural. The algorithm still guarantees the SAME result. It is ONLY used to explain the format of the DUCET and the fact the this format does NOT use 0000 as a valid weight, ans os can use it as a notation (in fact only a presentational feature). > itself to present how collation elements are structured, the other one is > for marking the presence of a possible, but not always required, encoding > of an explicit level separator for encoding sort keys). > > That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It > is not part of the *notation* for collation elements, but instead is a > magic value chosen for the level separator precisely because zero values > from the collation elements are removed during sort key construction, so > that zero is then guaranteed to be a lower value than any remaining weight > added to the sort key under construction. This part of the algorithm is not > rocket science, by the way! > Here again you make a confusion: a sort key MAY use them as separators if it wants to compress keys by reencoding weights per level: that's the only case where you may want to introduce an encoding pattern starting with 0, while the rest of the encoding for weights in that level must using patterns not starting by this 0 (the number of bits to encode this 0 does not matter: it is only part of the encoding used on this level which does not necessarily have to use 16-bit code units per weight. > > Even the example tables can be made without using these "0000" (for > example in tables showing how to build sort keys, it can present the list > of weights splitted in separate columns, one column per level, without any > "0000". The implementation does not necessarily have to create a buffer > containing all weight values in a row, when separate buffers for each level > is far superior (and even more efficient as it can save space in memory). > > The UCA doesn't *require* you to do anything particular in your own > implementation, other than come up with the same results for string > comparisons. > Yes I know, but the algorithm also does not require me to use these invalid 0000 pseudo-weights, that the algorithm itself will always discard (in a completely needless step)! > That is clearly stated in the conformance clause of UTS #10. 
> > https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance > > The step "S3.2" in the UCA algorithm should not even be there (it is made > in favor an specific implementation which is not even efficient or optimal), > > That is a false statement. Step S3.2 is there to provide a clear statement > of the algorithm, to guarantee correct results for string comparison. > You're wrong, this statement is completely useless in all cases. There is still the correct results for string comparison without them: a string comparison can only compare valid weights for each level, it will not compare any weight past the end of the text in any one of the two compared strings, nowhere it will compare weights with one of them being 0, unless this 0 is used as a "guard value" for the end of text and your compare loop still continues scanning the longer string when the other string has already ended (this case should be detected much earlier before determineing the next collection boundary in the string and then computing its weights for each level. > Section 9 of UTS #10 provides a whole lunch buffet of techniques that > implementations can choose from to increase the efficiency of their > implementations, as they deem appropriate. You are free to implement as you > choose -- including techniques that do not require any level separators. > You are, however, duly warned in: > > > https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators > > that "While this technique is relatively easy to implement, it can > interfere with other compression methods." > > it complicates the algorithm with absoluytely no benefit at all); you can > ALWAYS remove it completely and this still generates equivalent results. > > No you cannot ALWAYS remove it completely. Whether or not your > implementation can do so, depends on what other techniques you may be using > to increase performance, store shorter keys, or whatever else may be at > stake in your optimization > I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However you MAY ADD them ONLY when generating and encoding the sort keys, if the encoding used really does compress the weights into smaller values: this is the only case where you want to ADD a separator, internally only in the binary key encoder, but but as part of the algorithm itself. If your key generation does not use any compression (in the simplest implementations), then it can simply an directly concatenate all weights with the same code units size (16-bit in the DUCET), without inserting any additional 0000 code unit to separate them: your resulting sort key will still not contain any 0000 code unit in any part for any level because the algorithm already has excluded them. Finally this means that sort keys can be stored in C-strings (terminated by null code units, instead of being delimited by a separately encoded length property, but for C-strings where code units are 8-bit, i.e. "char" in C, you still need an encoder to convert the 16-bit binary weights into sequences of bytes not containing any 00 byte: if this encoder is used, still you don't need any 00 separator between encoded levels!). As all these 0000 weigths are unnecessary, then the current UCA algorithm trying to introduce them needlessly is REALLY introducing unnecessary confusion: values of weights NEVER need to be restricted. 
The only conditions that matter are that:

- all weights are *comparable* (sign does not even matter; they are not even restricted to be numbers or even just integers);
- they are **fully ordered**, and the fully ordered set of weights is not necessarily an enumerable set or a discrete set (it can be the continuous set of real numbers);
- the full set of weights is **fully partitioned** into distinct intervals (with no intersection between intervals, so intervals are also comparable);
- the highest interval is used by the weights of the primary level: each partition is numbered by the level (a positive integer between 1 and L), so you can compare the level numbers assigned to the partitions in which two weights are members: if level(weight1) > level(weight2) (a comparison of positive integers), then necessarily weight1 < weight2 (this is only comparing weights encoded arbitrarily, which can still use a 0 value if you wish to use it to encode a valid weight for a valid collation element at any level 1 to N; this is also the only condition needed to respect rule WF2 in UCA).

---

Notes about encodings for weights in sort keys:

If weights are chosen to be rational numbers, e.g. any rational numbers in the open interval (0.0, 1.0), then, because your collation algorithm will only recognize a finite set of distinct collation elements with necessarily a finite number N of distinct weights w(i), for i in 0..(N-1), the collation weights can be represented by choosing them **arbitrarily** within this open interval:

- this can be done simply by partitioning (0.0, 1.0) into N half-open intervals [w(i), w(i+1));
- and then encoding a weight w(i) by any **arbitrarily chosen rational** inside one of these intervals (for example, this is how compression with arithmetic coding can be applied).

A weight encoding using a finite discrete set (of binary integers between 0 and M-1) is what you need to use classic Huffman coding: this is equivalent to multiplying the previous rationals by M and truncating them to the floor integer, but as this limits the choice of the rational numbers above so that distinct weights remain distinct in the binary encoding, you need to keep more significant bits with Huffman coding than with arithmetic coding (i.e. you need a higher value of M, where M is typically a power of 2 when using 1-bit code units, or a power of 256 for the simpler encodings using 8-bit code units, or a power of 65536 for an uncompressed encoding of 16-bit weight values).

Arithmetic coding is in fact equivalent to Huffman coding, except that M is not necessarily a positive integer but can be any positive rational, and it can then represent each weight value with a rational number of bits on average, instead of a static integer number of bits. You can say as well that Huffman coding is a restriction of arithmetic coding where M must be an integer, or that arithmetic coding is a generalization of Huffman coding. Both Huffman and arithmetic coding are well-known examples of "prefix coding" (the latter offering a bit more compression for the same statistical distribution of encoded values).

The open interval (0.0, w(0)) is still not used at all to encode weights, but it can still have a statistical distribution, usable with the prefix encoding to represent the end of the string. But here again this does not represent the artificial 0000 weight, which is NEVER encoded anywhere.
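To make the point of contention easier to follow, here is a toy version of the sort-key step as UTS #10 section 7.3 describes it (non-zero weights appended level by level, zero weights skipped, and an optional level separator between levels). The three-entry weight table is invented for the example and is not DUCET data:

    # Toy collation elements: (primary, secondary, tertiary) per character.
    # The values are made up; real weights come from the DUCET / CLDR root.
    WEIGHTS = {
        "a": (0x1C47, 0x0020, 0x0002),
        "b": (0x1C60, 0x0020, 0x0002),
        "\u0301": (0x0000, 0x0024, 0x0002),   # combining acute: no primary weight
    }

    def sort_key(text, level_separator=True):
        elements = [WEIGHTS[ch] for ch in text]
        key = []
        for level in range(3):
            if level > 0 and level_separator:
                key.append(0x0000)        # the level separator being debated
            # Zero weights are simply skipped when the key is formed.
            key.extend(e[level] for e in elements if e[level] != 0)
        return key

    def fmt(key):
        return " ".join(f"{w:04X}" for w in key)

    print(fmt(sort_key("ab")))        # 1C47 1C60 0000 0020 0020 0000 0002 0002
    print(fmt(sort_key("a\u0301")))   # 1C47 0000 0020 0024 0000 0002 0002
    print(fmt(sort_key("ab", level_separator=False)))  # 1C47 1C60 0020 0020 0002 0002

Whether the 0000 separators can be omitted is exactly the technique discussed under "Eliminating level separators" in section 9 of UTS #10, with the caveat quoted above.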
--- Ask to a mathematician you trust, he will confirm that these rules speaking about the pseudo-weight 0000 in UCA are completely unnecessary (i.e. removing them from the algorithm does not change the result for comparing strings, or for generating sort keys) And as a conclusion, attempting to introduce them in the standard creates more confusion than it helps (in fact it is most probably a relict of a former bogous *implementation*, that still relied on them because other well-formness conditions were not satistified, or not well defined in the earlier attempts to define the UCA...). That this is not even needed for computing "composite weights" (which is not defining new weights, but an attempt to encode them in a larger space: this can be done completely outside the standard algorithm itself: just allow weights to be rational numbers, it is then easy to extend the number of encodable weights as a single number without increasing the numeric range in which they are defined; then leave the encoder of the sort key generator store them with a convenient "prefix coding", using one or more code units of arbitrary length). Philippe. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 02:24:57 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 09:24:57 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: On 03/11/2018 23:50, James Kass via Unicode wrote: > > When the topic being discussed no longer matches the thread title, > somebody should start a new thread with an appropriate thread title. > Yes, that is what also the OP called for, but my last reply though taking me some time to write was sent without checking the new mail, so unfortunately it didn?t acknowledge. So let?s start this new thread to account for Philippe Verdy?s proposal to encode a new format control. But all what I can add so far prior to probably stepping out of this discussion is that the industry does not seem to be interested in this initiative. Why do I think so? As already discussed on this List, even the long-existing FRACTION SLASH U+2044 has not been implemented by major vendors, except that HarfBuzz does implement it and makes its specified behavior available in environments using HarfBuzz, among which some major vendors? products are actually available with HarfBuzz support. As a result, the Polish abbreviation of Magister as found on the postcard, and all other abbreviations using superscript that have been put into parallel in the parent thread, cannot be reliably encoded without using preformatted superscript, so far as the goal is a plain text backbone being in the benefit of reliable rendering support, rather than a semantic-centered coding that may be easier to parse by special applications but lacks wider industrial support. 
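Concretely, the preformatted-superscript route is already expressible in plain text today; a minimal sketch (Python; "Mʳ" is only an illustrative spelling, whatever the exact letters on the postcard were):

    import unicodedata

    # One possible plain-text spelling using a preformatted superscript:
    # "M" followed by U+02B3 MODIFIER LETTER SMALL R.
    abbrev = "M\u02B3"

    for ch in abbrev:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+02B3 carries a <super> compatibility decomposition, so a compatibility
    # fold (as used in searching or loose matching) recovers the base letter:
    print(unicodedata.normalize("NFKD", abbrev))   # -> Mr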
If nevertheless, is encoded and will gain traction, or rather reversely: if it gains traction and will be encoded (I don?t know which way around to put it, given U+2044 has been encoded but one still cannot seem to be able to call it widely implemented), I would surely add it on keyboard layouts if I will still be maintaining any in that era. Best regards, Marcel From unicode at unicode.org Sun Nov 4 02:27:05 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 4 Nov 2018 09:27:05 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Philippe, I agree that we could have structured the UCA differently. It does make sense, for example, to have the weights be simply decimal values instead of integers. But nobody is going to go through the substantial work of restructuring the UCA spec and data file unless there is a very strong reason to do so. It takes far more time and effort than people realize to change in the algorithm/data while making sure that everything lines up without inadvertent changes being introduced. It is just not worth the effort. There are so, so, many things we can do in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher benefit. You can continue flogging this horse all you want, but I'm muting this thread (and I suspect I'm not the only one). Mark On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > >> >> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >> >> I was replying not about the notational repreentation of the DUCET data >> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. >> Which remains highly confusive, and contains completely unnecesary steps, >> and just complicates things with absoiluytely no benefit at all by >> introducing confusion about these "0000". >> >> Sorry, Philippe, but the confusion that I am seeing introduced is what >> you are introducing to the unicode list in the course of this discussion. >> >> >> UTR#10 still does not explicitly state that its use of "0000" does not >> mean it is a valid "weight", it's a notation only >> >> No, it is explicitly a valid weight. And it is explicitly and normatively >> referred to in the specification of the algorithm. See UTS10-D8 (and >> subsequent definitions), which explicitly depend on a definition of "A >> collation weight whose value is zero." The entire statement of what are >> primary, secondary, tertiary, etc. collation elements depends on that >> definition. And see the tables in Section 3.2, which also depend on those >> definitions. >> >> (but the notation is used for TWO distinct purposes: one is for >> presenting the notation format used in the DUCET >> >> It is *not* just a notation format used in the DUCET -- it is part of the >> normative definitional structure of the algorithm, which then percolates >> down into further definitions and rules and the steps of the algorithm. >> > > I insist that this is NOT NEEDED at all for the definition, it is > absolutely NOT structural. The algorithm still guarantees the SAME result. > > It is ONLY used to explain the format of the DUCET and the fact the this > format does NOT use 0000 as a valid weight, ans os can use it as a notation > (in fact only a presentational feature). 
> > >> itself to present how collation elements are structured, the other one is >> for marking the presence of a possible, but not always required, encoding >> of an explicit level separator for encoding sort keys). >> >> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >> is not part of the *notation* for collation elements, but instead is a >> magic value chosen for the level separator precisely because zero values >> from the collation elements are removed during sort key construction, so >> that zero is then guaranteed to be a lower value than any remaining weight >> added to the sort key under construction. This part of the algorithm is not >> rocket science, by the way! >> > > Here again you make a confusion: a sort key MAY use them as separators if > it wants to compress keys by reencoding weights per level: that's the only > case where you may want to introduce an encoding pattern starting with 0, > while the rest of the encoding for weights in that level must using > patterns not starting by this 0 (the number of bits to encode this 0 does > not matter: it is only part of the encoding used on this level which does > not necessarily have to use 16-bit code units per weight. > >> >> Even the example tables can be made without using these "0000" (for >> example in tables showing how to build sort keys, it can present the list >> of weights splitted in separate columns, one column per level, without any >> "0000". The implementation does not necessarily have to create a buffer >> containing all weight values in a row, when separate buffers for each level >> is far superior (and even more efficient as it can save space in memory). >> >> The UCA doesn't *require* you to do anything particular in your own >> implementation, other than come up with the same results for string >> comparisons. >> > Yes I know, but the algorithm also does not require me to use these > invalid 0000 pseudo-weights, that the algorithm itself will always discard > (in a completely needless step)! > > >> That is clearly stated in the conformance clause of UTS #10. >> >> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance >> >> The step "S3.2" in the UCA algorithm should not even be there (it is made >> in favor an specific implementation which is not even efficient or optimal), >> >> That is a false statement. Step S3.2 is there to provide a clear >> statement of the algorithm, to guarantee correct results for string >> comparison. >> > > You're wrong, this statement is completely useless in all cases. There is > still the correct results for string comparison without them: a string > comparison can only compare valid weights for each level, it will not > compare any weight past the end of the text in any one of the two compared > strings, nowhere it will compare weights with one of them being 0, unless > this 0 is used as a "guard value" for the end of text and your compare loop > still continues scanning the longer string when the other string has > already ended (this case should be detected much earlier before > determineing the next collection boundary in the string and then computing > its weights for each level. > >> Section 9 of UTS #10 provides a whole lunch buffet of techniques that >> implementations can choose from to increase the efficiency of their >> implementations, as they deem appropriate. You are free to implement as you >> choose -- including techniques that do not require any level separators. 
>> You are, however, duly warned in: >> >> >> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators >> >> that "While this technique is relatively easy to implement, it can >> interfere with other compression methods." >> >> it complicates the algorithm with absoluytely no benefit at all); you can >> ALWAYS remove it completely and this still generates equivalent results. >> >> No you cannot ALWAYS remove it completely. Whether or not your >> implementation can do so, depends on what other techniques you may be using >> to increase performance, store shorter keys, or whatever else may be at >> stake in your optimization >> > I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However > you MAY ADD them ONLY when generating and encoding the sort keys, if the > encoding used really does compress the weights into smaller values: this is > the only case where you want to ADD a separator, internally only in the > binary key encoder, but but as part of the algorithm itself. > > If your key generation does not use any compression (in the simplest > implementations), then it can simply an directly concatenate all weights > with the same code units size (16-bit in the DUCET), without inserting any > additional 0000 code unit to separate them: your resulting sort key will > still not contain any 0000 code unit in any part for any level because the > algorithm already has excluded them. Finally this means that sort keys can > be stored in C-strings (terminated by null code units, instead of being > delimited by a separately encoded length property, but for C-strings where > code units are 8-bit, i.e. "char" in C, you still need an encoder to > convert the 16-bit binary weights into sequences of bytes not containing > any 00 byte: if this encoder is used, still you don't need any 00 separator > between encoded levels!). > > As all these 0000 weigths are unnecessary, then the current UCA algorithm > trying to introduce them needlessly is REALLY introducing unnecessary > confusion: values of weights NEVER need to be restricted. > > The only conditions that matter is that: > - all weights are *comparable* (sign does not even matter, they are not > even restricted to be numbers or even just integers) and that > - they are **fully ordered**, and that the fully ordered set of weights > (not necessarily an enumerable set or a discrete set, as this can the > continuous set of real numbers) > - and that the full set of weights is **fully partitioned** into distinct, > intervals (with no intersection between intervals, so intervals are also > comparable) > - that the highest interval will be used by weights in the primary level: > each partition is numbered (by the level: a positive integer between 1 and > L): you can compare the level numbers assigned to the partition in which > the weight is a member: if level(weight1) > level(weight2) (this is an > comparison of positive integers), then necessarily you may have weight1 < > weight2 (this is only comparing weights encoded arbitrarily and which can > still use a 0 value if you wish to use it to encode a valid weight for a > valid collation element at any level 1 to N; this is also the only > condition needed to respect rule WF2 in UCA). 
> > --- > Notes about encodings for weights in sort keys: > > If weights are chosen to be rational numbers, e.g any rational numbers in > the (0.0, 1.0) open interval, and because your collation algorithm will > only recognize a finite set of distinct collation elements with necessarily > a finite number N of distinct weights w(i), for i in 0..(N-1), allows the > collation weights to be represented by choosing them **arbitrarily** within > this open interval: > - this can be done simply by partitionning the (0.0 1.0) into N half-open > intervals [w(i), w(i+1)); > - and then encoding a weight w(i) by any **arbitrarily chosen rational** > inside one of these intervals (for example this can be done for using > compression with arithmetic coding). > > A weight encoding using a finite discrete set (of binary integers between > 0 and M-1) is what you need to use classic Huffman coding: this is > equivalent to multiplying the previous rationals and truncating them to the > nearest floor integer, but as this limits the choice of rational numbers > above so that distinct weights remain distinct with the binary encoding, > you need to keep more significant bits with Huffman coding than with > Arithmetic coding (i.e. you need a higher value of M; where M is typically > a power of 2 using 1-bit code units, or power of 256 for the simpler > encodings using 8-bit code units, or a power of 65536 for an uncompressed > encoding of 16-bit weight values). > > Arithmetic coding is in fact equivalent to Huffman coding, except that M > is not necessarily a positive integer but can be any positive rational and > can then represent each weigh value with a rational number of bits on > average, instead of a static integer number of bits. You can say as well > that Huffman coding is a restriction of Arithmetic coding where M must be > an integer, or that Arithmetic coding is a generalization of Huffman coding. > > Both the Huffman and Arithmetic codings are wellknown examples of "prefix > coding" (the latter offering a bit more compression, for the same > statistical distribution of encoded values). The open interval (0.0, w(0)) > is still not used at all to encode weights, but can still have a statistic > distribution, usable with the prefix encoding to represent the end of > string. But here again this does not represent the artificial 0000 weight > which is NEVER encoded anywhere. > > --- > > Ask to a mathematician you trust, he will confirm that these rules > speaking about the pseudo-weight 0000 in UCA are completely unnecessary > (i.e. removing them from the algorithm does not change the result for > comparing strings, or for generating sort keys) > And as a conclusion, attempting to introduce them in the standard creates > more confusion than it helps (in fact it is most probably a relict of a > former bogous *implementation*, that still relied on them because other > well-formness conditions were not satistified, or not well defined in the > earlier attempts to define the UCA...). 
That this is not even needed for > computing "composite weights" (which is not defining new weights, but an > attempt to encode them in a larger space: this can be done completely > outside the standard algorithm itself: just allow weights to be rational > numbers, it is then easy to extend the number of encodable weights as a > single number without increasing the numeric range in which they are > defined; then leave the encoder of the sort key generator store them with a > convenient "prefix coding", using one or more code units of arbitrary > length). > > Philippe. > >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Nov 4 10:45:08 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 4 Nov 2018 17:45:08 +0100
Subject: Encoding (was: Re: A sign/abbreviation for "magister")
In-Reply-To:
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com>
Message-ID:

Note that I actually propose not just one rendering for the <combining abbreviation mark> but two possible variants (that would be equally valid, without preference). Use it after any base cluster (including with diacritics if needed, like combining underlines):

- the first one can be to render the previous cluster as superscript (very easy to implement synthetically by any text renderer);
- the second one can be to render it as an abbreviation dot (also very easy to do).

Fonts can provide their own mapping (e.g. to offer alternate glyph forms or kerning for the superscript; they can also reuse the letter forms used for other existing and encoded superscript letters, or position the abbreviation dot with negative kerning, for example after a T), in which case the renderer does not have to synthesize the rendering for the combining sequence not mapped in the font.

Allowing this variation from the start will:

- allow renderers to support it fast (so a rapid adoption for encoding texts in human languages, instead of the few legacy superscript letters);
- allow font designers to develop and provide reasonable mappings if needed (to adjust the position or size of the superscript) in updated fonts (no requirement for them to add new glyphs if it's just to map the same glyphs used by existing superscript letters);
- also prohibit the abuse of this mark for every text that one would want to write in superscript (these cases can still use the few existing superscript letters/digits/signs that are already encoded), so it is not suitable for example for marking mathematical exponents (e.g. "x²": if it were encoded as <2 + combining abbreviation mark>, it could validly be rendered as "x2."): exponents must use superscript (either the already encoded superscript characters, or external styles like in HTML/CSS, or LaTeX, which uses the notation "x^2"), both as a style but also with the intended semantic of an exponent, and certainly not the intended semantic of an abbreviation.

Le dim. 4 nov. 2018 à 09:34, Marcel Schneider via Unicode < unicode at unicode.org> a écrit :

> On 03/11/2018 23:50, James Kass via Unicode wrote:
> >
> > When the topic being discussed no longer matches the thread title,
> > somebody should start a new thread with an appropriate thread title.
> > > > Yes, that is what also the OP called for, but my last reply though > taking me some time to write was sent without checking the new mail, > so unfortunately it didn?t acknowledge. So let?s start this new thread > to account for Philippe Verdy?s proposal to encode a new format control. > > But all what I can add so far prior to probably stepping out of this > discussion is that the industry does not seem to be interested in this > initiative. Why do I think so? As already discussed on this List, even > the long-existing FRACTION SLASH U+2044 has not been implemented by > major vendors, except that HarfBuzz does implement it and makes its > specified behavior available in environments using HarfBuzz, among > which some major vendors? products are actually available with > HarfBuzz support. > > As a result, the Polish abbreviation of Magister as found on the > postcard, and all other abbreviations using superscript that have > been put into parallel in the parent thread, cannot be reliably > encoded without using preformatted superscript, so far as the goal > is a plain text backbone being in the benefit of reliable rendering > support, rather than a semantic-centered coding that may be easier > to parse by special applications but lacks wider industrial support. > > If nevertheless, is encoded and will > gain traction, or rather reversely: if it gains traction and will be > encoded (I don?t know which way around to put it, given U+2044 has > been encoded but one still cannot seem to be able to call it widely > implemented), I would surely add it on keyboard layouts if I will > still be maintaining any in that era. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 11:34:29 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 18:34:29 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> On 04/11/2018 17:45, Philippe Verdy wrote: > > Note that I actually propose not just one rendering for the > but two possible variants (that would > be equally valid withou preference). Use it after any base cluster > (including with diacritics if needed, like combining underlines). > > - the first one can be to render the previous cluster as superscript > (very easy to do implement synthetically by any text renderer) > > - the second one can be to render it as an abbreviation dot (also > very easy to) > > Fonts can provide their own mapping (e.g. to offer alternate glyph > forms or kerning for the superscript, they can also reuse the leter > forms used for other existing and encoded superscript letters, or to > position the abbreviation dot with negative kerning, for example > after a T), in which case the renderer does not have to synthetize > the rendering for the sequence combining sequence not mapped in the > font. > > Allowing this variation from the start will: > > - allow renderers to support it fast (so a rapid adoption for > encoding texts in humane languages, instead of the few legacy > superscript letters). 
> > - allow font designers to develop and provide reasonnable mappings if > needed (to adjust the position or size of the superscript) in updated > fonts (no requirement for them to add new glyphs if it's just to map > the same glyphs used by existing superscript letters) > > - also prohibit the abuse of this mark for every text that one would > would to write in superscript (these cases can still uses the few > existing superscript letters/digits/signs that are already encoded), > so this is not suitable for example for marking mathematical > exponents (e.g. "x?", if it's encoded as mark> could validly be rendered as "x2."): exponents must use the > superscript (either the already encoded ones, or using external > styles like in HTML/CSS, or in LaTeX which uses the notation "x^2", > both as a style, but also some intended semantic of an exponent and > certainly not the intended semantic of an abbreviation) Unicode always (or in principle) aims at polyvalence, making characters reusable and repurposable, while the combining abbreviation mark does not solve the problems around making chemicals better represented in plain text as seen in the parent thread, for example. I don?t advocate this use case, as I?m only lobbying for natural languages? support as specified in the Standard,* but it shouldn?t be forgotten given there is some point in not disfavoring chemistry compared to mathematics, that is already widely favored over chemistry when looking at the symbol blocks, while chemistry is denied three characters because they are subscript forms of already encoded letters. Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it needs OpenType support to work, while direct encoding of preformatted superscripts and use as abbreviation indicators for an interoperable digital representation of natural languages does not. Best regards, Marcel * As already repeatedly stated, I?m taking the one bit where TUS states that all natural languages shall be given a semantically unambiguous (ie not introducing new ambiguity) and interoperable digital representation. From unicode at unicode.org Sun Nov 4 11:42:22 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 18:42:22 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: Sorry, I didn?t truncate the subject line, it was my mail client. On 04/11/2018 17:45, Philippe Verdy wrote: > > Note that I actually propose not just one rendering for the > but two possible variants (that would > be equally valid withou preference). Use it after any base cluster > (including with diacritics if needed, like combining underlines). > > - the first one can be to render the previous cluster as superscript > (very easy to do implement synthetically by any text renderer) > > - the second one can be to render it as an abbreviation dot (also > very easy to) > > Fonts can provide their own mapping (e.g. 
to offer alternate glyph > forms or kerning for the superscript, they can also reuse the leter > forms used for other existing and encoded superscript letters, or to > position the abbreviation dot with negative kerning, for example > after a T), in which case the renderer does not have to synthetize > the rendering for the sequence combining sequence not mapped in the > font. > > Allowing this variation from the start will: > > - allow renderers to support it fast (so a rapid adoption for > encoding texts in humane languages, instead of the few legacy > superscript letters). > > - allow font designers to develop and provide reasonnable mappings if > needed (to adjust the position or size of the superscript) in updated > fonts (no requirement for them to add new glyphs if it's just to map > the same glyphs used by existing superscript letters) > > - also prohibit the abuse of this mark for every text that one would > would to write in superscript (these cases can still uses the few > existing superscript letters/digits/signs that are already encoded), > so this is not suitable for example for marking mathematical > exponents (e.g. "x?", if it's encoded as mark> could validly be rendered as "x2."): exponents must use the > superscript (either the already encoded ones, or using external > styles like in HTML/CSS, or in LaTeX which uses the notation "x^2", > both as a style, but also some intended semantic of an exponent and > certainly not the intended semantic of an abbreviation) Unicode always (or in principle) aims at polyvalence, making characters reusable and repurposable, while the combining abbreviation mark does not solve the problems around making chemicals better represented in plain text as seen in the parent thread, for example. I don?t advocate this use case, as I?m only lobbying for natural languages? support as specified in the Standard,* but it shouldn?t be forgotten given there is some point in not disfavoring chemistry compared to mathematics, that is already widely favored over chemistry when looking at the symbol blocks, while chemistry is denied three characters because they are subscript forms of already encoded letters. Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it needs OpenType support to work, while direct encoding of preformatted superscripts and use as abbreviation indicators for an interoperable digital representation of natural languages does not. Best regards, Marcel * As already repeatedly stated, I?m taking the one bit where TUS states that all natural languages shall be given a semantically unambiguous (ie not introducing new ambiguity) and interoperable digital representation. From unicode at unicode.org Sun Nov 4 12:54:37 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 19:54:37 +0100 Subject: Encoding In-Reply-To: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Le dim. 4 nov. 2018 ? 
18:34, Marcel Schneider a ?crit : > On 04/11/2018 17:45, Philippe Verdy wrote: > Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it > needs OpenType support to work, while direct encoding of preformatted > superscripts and use as abbreviation indicators for an interoperable > digital representation of natural languages does not. > No OpenScript is required. I already propose that a correct rendering of this mark is a simple dot added to the right of the cluster (if this cluster is LTR) or to the left (if the cluster is RTL). It just has to convey the fact that it occurs to mean an abbreviation. The mark to render (when not rendering the superscript) is left to each font design (a font made for another script than Latin, Greek, Cyrillic can use another convenient abbreviation mark suitable for that script and that avoids the confusion with other dot-like combining marks used in that script, and it may be placed elsewhere than to the right or left of the cluster that it modifies) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 13:19:55 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 20:19:55 +0100 Subject: Encoding In-Reply-To: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a ?crit : > On 04/11/2018 17:45, Philippe Verdy wrote: > Marcel > * As already repeatedly stated, I?m taking the one bit where TUS states > that all natural languages shall be given a semantically unambiguous (ie > not introducing new ambiguity) and interoperable digital representation. > I also support the sermantically unambiguous digital representation of all natural languages. Interoperability is always limited, even for existing script (including Latin), that's why text renderers (and fonts) constantly need new developments (but that does not need that these developments will be deployed). That's why we have to document reasonnable fallbacks for rendering on limited platforms, each time this is possible (and in this case this is clearly possible with extremely low efforts). Even the mere fallback to render the as a dotted circle (total absence of support) will not block completely reading the abbreviation: * you'll see "2e?" (which is still better than only "2e", with minimal impact) instead of * "2?" (which is worse ! this is still what already happens when you use the legacy encoded which is also semantically ambiguous for text processing), or * "2e." (which is acceptable for rendering but ambiguous semantically for text processing) So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE than using (which is also impossible for noting all abbrevations as it is limited to just a few letters, and most of the time limited to only the few lowercase IPA symbols). It puts an end to the pressure to encode superscript letters. If you want to support other notations (e.g. 
in chemical or mathematics notations, where both superscript and subscript must be present and stack together, and where the allowed varaition using a dot or similar) you need another encoding and the existing legacy are not suitable as well. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 13:51:33 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 20:51:33 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Note also that some other scripts have their own dedicated "abbreviation mark" encoded, but as distinctive punctuations or modifier letters: they are NOT combining. I do not advocate changing these scripts at all. As well I don't propose to instruct authors to use an after Latin/Greek/Letters/Arabic/Hebrew letters used in abbreviations. This would be non-sense, including visually, even if you can infer some semantics, as meaning this is effectively an abbreviation for text processing (this is still non-senses because this breaks existing segregations of scripts, delimitation of clusters, line breaking opportunities, and so on; and this approach would break because these can legally occur in isolation, without being necessarily attached to the previous cluster to modify it: the previous cluster, before the could be for example a whitespace, or a quotation mark) I don't propose the as being suitable for mathematics exponents and Chemical notations (they still need something else to allow their superscript and subscripts to stack below each other, and the variation of explicitly permitting it to be rendered as a dot or another suitable mark, depending on the base character of the combining sequence, is NOT suitable for these mathematics or chemical notations). Once again you need something else for these technical notations, but NOT the proposed , and NOT EVEN the existing "modifier letters" , which were in fact first introduced only for IPA lowercase symbols, with some of them being then turned as "plain lowercase letters" in alphabets of some natural languages that have been recently romanized by borrowing IPA symbols (notably in Africa, where the initial letters borrowed from IPA, or some new specific letter variants with additional hooks, opening or strokes, were then followed by the addition of separate capital letters: these letters are NOT conveying any semantic of an abbreviation, and this is also NOT the case for their usage as IPA symbols). There's NO interoperability at all when taking **abusively** the existing "modifier letters" or for use in abbreviations (or even in technical notations in maths or chemical formulas, where they DON'T work the way they should when used with subscripts, and cannot represent multiple layers of subscripts, e.g. for expressions like "2^2^2" in LaTeX for maths). Keep these "modifier letters" or or for use as plain letters or plain digits or plain punctuation or plain symbols (including IPA) in natural languages. Anything else is abusive ans hould be considered only as "legacy" encoding, not recommended at all in natural languages. Le dim. 4 nov. 2018 ? 
20:19, Philippe Verdy a ?crit : > > > Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a > ?crit : > >> On 04/11/2018 17:45, Philippe Verdy wrote: >> Marcel >> * As already repeatedly stated, I?m taking the one bit where TUS states >> that all natural languages shall be given a semantically unambiguous (ie >> not introducing new ambiguity) and interoperable digital representation. >> > > I also support the sermantically unambiguous digital representation of all > natural languages. > Interoperability is always limited, even for existing script (including > Latin), that's why text renderers (and fonts) constantly need new > developments (but that does not need that these developments will be > deployed). > That's why we have to document reasonnable fallbacks for rendering on > limited platforms, each time this is possible (and in this case this is > clearly possible with extremely low efforts). > > Even the mere fallback to render the as a > dotted circle (total absence of support) will not block completely reading > the abbreviation: > * you'll see "2e?" (which is still better than only "2e", with minimal > impact) instead of > * "2?" (which is worse ! this is still what already happens when you use > the legacy encoded which is also semantically ambiguous for > text processing), or > * "2e." (which is acceptable for rendering but ambiguous semantically for > text processing) > > So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE > than using (which is also impossible for > noting all abbrevations as it is limited to just a few letters, and most of > the time limited to only the few lowercase IPA symbols). It puts an end to > the pressure to encode superscript letters. > > If you want to support other notations (e.g. in chemical or > mathematics notations, where both superscript and subscript must be present > and stack together, and where the allowed varaition using a dot or similar) > you need another encoding and the existing legacy letters> are not suitable as well. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 14:59:08 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 21:59:08 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: I can take another example about what I call "legacy encoding" (which really means that such encoding is just an "approximation" from which no semantic can be clearly infered, except by using a non-determinist heuristic, which can frequently make "false guesses"). 
Consider the case of the legacy Hangul "half-width" jamos: they were kept in Unicode (as compatibility characters) but are not recommended for encoding natural Korean text, because their semantics are not clear when they are used in sequences: it is impossible to know clearly where semantically significant syllable breaks occur, because they do not distinguish the "leading" and "trailing" consonants, and so it is not even possible to clearly infer that a Hangul "half-width" vowel jamo is logically attached to the same syllable as the "half-width" consonant (or consonant+vowel) jamo encoded just before it. As a consequence, you cannot safely convert Korean texts using these "half-width" jamos into normal jamos: only a heuristic can attempt to determine the syllable breaks and then infer the "leading" or "trailing" semantic of consonants. This last semantic ("leading" or "trailing") is exactly like a letter-case distinction in Latin, so it can be said that the Korean alphabet is bicameral for consonants but monocameral for vowels, where each Hangul syllable normally starts with an "uppercase-like" consonant, or with a consonant filler which is also "uppercase-like", and all other consonants and all vowels are "lowercase-like". The heuristic that transforms the legacy "half-width" jamos into normal jamos does just the same thing as the heuristic used in Latin that attempts to capitalize some leading letters in words: it works frequently, but it also fails, and that heuristic is lossy in Latin just as it is lossy in Korean!

The same can be said about the heuristics that attempt to infer an abbreviation semantic from existing superscript letters (either encoded in Unicode, or encoded as plain letters modified by superscripting style in CSS or HTML, or in word processors for example): they fail to give the correct guess much of the time if there is no user to confirm the actual intended meaning.
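A minimal sketch of the half-width case (my own toy example using Python's unicodedata, nothing normative): compatibility normalization maps every half-width consonant to a leading (choseong) conjoining jamo, so it always guesses "leading", and recovering a trailing consonant is exactly the lossy heuristic step described above.

    import unicodedata as ud

    # The syllable GAN spelled with half-width jamo KIYEOK, A, NIEUN; nothing
    # in this sequence says whether NIEUN closes this syllable or opens the
    # next one.
    halfwidth = "\uFFA1\uFFC2\uFFA4"

    # NFKC maps each half-width consonant to a *leading* conjoining jamo, so
    # it yields GA plus a dangling leading NIEUN instead of the syllable GAN:
    print([f"U+{ord(c):04X}" for c in ud.normalize("NFKC", halfwidth)])
    # ['U+AC00', 'U+1102']

    # Only conjoining jamos distinguish leading from trailing; choosing
    # between them is the heuristic (and sometimes wrong) guess:
    print(ud.normalize("NFC", "\u1100\u1161\u11AB"))  # one syllable, NIEUN read as trailing
    print(ud.normalize("NFC", "\u1100\u1161\u1102"))  # GA plus a leading NIEUN of a next syllable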
In all cases, the user/author has full control of the intended meaning of his text and an informed decision is made where all cases are now distinguished. "Legacy" encoding can be kept as is (in Unicode), even if it's no longer recommended, just like Unicode has documented that half-width Hangul is deprecated (it just offers a "compatibility decomposition" for NFKD or NFKC, but this is lossy and cannot be done automatically without a human decision). And the user/author can now freely and easily compose any abbreviation he wishes in natural languages, without being limited by the reduced "legacy" set of encoded in Unicode (which should no longer be extended, except for use as distinct plain letters needed in alphabets of actual natural languages, or as possibly new IPA symbols), and without using the styling tricks (of HTML/CSS, or of word processor documents, spreadsheets, presentation documents allowing "'rich text" formats on top of "plain text") which are best suitable for "free styling" of any human text, without any additional semantics, (or as a legacy but insufficient trick for maths and chemical notations). Le dim. 4 nov. 2018 ? 20:51, Philippe Verdy a ?crit : > Note also that some other scripts have their own dedicated "abbreviation > mark" encoded, but as distinctive punctuations or modifier letters: they > are NOT combining. I do not advocate changing these scripts at all. > > As well I don't propose to instruct authors to use an mark> after Latin/Greek/Letters/Arabic/Hebrew letters used in > abbreviations. This would be non-sense, including visually, even if you can > infer some semantics, as meaning this is effectively an abbreviation for > text processing (this is still non-senses because this breaks existing > segregations of scripts, delimitation of clusters, line breaking > opportunities, and so on; and this approach would break because these > can legally occur in isolation, without being > necessarily attached to the previous cluster to modify it: the previous > cluster, before the could be for example a > whitespace, or a quotation mark) > > I don't propose the as being suitable for > mathematics exponents and Chemical notations (they still need something > else to allow their superscript and subscripts to stack below each other, > and the variation of explicitly permitting it > to be rendered as a dot or another suitable mark, depending on the base > character of the combining sequence, is NOT suitable for these mathematics > or chemical notations). > > Once again you need something else for these technical notations, but NOT > the proposed , and NOT EVEN the existing > "modifier letters" , which were in fact first > introduced only for IPA lowercase symbols, with some of them being then > turned as "plain lowercase letters" in alphabets of some natural languages > that have been recently romanized by borrowing IPA symbols (notably in > Africa, where the initial letters borrowed from IPA, or some new specific > letter variants with additional hooks, opening or strokes, were then > followed by the addition of separate capital letters: these letters are NOT > conveying any semantic of an abbreviation, and this is also NOT the case > for their usage as IPA symbols). 
> > There's NO interoperability at all when taking **abusively** the existing > "modifier letters" or for use in > abbreviations (or even in technical notations in maths or chemical > formulas, where they DON'T work the way they should when used with > subscripts, and cannot represent multiple layers of subscripts, e.g. for > expressions like "2^2^2" in LaTeX for maths). Keep these "modifier letters" > or or for use as plain > letters or plain digits or plain punctuation or plain symbols (including > IPA) in natural languages. Anything else is abusive ans hould be considered > only as "legacy" encoding, not recommended at all in natural languages. > > > > Le dim. 4 nov. 2018 ? 20:19, Philippe Verdy a ?crit : > >> >> >> Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a >> ?crit : >> >>> On 04/11/2018 17:45, Philippe Verdy wrote: >>> Marcel >>> * As already repeatedly stated, I?m taking the one bit where TUS states >>> that all natural languages shall be given a semantically unambiguous (ie >>> not introducing new ambiguity) and interoperable digital representation. >>> >> >> I also support the sermantically unambiguous digital representation of >> all natural languages. >> Interoperability is always limited, even for existing script (including >> Latin), that's why text renderers (and fonts) constantly need new >> developments (but that does not need that these developments will be >> deployed). >> That's why we have to document reasonnable fallbacks for rendering on >> limited platforms, each time this is possible (and in this case this is >> clearly possible with extremely low efforts). >> >> Even the mere fallback to render the as a >> dotted circle (total absence of support) will not block completely reading >> the abbreviation: >> * you'll see "2e?" (which is still better than only "2e", with minimal >> impact) instead of >> * "2?" (which is worse ! this is still what already happens when you use >> the legacy encoded which is also semantically ambiguous for >> text processing), or >> * "2e." (which is acceptable for rendering but ambiguous semantically for >> text processing) >> >> So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE >> than using (which is also impossible for >> noting all abbrevations as it is limited to just a few letters, and most of >> the time limited to only the few lowercase IPA symbols). It puts an end to >> the pressure to encode superscript letters. >> >> If you want to support other notations (e.g. in chemical or >> mathematics notations, where both superscript and subscript must be present >> and stack together, and where the allowed varaition using a dot or similar) >> you need another encoding and the existing legacy > letters> are not suitable as well. >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 15:51:24 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 22:51:24 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: So you finally admit that I was right... And that the specs include requirements that are not even needed to make UCA work, and that not even used by wellknown implementations. These are old artefacts which are now really confusive (instructing programmers to adopt the old deprecated behavior, before realizing that this was a bad advice which jut complicated their task). 
UCA can be implemented **conformingly** without these, even for the simplest implementations (where using complex packages like ICU is not an option and rewriting it is not one as well for much simpler goals) where these incorrect requirements are in fact suggesting to be more inefficient than really needed. There's not a lot of work to edit and to fix the specs without these polluting 0000 "pseudo-weights". Le dim. 4 nov. 2018 ? 09:27, Mark Davis ?? a ?crit : > Philippe, I agree that we could have structured the UCA differently. It > does make sense, for example, to have the weights be simply decimal values > instead of integers. But nobody is going to go through the substantial > work of restructuring the UCA spec and data file unless there is a very > strong reason to do so. It takes far more time and effort than people > realize to change in the algorithm/data while making sure that everything > lines up without inadvertent changes being introduced. > > It is just not worth the effort. There are so, so, many things we can do > in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher > benefit. > > You can continue flogging this horse all you want, but I'm muting this > thread (and I suspect I'm not the only one). > > Mark > > > On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : >> >>> >>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >>> >>> I was replying not about the notational repreentation of the DUCET data >>> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. >>> Which remains highly confusive, and contains completely unnecesary steps, >>> and just complicates things with absoiluytely no benefit at all by >>> introducing confusion about these "0000". >>> >>> Sorry, Philippe, but the confusion that I am seeing introduced is what >>> you are introducing to the unicode list in the course of this discussion. >>> >>> >>> UTR#10 still does not explicitly state that its use of "0000" does not >>> mean it is a valid "weight", it's a notation only >>> >>> No, it is explicitly a valid weight. And it is explicitly and >>> normatively referred to in the specification of the algorithm. See UTS10-D8 >>> (and subsequent definitions), which explicitly depend on a definition of "A >>> collation weight whose value is zero." The entire statement of what are >>> primary, secondary, tertiary, etc. collation elements depends on that >>> definition. And see the tables in Section 3.2, which also depend on those >>> definitions. >>> >>> (but the notation is used for TWO distinct purposes: one is for >>> presenting the notation format used in the DUCET >>> >>> It is *not* just a notation format used in the DUCET -- it is part of >>> the normative definitional structure of the algorithm, which then >>> percolates down into further definitions and rules and the steps of the >>> algorithm. >>> >> >> I insist that this is NOT NEEDED at all for the definition, it is >> absolutely NOT structural. The algorithm still guarantees the SAME result. >> >> It is ONLY used to explain the format of the DUCET and the fact the this >> format does NOT use 0000 as a valid weight, ans os can use it as a notation >> (in fact only a presentational feature). >> >> >>> itself to present how collation elements are structured, the other one >>> is for marking the presence of a possible, but not always required, >>> encoding of an explicit level separator for encoding sort keys). 
>>> >>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >>> is not part of the *notation* for collation elements, but instead is a >>> magic value chosen for the level separator precisely because zero values >>> from the collation elements are removed during sort key construction, so >>> that zero is then guaranteed to be a lower value than any remaining weight >>> added to the sort key under construction. This part of the algorithm is not >>> rocket science, by the way! >>> >> >> Here again you make a confusion: a sort key MAY use them as separators if >> it wants to compress keys by reencoding weights per level: that's the only >> case where you may want to introduce an encoding pattern starting with 0, >> while the rest of the encoding for weights in that level must using >> patterns not starting by this 0 (the number of bits to encode this 0 does >> not matter: it is only part of the encoding used on this level which does >> not necessarily have to use 16-bit code units per weight. >> >>> >>> Even the example tables can be made without using these "0000" (for >>> example in tables showing how to build sort keys, it can present the list >>> of weights splitted in separate columns, one column per level, without any >>> "0000". The implementation does not necessarily have to create a buffer >>> containing all weight values in a row, when separate buffers for each level >>> is far superior (and even more efficient as it can save space in memory). >>> >>> The UCA doesn't *require* you to do anything particular in your own >>> implementation, other than come up with the same results for string >>> comparisons. >>> >> Yes I know, but the algorithm also does not require me to use these >> invalid 0000 pseudo-weights, that the algorithm itself will always discard >> (in a completely needless step)! >> >> >>> That is clearly stated in the conformance clause of UTS #10. >>> >>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance >>> >>> The step "S3.2" in the UCA algorithm should not even be there (it is >>> made in favor an specific implementation which is not even efficient or >>> optimal), >>> >>> That is a false statement. Step S3.2 is there to provide a clear >>> statement of the algorithm, to guarantee correct results for string >>> comparison. >>> >> >> You're wrong, this statement is completely useless in all cases. There is >> still the correct results for string comparison without them: a string >> comparison can only compare valid weights for each level, it will not >> compare any weight past the end of the text in any one of the two compared >> strings, nowhere it will compare weights with one of them being 0, unless >> this 0 is used as a "guard value" for the end of text and your compare loop >> still continues scanning the longer string when the other string has >> already ended (this case should be detected much earlier before >> determineing the next collection boundary in the string and then computing >> its weights for each level. >> >>> Section 9 of UTS #10 provides a whole lunch buffet of techniques that >>> implementations can choose from to increase the efficiency of their >>> implementations, as they deem appropriate. You are free to implement as you >>> choose -- including techniques that do not require any level separators. 
>>> You are, however, duly warned in: >>> >>> >>> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators >>> >>> that "While this technique is relatively easy to implement, it can >>> interfere with other compression methods." >>> >>> it complicates the algorithm with absoluytely no benefit at all); you >>> can ALWAYS remove it completely and this still generates equivalent results. >>> >>> No you cannot ALWAYS remove it completely. Whether or not your >>> implementation can do so, depends on what other techniques you may be using >>> to increase performance, store shorter keys, or whatever else may be at >>> stake in your optimization >>> >> I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However >> you MAY ADD them ONLY when generating and encoding the sort keys, if the >> encoding used really does compress the weights into smaller values: this is >> the only case where you want to ADD a separator, internally only in the >> binary key encoder, but but as part of the algorithm itself. >> >> If your key generation does not use any compression (in the simplest >> implementations), then it can simply an directly concatenate all weights >> with the same code units size (16-bit in the DUCET), without inserting any >> additional 0000 code unit to separate them: your resulting sort key will >> still not contain any 0000 code unit in any part for any level because the >> algorithm already has excluded them. Finally this means that sort keys can >> be stored in C-strings (terminated by null code units, instead of being >> delimited by a separately encoded length property, but for C-strings where >> code units are 8-bit, i.e. "char" in C, you still need an encoder to >> convert the 16-bit binary weights into sequences of bytes not containing >> any 00 byte: if this encoder is used, still you don't need any 00 separator >> between encoded levels!). >> >> As all these 0000 weigths are unnecessary, then the current UCA algorithm >> trying to introduce them needlessly is REALLY introducing unnecessary >> confusion: values of weights NEVER need to be restricted. >> >> The only conditions that matter is that: >> - all weights are *comparable* (sign does not even matter, they are not >> even restricted to be numbers or even just integers) and that >> - they are **fully ordered**, and that the fully ordered set of weights >> (not necessarily an enumerable set or a discrete set, as this can the >> continuous set of real numbers) >> - and that the full set of weights is **fully partitioned** into >> distinct, intervals (with no intersection between intervals, so intervals >> are also comparable) >> - that the highest interval will be used by weights in the primary level: >> each partition is numbered (by the level: a positive integer between 1 and >> L): you can compare the level numbers assigned to the partition in which >> the weight is a member: if level(weight1) > level(weight2) (this is an >> comparison of positive integers), then necessarily you may have weight1 < >> weight2 (this is only comparing weights encoded arbitrarily and which can >> still use a 0 value if you wish to use it to encode a valid weight for a >> valid collation element at any level 1 to N; this is also the only >> condition needed to respect rule WF2 in UCA). 
>> ---
>> Notes about encodings for weights in sort keys:
>>
>> If weights are chosen to be rational numbers, e.g. any rationals in the
>> open interval (0.0, 1.0), then -- because a collation algorithm only
>> recognizes a finite set of distinct collation elements, hence a finite
>> number N of distinct weights w(i), for i in 0..(N-1) -- the collation
>> weights can be represented by choosing them **arbitrarily** within this
>> open interval:
>> - this can be done simply by partitioning (0.0, 1.0) into N half-open
>>   intervals [w(i), w(i+1));
>> - and then encoding a weight w(i) by any **arbitrarily chosen rational**
>>   inside one of these intervals (for example, to compress with arithmetic
>>   coding).
>>
>> A weight encoding using a finite discrete set (of binary integers between
>> 0 and M-1) is what you need for classic Huffman coding: this is equivalent
>> to multiplying the previous rationals by M and truncating to the floor
>> integer, but since this limits the choice of rationals above so that
>> distinct weights remain distinct in the binary encoding, you need to keep
>> more significant bits with Huffman coding than with arithmetic coding
>> (i.e. you need a higher value of M, where M is typically a power of 2 for
>> 1-bit code units, a power of 256 for the simpler encodings using 8-bit
>> code units, or a power of 65536 for an uncompressed encoding of 16-bit
>> weight values).
>>
>> Arithmetic coding is in fact equivalent to Huffman coding, except that M
>> is not necessarily a positive integer but can be any positive rational,
>> and it can then represent each weight value with a rational number of bits
>> on average instead of a fixed integer number of bits. Equally, you can say
>> that Huffman coding is a restriction of arithmetic coding where M must be
>> an integer, or that arithmetic coding is a generalization of Huffman
>> coding.
>>
>> Both Huffman and arithmetic coding are well-known examples of "prefix
>> coding" (the latter offering a bit more compression for the same
>> statistical distribution of encoded values). The open interval (0.0, w(0))
>> is still not used at all to encode weights, but it can still be given a
>> statistical distribution, usable with the prefix coding to represent the
>> end of string. Here again, this does not represent the artificial 0000
>> weight, which is NEVER encoded anywhere.
>>
>> ---
>>
>> Ask a mathematician you trust: he will confirm that these rules about the
>> pseudo-weight 0000 in the UCA are completely unnecessary (i.e. removing
>> them from the algorithm does not change the result of comparing strings,
>> or of generating sort keys). As a conclusion, introducing them in the
>> standard creates more confusion than it helps (in fact it is most probably
>> a relic of a former bogus *implementation* that still relied on them
>> because other well-formedness conditions were not satisfied, or not well
>> defined, in earlier attempts to define the UCA).
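The C-string remark above can be illustrated with a much simpler encoder
than Huffman or arithmetic coding: a fixed-width re-encoding of 16-bit
weights into bytes that are never 0x00, so that the key can be stored and
compared as an ordinary NUL-terminated string. A minimal sketch in Python
(the three-bytes-per-weight scheme is deliberately naive and wasteful; it is
only meant to show that no 00 separator is needed):

    def nullfree_bytes(key):
        """Encode a tuple of 16-bit weights as bytes with no 0x00 byte, while
        preserving order under plain byte-wise comparison (memcmp/strcmp).
        Each weight becomes three base-255 digits, each offset by 1."""
        out = bytearray()
        for w in key:
            out += bytes(((w // (255 * 255)) % 255 + 1,
                          (w // 255) % 255 + 1,
                          w % 255 + 1))
        return bytes(out)

    k1 = nullfree_bytes((0x3001, 0x3002, 0x0201, 0x0201, 0x0101))
    k2 = nullfree_bytes((0x3001, 0x3002, 0x0202, 0x0201, 0x0101))
    assert 0 not in k1 and 0 not in k2   # safe to store as C strings
    assert k1 < k2                       # same order as the weight tuples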
>> This is not even needed for computing "composite weights" (which does not
>> define new weights, but attempts to encode them in a larger space): that
>> can be done completely outside the standard algorithm itself. Just allow
>> weights to be rational numbers; it is then easy to extend the number of
>> encodable weights, as a single number, without increasing the numeric
>> range in which they are defined; then let the encoder of the sort key
>> generator store them with a convenient "prefix coding", using one or more
>> code units of arbitrary length.
>>
>> Philippe.
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Sun Nov  4 16:30:31 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 4 Nov 2018 22:30:31 +0000
Subject: Arranging Hieroglyphics (was: A sign/abbreviation for "magister")
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
Message-ID: <20181104223031.13d9cc31@JRWUBU2>

On Sat, 3 Nov 2018 22:55:17 +0100
Philippe Verdy via Unicode wrote:

> I can also cite the case of Egyptian hieroglyphs: there's still no
> way to render them correctly, because we lack the development of a
> stable orthography that would drive the encoding of the missing
> **semantic** characters (for this reason Egyptian hieroglyphs still
> require an upper-layer protocol, as there's still no accepted
> orthographic norm that successfully represents all possible semantic
> variations, but also because research on old Egyptian hieroglyphs is
> still very incomplete).

If you study the document register, you'll find that layout control
characters are being added. I think semantic characters would have depended
on the font to select the rendering consequences; this will now not happen.
What we're getting is a more rigorous version of the Manuel de Codage.

Richard.

From unicode at unicode.org  Mon Nov  5 10:46:38 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 05 Nov 2018 09:46:38 -0700
Subject: Encoding (was: Re: A sign/abbreviation for "magister")
Message-ID: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com>

Philippe Verdy wrote:

> Note that I actually propose not just one rendering for the <abbreviation
> mark> but two possible variants (that would be equally valid without
> preference).

Actually you're not proposing them. You're talking about them (at length) on
the public mailing list. If you want to propose something, you should
consider writing a proposal.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org  Mon Nov  5 17:12:33 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 6 Nov 2018 00:12:33 +0100
Subject: Encoding
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
 <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com>
 <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr>
Message-ID: 

On 04/11/2018 20:19, Philippe Verdy via Unicode wrote:
[...]
> Even the mere fallback of rendering the <abbreviation mark> as a dotted
> circle (total absence of support) will not completely block reading the
> abbreviation:
>
> * you'll see "2e?" (which is still better than only "2e", with minimal
> impact) instead of
>
> * "2?" (which is worse! this is still what already happens when you use
> the legacy encoded modifier letter, which is also semantically ambiguous
> for text processing), or
>
> * "2e." (which is acceptable for rendering but semantically ambiguous for
> text processing)

I'm afraid the dotted circle instead of the .notdef box would be confusing.

> So compare things fairly: the solution I propose is EVEN MORE
> INTEROPERABLE than using <modifier letters> (which are also impossible to
> use for noting all abbreviations, as they are limited to just a few
> letters, and most of the time to only the few lowercase IPA symbols). It
> puts an end to the pressure to encode superscript letters.

Actually the existing set encompasses all Latin lowercase base letters
except q. As for putting an end to that pressure, that is also possible by
encoding the missing ones once and for all. As already stated, until the
opposite is posted authoritatively to this List, the Latin script is deemed
the only one making extensive use of superscript to denote abbreviations,
owing to a strong and long-lasting medieval practice acting as a template
on a few natural languages, namely those enumerated so far, among which
Polish.

> If you want to support other notations (e.g. chemical or mathematical
> notations, where superscripts and subscripts must both be present and
> stack together, and where variation with a dot or similar is allowed), you
> need another encoding, and the existing legacy <modifier letters> are not
> suitable either.

I don't lobby for supporting mathematics with more superscripts, but for
sure UnicodeMath would be able to use them when the set is complete. What I
did for chemical notations is to point out that chemistry seems to be
disfavored compared to mathematics, because instead of peculiar subscripts
it uses subscript Greek small letters. Three of them, as has been reported
on this List. They are being refused because they are letters of a script.
If they were fancy symbols, they would be encoded, as alchemical symbols
and mathematical symbols are.

Further, on 04/11/2018 20:51, Philippe Verdy via Unicode wrote:
[...]
> Once again you need something else for these technical notations, but
> NOT the proposed <abbreviation mark>, and NOT EVEN the existing "modifier
> letters", which were in fact first introduced only for IPA [...]
> [...] these letters are NOT conveying any semantic of an abbreviation,
> and this is also NOT the case for their usage as IPA symbols).

They do convey that semantic if used in a natural language that gives
superscript the semantics of an abbreviation. Unicode does not encode
semantics, as TUS specifies.

> There's NO interoperability at all when taking **abusively** the
> existing "modifier letters" (superscript letter or digit) for use in
> abbreviations [...].

The interoperability I mean is between formats and environments.
Interoperable in that sense is what is in the plain text backbone.

> Keep these "modifier letters" (superscript letters, digits or
> punctuation) for use as plain letters or plain digits or plain
> punctuation or plain symbols (including IPA) in natural languages.

That is what I'm suggesting to do: superscript letters are plain
abbreviation indicators, notably ordinal indicators and indicators in other
abbreviations, used in natural languages.

> Anything else is abusive and should be considered only as "legacy"
> encoding, not recommended at all in natural languages.
Put "traditional" in the place of "legacy", and you will come close to what is actually going on when coding palaeographic texts is achieved using purposely encoded Latin superscripts. The same applies to living languages, because it is interoperable and fits therefore Unicode quality standards about digitally representing the world?s languages. Finally, on 04/11/2018 21:59, Philippe Verdy via Unicode wrote: > > I can take another example about what I call "legacy encoding" (which > really means that such encoding is just an "approximation" from which > no semantic can be clearly infered, except by using a non-determinist > heuristic, which can frequently make "false guesses"). > > Consider the case of the legacy Hangul "half-width" jamos: [?] > > The same can be said about the heuristics that attempt to infer an > abbreviation semantic from existing superscript letters (either > encoded in Unicode, or encoded as plain letters modified by > superscripting style in CSS or HTML, or in word processors for > example): it fails to give the correct guess most of the time if > there's no user to confirm the actual intended meaning I don?t agree: As opposed to baseline fallbacks, Unicode superscripts allow the reader to parse the string as an abbreviation, and machines can be programmed to act likewise. > > Such confirmation is the job of spell correctors in word processors: > [?] the user may type "Mr." then the wavy line will appear under > these 3 characters, the spell checker will propose to encode it as an > abbreviation "Mr" or leave "Mr." > unchanged (and no longer signaled) in which case the dot remains a > regular punctuation, and the "r" is not modified. Then the user may > choose to style the "r" with superscripting or underlining, and a new > wavy red underline will appear below the three characters "M r>.", proposing to only transform the as > or and even when the user accepts one of > these suggestions it will remain "M." or > "M." where it is still possible to infer the > semantics of an abbreviation (propose to replace or keep the dot > after it), or doing nothing else and cancel these suggestions (to > hide the wavy red underline hint, added by the spell checker), or > instruct the spell checker that the meaning of the superscript r is > that of a mathematical exponent, or a chemical a notation. That mainly illustrates why is not interoperable. The input process seems to be too complicated. And if a base letter is to be transformed to formatted superscript, you do need OpenType, much like with U+2044 FRACTION SLASH behaving as intended, ie transforming the preceding digit string to formatted numerator digits, and the following to denominator digit glyphs. In that, U+2044 acts as a format control, and so does that you are suggesting to encode. > > In all cases, the user/author has full control of the intended > meaning of his text and an informed decision is made where all cases > are now distinguished. "Legacy" encoding can be kept as is (in > Unicode), even if it's no longer recommended, just like Unicode has > documented that half-width Hangul is deprecated (it just offers a > "compatibility decomposition" for NFKD or NFKC, but this is lossy and > cannot be done automatically without a human decision). 
> > And the user/author can now freely and easily compose any > abbreviation he wishes in natural languages, without being limited by > the reduced "legacy" set of encoded in Unicode So far as the full Latin lowercase alphabet, and for use in all-caps only, eventually the full Latin uppercase alphabet are encoded, I can see nothing of a limitation, given these letters have the grapheme cluster base property and therefore work with all combining diacritics. That is already working with good font support, as demonstrated in the parent thread. > (which should no longer be extended, except for use as distinct plain > letters needed in alphabets of actual natural languages, or as > possibly new IPA symbols), One should be able to overcome the pattern tagging superscripts as not being ?plain letters?, because that is irrelevant when they are used as abbreviation indicators in natural languages, and as such are plain characters, like eg the Romance ordinal indicators U+00AA and U+00BA; see also the DEGREE SIGN hijacked as a substitute of because not superscripting the o in "n?" is considered inacceptable. > and without using the styling tricks (of > HTML/CSS, or of word processor documents, spreadsheets, presentation > documents allowing "'rich text" formats on top of "plain text") which > are best suitable for "free styling" of any human text, without any > additional semantics, [?] Yes I fully agree, if ?semantics? is that required for readability in accordance with standard orthographies in use. Best regards, Marcel From unicode at unicode.org Mon Nov 5 17:32:32 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 6 Nov 2018 00:32:32 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com> References: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com> Message-ID: On 05/11/2018 17:46, Doug Ewell via Unicode wrote: > > Philippe Verdy wrote: > >> Note that I actually propose not just one rendering for the > abbrevaition mark> but two possible variants (that would be equally >> valid withou preference). > > Actually you're not proposing them. You're talking about them (at > length) on the public mailing list. If you want to propose something, > you should consider writing a proposal. The accepted meaning of "to propose" is not limited to the technical sense it is used with respect to Unicode. Also, Philippe and I are both influenced by our French locale, where "je propose" has pretty wide semantics. To conform with Unicode terminology, simply think "suggest", as in: ?Note that I actually suggest not just one rendering [?].? Thanks anyway for encouraging Philippe Verdy to submit the related encodingproposal. Best regards, Marcel From unicode at unicode.org Tue Nov 6 04:56:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 11:56:35 +0100 Subject: A sign/abbreviation for "magister" - first question summary Message-ID: <86zhumwk2k.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". > > First question is: how do you interpret the symbol? 
For me it is > definitely the capital M followed by the superscript "r" (written in an > old style no longer used in Poland), but there is something below the > superscript. It looks like a small "z", but such an interpretation > doesn't make sense for me. I've got almost immediately two complementary answers: On Sat, Oct 27 2018 at 9:11 -0400, Robert Wheelock wrote: > It is constructed much like the symbol for numero?only with a capital > accompanied by a superscript small > having an underbar (or > double underbar). On Sat, Oct 27 2018 at 6:58 -0700, Asmus Freytag via Unicode wrote: [...] > My suspicion would be that the small "z" is rather a "=" that > acquired a connecting stroke as part of quick handwriting. A./ and on the same day this interpretation was supported by Philippe Verdy: On Sat, Oct 27 2018 at 20:35 +0200, Philippe Verdy via Unicode wrote: [...] > I have the same kind of reading, the zigzagging stroek is an > hnadwritten emphasis of the uperscript r above it (explicitly noting > it is terminating the abbreviation), jut like the small underline that > happens sometimes below the superscript o in the abbreviation of > "numero" (as well sometimes there was not just one but two small > underlines, including in some prints). > > This sample is a perfect example of fast cursive handwritting (due to > high variability of all other letter shapes, sizes and joinings, where > even the capital M is written as two unconnected strokes), and it's > not abnormal to see in such condition this cursive joining between the > two underlining strokes so that it looks like a single zigzag. Later it was summarized by James Kass: On Fri, Nov 02 2018 at 2:59 GMT, James Kass via Unicode wrote: > Alphabetic script users write things the way they are spelled and > spell things the way they are written.? The abbreviation in question > as written consists of three recognizable symbols.? An "M", a > superscript "r", and an equal sign (= two lines).? It can be printed, > handwritten, or in fraktur; it will still consist of those same three > recognizable symbols. > > We're supposed to be preserving the past, not editing it or revising > it. It was commented by Julian Bradfield: On Fri, Nov 02 2018 at 8:54 GMT, Julian Bradfield via Unicode wrote: [...] > That's not true. The squiggle under the r is a squiggle - it is a > matter of interpretation (on which there was some discussion a hundred > messages up-thread or so :) whether it was intended to be = . > Just as it is a matter of interpretation whether the superscript and > squiggle were deeply meaningful to the writer, or whether they were > just a stylistic flourish for Mr. The abbreviation in question definitely consists of three symbols: an "M", a superscript "r" and the third one, which I think was best described by Robert Wheelock as double (under)bar, with the connecting stroke mentioned first by Asmus Freytag. This third element was referred to, also by myself, as a squiggle, but after looking up the definition of the word in a dictionary a short line that has been written or drawn and that curves and twists in a way that is not regular I think this is a misnomer. Unfortunately I have no better proposal. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Nov 6 04:59:23 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 11:59:23 +0100 Subject: A sign/abbreviation for "magister" - second question summary Message-ID: <86lg66wjxw.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". [...] > The second question is: are you familiar with such or a similar symbol? > Have you ever seen it in print? Later I provided some additional information: On Sat, Oct 27 2018 at 16:09 +0200, Janusz S. Bie? via Unicode wrote: > > The postcard is from the front of the first WW written by an > Austro-Hungarian soldier. He explaines the meaning of the abbreviation > to his wife, so looks like the abbreviation was used but not very > popular. On Sat, Oct 27 2018 at 20:25 +0200, Janusz S. Bie? via Unicode wrote: [...] > In the meantime I looked up some other postcards written by the same > person i found several other abbreviation including ? 'NUMERO SIGN' > (U+2116) written in the same way, i.e. with a double instead of a single > line. The similarity to ? 'NUMERO SIGN' was mentioned quite often in the thread, there seem to be no need to quote all this mentions here. A more general observation was formulated by Richard Wordingham: On Sun, Oct 28 2018 at 8:13 GMT, Richard Wordingham via Unicode wrote: [...] > The notation is a quite widespread format for abbreviations. the > first letter is normal sized, and the subsequent letter is written in > some variety of superscript with a squiggle underneath so that it > doesn't get overlooked. Various examples of such abbreviations were also mentioned several times in the thread, but again there seem to be no need to quote all this mentions here. Nobody however reported any other occurence of the symbol in question. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Nov 6 05:04:02 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 12:04:02 +0100 Subject: A sign/abbreviation for "magister" - third question summary Message-ID: <86h8guwjq5.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". > [...] > The third and the last question is: how to encode this symbol in > Unicode? A constructive answer to my question was provided quickly by James Kass: On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: > Mr? / M=? I answered: On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bie? via Unicode wrote: [...] > For me only the latter seems acceptable. Using COMBINING LATIN SMALL > LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as > the base character. However in the lack of a better solution I can live > with it :-) > > An alternative would be to use SMALL EQUALS SIGN, but looks like fonts > supporting it are rather rare. and Philippe Verdy commented: On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote: [...] > > There's a third alternative, that uses the superscript letter r, > followed by the combining double underline, instead of the normal > letter r followed by the same combining double underline. 
Some comments were made also by Michael Everson: On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: [...] > I would encode this as M? if you wanted to make sure your data > contained the abbreviation mark. It would not make sense to encode it > as M=? or anything else like that, because the ?r? is not modifying a > dot or a squiggle or an equals sign. The dot or squiggle or equals > sign has no meaning at all. And I would not encode it as Mr?, firstly > because it would never render properly and you might as well encode it > as Mr. or M:r, and second because in the IPA at least that character > indicates an alveolar realization in disordered speech. (Of course it > could be used for anything.) FYI, I decided to use the encoding proposed by Philippe Verdy (if I understand him correctly): M?? i.e. 'LATIN CAPITAL LETTER M' (U+004D) 'MODIFIER LETTER SMALL R' (U+02B3) 'COMBINING DOUBLE LOW LINE' (U+0333) for purely pragmatic reasons: it is rendered quite well in my Emacs. According to the 'fc-search-codepoint" script, the sequence is supported on my computer by almost 150 fonts, so I hope to find in due time a way to render it correctly also in XeTeX. I'm also going to add it to my private named sequences list (https://bitbucket.org/jsbien/unicode4polish). The same post contained a statement which I don't accept: On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: [...] > The squiggle in your sample, Janusz, does not indicate anything; it is > only a decoration, and the abbreviation is the same without it. One of the reasons I disagree was described by me in the separate thread "use vs mention": https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html There were also some other statements which I find unacceptable: On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: [...] > The abbreviation in the postcard, rendered in plain text, is "Mr". He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at 9:38 GMT (and earlier in a private mail). I understand that both of them by "plane text" mean Unicode. On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > You could use the various hacks you've discussed, with modifier > letters; but that is not "encoding", that is "abusing Unicode to do > markup". At least, that's the view I take! and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 -0700. The latter elaborated his view later and I answered: On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bie? via Unicode wrote: > On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: [...] >> All else is just applying visual hacks > > I don't mind hacks if they are useful and serve the intended purpose, > even if they are visual :-) [...] >> at the possible cost of obscuring the contents. > > It's for the users of the transcription to decide what is obscuring the > text and what, to the contrary, makes the transcription more readable > and useful. Please note that it's me who makes the transcription, it's me who has a vision of the future use and users, and in consequence it's me who makes the decision which aspects of text to encode. Accusing me of "abusing Unicode" will not stop me from doing it my way. I hope that at least James Kass understands my attitude: On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote: [...] > If I were entering plain text data from an old post card, I'd try to > keep the data as close to the source as possible. Because that would > be my purpose. Others might have different purposes. 
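For reference, a small Python sketch (using only the standard unicodedata
module) of the sequence settled on above, U+004D U+02B3 U+0333, together
with the superscript-letter inventory the thread keeps coming back to; the
"every lowercase letter except q" result reflects the Unicode database
current at the time of this thread and may differ in later versions:

    import sys
    import unicodedata as ud

    # The sequence chosen for the "magister" abbreviation:
    # LATIN CAPITAL LETTER M + MODIFIER LETTER SMALL R + COMBINING DOUBLE LOW LINE
    mgr = "\u004D\u02B3\u0333"
    for ch in mgr:
        print(f"U+{ord(ch):04X} {ud.name(ch)} (ccc={ud.combining(ch)})")

    # U+02B3 carries a <super> compatibility mapping to 'r', so a
    # compatibility fold such as NFKC turns the abbreviation back into
    # plain "Mr" with the double low line left on the 'r':
    print(ud.normalize("NFKC", mgr))          # 'M' 'r' U+0333

    # Which ASCII lowercase letters have a preformatted superscript form,
    # i.e. some character whose compatibility decomposition is "<super> x"?
    supers = set()
    for cp in range(sys.maxunicode + 1):
        d = ud.decomposition(chr(cp)).split()
        if d[:1] == ["<super>"] and len(d) == 2:
            base = chr(int(d[1], 16))
            if "a" <= base <= "z":
                supers.add(base)
    print(sorted(set("abcdefghijklmnopqrstuvwxyz") - supers))   # ['q'] here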
There were presented also some ideas which I would call "futuristic": introducing a new combining character and using variations sequences. This ideas should be discussed in separate threads, which seems to happen now. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Nov 7 13:49:38 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 7 Nov 2018 20:49:38 +0100 Subject: Preformatted superscript in ordinary text, paleography and phonetics using Latin script (was: Re: A sign/abbreviation for "magister" - third question summary) In-Reply-To: <86h8guwjq5.fsf@mimuw.edu.pl> References: <86h8guwjq5.fsf@mimuw.edu.pl> Message-ID: On 06/11/2018 12:04, Janusz S. Bie? via Unicode wrote: > > On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: >> Hi! >> >> On the over 100 years old postcard >> >> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 >> >> you can see 2 occurences of a symbol which is explicitely explained (in >> Polish) as meaning "Magister". >> > > [...] > >> The third and the last question is: how to encode this symbol in >> Unicode? > > > A constructive answer to my question was provided quickly by James Kass: > > On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: >> Mr? / M=? > > I answered: > > On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bie? via Unicode wrote: > > [...] > >> For me only the latter seems acceptable. Using COMBINING LATIN SMALL >> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as >> the base character. However in the lack of a better solution I can live >> with it :-) >> >> An alternative would be to use SMALL EQUALS SIGN, but looks like fonts >> supporting it are rather rare. > > and Philippe Verdy commented: > > On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote: > > [...] > >> >> There's a third alternative, that uses the superscript letter r, >> followed by the combining double underline, instead of the normal >> letter r followed by the same combining double underline. > > Some comments were made also by Michael Everson: > > On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: > > [...] > >> I would encode this as M? if you wanted to make sure your data >> contained the abbreviation mark. It would not make sense to encode it >> as M=? or anything else like that, because the ?r? is not modifying a >> dot or a squiggle or an equals sign. The dot or squiggle or equals >> sign has no meaning at all. And I would not encode it as Mr?, firstly >> because it would never render properly and you might as well encode it >> as Mr. or M:r, and second because in the IPA at least that character >> indicates an alveolar realization in disordered speech. (Of course it >> could be used for anything.) > > FYI, I decided to use the encoding proposed by Philippe Verdy (if I > understand him correctly): > > M?? > > i.e. > > 'LATIN CAPITAL LETTER M' (U+004D) > 'MODIFIER LETTER SMALL R' (U+02B3) > 'COMBINING DOUBLE LOW LINE' (U+0333) > > for purely pragmatic reasons: it is rendered quite well in my > Emacs. According to the 'fc-search-codepoint" script, the sequence is > supported on my computer by almost 150 fonts, so I hope to find in due > time a way to render it correctly also in XeTeX. I'm also going to add > it to my private named sequences list > (https://bitbucket.org/jsbien/unicode4polish). 
> > The same post contained a statement which I don't accept: > > On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: > > [...] > >> The squiggle in your sample, Janusz, does not indicate anything; it is >> only a decoration, and the abbreviation is the same without it. > > One of the reasons I disagree was described by me in the separate thread > "use vs mention": > > https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html > > There were also some other statements which I find unacceptable: > > On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: > > [...] > >> The abbreviation in the postcard, rendered in plain text, is "Mr". > > He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at > 9:38 GMT (and earlier in a private mail). > > I understand that both of them by "plane text" mean Unicode. > > > On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > >> You could use the various hacks you've discussed, with modifier >> letters; but that is not "encoding", that is "abusing Unicode to do >> markup". At least, that's the view I take! > > and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 > -0700. > > The latter elaborated his view later and I answered: > > On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bie? via Unicode wrote: >> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: > > [...] > >>> All else is just applying visual hacks >> >> I don't mind hacks if they are useful and serve the intended purpose, >> even if they are visual :-) > > [...] > >>> at the possible cost of obscuring the contents. >> >> It's for the users of the transcription to decide what is obscuring the >> text and what, to the contrary, makes the transcription more readable >> and useful. > > Please note that it's me who makes the transcription, it's me who has a > vision of the future use and users, and in consequence it's me who makes > the decision which aspects of text to encode. Accusing me of "abusing > Unicode" will not stop me from doing it my way. > > I hope that at least James Kass understands my attitude: > > On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote: > > [...] > >> If I were entering plain text data from an old post card, I'd try to >> keep the data as close to the source as possible. Because that would >> be my purpose. Others might have different purposes. > > There were presented also some ideas which I would call "futuristic": > introducing a new combining character and using variations sequences. > This ideas should be discussed in separate threads, which seems to > happen now. Thank you for debriefing. So far I?m pleased to infer that the outlined outcome encounters general agreement. It?s probably safe to conjecture that the case of the Polish abbreviation for "magister" is becoming a textbook example of the reception of the discussed Unicode policy with respect to superscript. Best regards, Marcel From unicode at unicode.org Fri Nov 9 06:42:54 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 9 Nov 2018 07:42:54 -0500 Subject: Aleph-umlaut Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: fmbdjcbgbjcgdjbe.png Type: image/png Size: 24681 bytes Desc: not available URL: From unicode at unicode.org Fri Nov 9 17:25:54 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Sat, 10 Nov 2018 00:25:54 +0100 Subject: Aleph-umlaut In-Reply-To: References: Message-ID: <20181110002554.0334d757@spixxi> Dear Mark, I found another sample here: https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf On page 86 it says that the aleph with diaresis is a number with the value 1000. See also the attached clipping. A second source is the Brown-Driver-Briggs Hebrew-English Lexicon of the Old Testament which also mentions that ??? ?means 1000, but there were no evidence of this usage in Old Testament times. See here (the very first lemma): www.biblab.com/students/dizionari/Brown-Driver-Briggs%20Hb-En%20Dic.docx Yet another usage in a mathematical context of an aleph with umlaut can be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut is used to mark the second derivative. https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter (page 28-29 or slide 41-42) However, seems that there is no real font support for these characters, though. The only font on my computer, which could render aleph + umlaut correctly on my system was Unifont and roughly Linux Libertine. Other fonts, in particular Arial, DejaVu Sans, Liberation Sans and Linux Biolinum rendered the diaeresis to much far to the left. I even found a user has a similar issue with U+0308, here: http://smontagu.org/writings/HebrewNumbers.html Maybe adding an annotation to U+0308 could sensitize font designers to be aware that this combining character is also used in the Hebrew alphabet. My suggestion is to add the annotation ?= hewbrew thousands multiplier? to U+0308 COMBINING DIAERESIS and a reference from 05B5 ?? HEBREW POINT TSERE to U+0308. Best regards, Marius On Fri, 9 Nov 2018 07:42:54 -0500 "Mark E. Shoulson via Unicode" wrote: > Noticed something really fascinating in an old pamphlet I was > reading.? It's from 1922, in Hebrew mostly but with some Yiddish at > the end.? The Yiddish spelling is not according to more modern > standardization, but seems to be significantly more faithful to the > German spellings of the same words, replacing Latin letters with > Hebrew ones more than respelling phonetically.? And there are even > places where it appears they represented a German ? with a Hebrew > aleph?with an umlaut!? Actually it looks a little more like a double > acute accent but that's surely a style choice, since it obviously is > mapping to an umlaut. > > > > (Note also the spelling ???, a calque for German "die", where modern > Yiddish would spell it phonetically as ??.) > > > I do NOT think this needs any special encoding, btw.? I would > probably encode this as simply U+05D0 U+0308 (??).? Combining symbols > do not (necessarily) belong to a specific alphabet, and the fact that > most fonts would render this badly is a different issue.? I just > thought the people here might find it interesting. > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > > ~mark > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: aleph_umlaut.png Type: image/png Size: 40318 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 488 bytes Desc: Digitale Signatur von OpenPGP URL: From unicode at unicode.org Fri Nov 9 18:02:33 2018 From: unicode at unicode.org (Tex via Unicode) Date: Fri, 9 Nov 2018 16:02:33 -0800 Subject: Aleph-umlaut In-Reply-To: <20181110002554.0334d757@spixxi> References: <20181110002554.0334d757@spixxi> Message-ID: <000c01d47888$aedea820$0c9bf860$@xencraft.com> My notes on Hebrew numbers on http://www.i18nguy.com/unicode/hebrew-numbers.html include: "Using letters for numbers, there is the possibility of confusion as to whether a string of letters is a word or a numerical value. Therefore, when numbers are used with text, punctuation marks are added to distinguish their numerical meaning. Single character numbers (numbers less than 10) add the punctuation character geresh after the numeric character. Larger numbers insert the punctuation character gershayim before the last character in the number." So perhaps Alef with diaeresis is a collapsed form of Alef followed by Gershayim when it is used as a numeric value. I wonder if that may also occur for other values. (I am just speculating.) Tex -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marius Spix via Unicode Sent: Friday, November 9, 2018 3:26 PM To: unicode at unicode.org Cc: Mark E. Shoulson Subject: Re: Aleph-umlaut Dear Mark, I found another sample here: https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf On page 86 it says that the aleph with diaresis is a number with the value 1000. See also the attached clipping. A second source is the Brown-Driver-Briggs Hebrew-English Lexicon of the Old Testament which also mentions that ??? ?means 1000, but there were no evidence of this usage in Old Testament times. See here (the very first lemma): www.biblab.com/students/dizionari/Brown-Driver-Briggs%20Hb-En%20Dic.docx Yet another usage in a mathematical context of an aleph with umlaut can be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut is used to mark the second derivative. https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter (page 28-29 or slide 41-42) However, seems that there is no real font support for these characters, though. The only font on my computer, which could render aleph + umlaut correctly on my system was Unifont and roughly Linux Libertine. Other fonts, in particular Arial, DejaVu Sans, Liberation Sans and Linux Biolinum rendered the diaeresis to much far to the left. I even found a user has a similar issue with U+0308, here: http://smontagu.org/writings/HebrewNumbers.html Maybe adding an annotation to U+0308 could sensitize font designers to be aware that this combining character is also used in the Hebrew alphabet. My suggestion is to add the annotation ?= hewbrew thousands multiplier? to U+0308 COMBINING DIAERESIS and a reference from 05B5 ?? HEBREW POINT TSERE to U+0308. Best regards, Marius On Fri, 9 Nov 2018 07:42:54 -0500 "Mark E. Shoulson via Unicode" wrote: > Noticed something really fascinating in an old pamphlet I was reading. > It's from 1922, in Hebrew mostly but with some Yiddish at the end. 
> The Yiddish spelling is not according to more modern standardization, > but seems to be significantly more faithful to the German spellings of > the same words, replacing Latin letters with Hebrew ones more than > respelling phonetically. And there are even places where it appears > they represented a German ? with a Hebrew aleph?with an umlaut! > Actually it looks a little more like a double acute accent but that's > surely a style choice, since it obviously is mapping to an umlaut. > > > > (Note also the spelling ???, a calque for German "die", where modern > Yiddish would spell it phonetically as ??.) > > > I do NOT think this needs any special encoding, btw. I would probably > encode this as simply U+05D0 U+0308 (??). Combining symbols do not > (necessarily) belong to a specific alphabet, and the fact that most > fonts would render this badly is a different issue. I just thought > the people here might find it interesting. > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36 > 609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > > ~mark > From unicode at unicode.org Sat Nov 10 00:25:36 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 10 Nov 2018 06:25:36 +0000 Subject: Aleph-umlaut In-Reply-To: <000c01d47888$aedea820$0c9bf860$@xencraft.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> Message-ID: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> In the last pages of the text linked by Mark E. Shoulson, both the gershayim and the aleph-umlaut are shown.? A quick look didn't find any other base letter with the combining umlaut. -------------- next part -------------- A non-text attachment was scrubbed... Name: YiddishUmlaut~2.PNG Type: image/png Size: 80947 bytes Desc: not available URL: From unicode at unicode.org Sat Nov 10 09:28:08 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sat, 10 Nov 2018 10:28:08 -0500 Subject: Aleph-umlaut In-Reply-To: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: Hi Everyone, Are we sure this is actually Yiddish? To me it looks like it could be German transliterated into the Yiddish/Hebrew alphabet. I can spend a little more time with it and put together some examples. Beth On Sat, Nov 10, 2018 at 1:28 AM James Kass via Unicode wrote: > > In the last pages of the text linked by Mark E. Shoulson, both the > gershayim and the aleph-umlaut are shown. A quick look didn't find any > other base letter with the combining umlaut. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 10 18:54:15 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 19:54:15 -0500 Subject: Aleph-umlaut In-Reply-To: <20181110002554.0334d757@spixxi> References: <20181110002554.0334d757@spixxi> Message-ID: On 11/9/18 6:25 PM, Marius Spix via Unicode wrote: > Dear Mark, > > I found another sample here: > https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf > > On page 86 it says that the aleph with diaresis is a number with > the value 1000. That's true, I've heard of that, and even occasionally seen it. 
And sometimes in old printings things like a diaeresis or a dot above were used where later Hebrew uses a U+05F3 HEBREW PUNCTUATION GERESH or U+05F4 HEBREW PUNCTUATION GERSHAYIM.? I think what struck me about this one was that this was not just something that looked like a diaeresis/umlaut, it really WAS an umlaut, a direct transcoding of the a-umlaut in Latin letters into aleph-umlaut in Hebrew letters. > Yet another usage in a mathematical context of an aleph with umlaut can > be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 > HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut > is used to mark the second derivative. > https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter > (page 28-29 or slide 41-42) Kind of an odd usage, since ALEF SYMBOL is usually used for transfinite cardinals, as in ??, and you don't normally take time-derivatives of those.? But mathematicians love using weird symbols for whatever they like.? This is the mathematical usage of two-dots-above, as you note. ~mark From unicode at unicode.org Sat Nov 10 19:03:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:03:11 -0500 Subject: Aleph-umlaut In-Reply-To: <000c01d47888$aedea820$0c9bf860$@xencraft.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> Message-ID: <6d22f2a1-7daa-afc1-d537-c20e6816bbe2@kli.org> On 11/9/18 7:02 PM, Tex via Unicode wrote: > My notes on Hebrew numbers on http://www.i18nguy.com/unicode/hebrew-numbers.html include: > > "Using letters for numbers, there is the possibility of confusion as to whether a string of letters is a word or a numerical value. Therefore, when numbers are used with text, punctuation marks are added to distinguish their numerical meaning. Single character numbers (numbers less than 10) add the punctuation character geresh after the numeric character. Larger numbers insert the punctuation character gershayim before the last character in the number." > > So perhaps Alef with diaeresis is a collapsed form of Alef followed by Gershayim when it is used as a numeric value. I wonder if that may also occur for other values. I don't know that it's a "collapsed" form.? I think the double-dotted form is just an alternate one, and one that was more popular in older times.? Standardized Hebrew numerical usage would be to use a GERESH (not a GERSHAYIM) after an ALEF to indicate a thousand; GERSHAYIM is used before the last letter in a number that is "large" generally in the sense of the number of letters (i.e. more than one or two).? Since GERESH is also used for single-letter numbers, this means that ?? could mean "one" (much more common) or "one thousand".? The GERESH-after becomes useful in something like the full number of the year, ??????? where it sets off the initial 5, making it 5000 (this notation is not place-value, but there is a usual ordering, so technically it would (usually) be understandable even without the punctuation marks, due to the out-of-order placement of the initial HE). Again, what interested me about this usage was that it really *was* an umlaut.? But yes, there are other situations where such a thing could happen. ~mark From unicode at unicode.org Sat Nov 10 19:17:35 2018 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:17:35 -0500 Subject: Aleph-umlaut In-Reply-To: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: On 11/10/18 1:25 AM, James Kass via Unicode wrote: > > In the last pages of the text linked by Mark E. Shoulson, both the > gershayim and the aleph-umlaut are shown.? A quick look didn't find > any other base letter with the combining umlaut. > Indeed; there is no shortage of use of the GERSHAYIM, used as it normally is, to indicate abbreviations.? The umlaut on the alef is used specifically in the Yiddish parts, to be an umlaut (the word with the GERSHAYIM on the top line is an abbreviation for the phrase for a legal court or authority; the word on the second like transliterates apparently to "best?tigt"; someone with better German than me can make more sense of it.? The example I sent at first used the word "legalit?t", which even I can understand as "legality" or something like that.)? I think the Yiddish at the time may already not have had ? or ? sounds, so had no need to transliterate those (or maybe there just happened not to be a need for them in this text); certainly I see Yiddish spellings like ?????? ("oyf-") where German would have "auf". ~mark From unicode at unicode.org Sat Nov 10 19:49:46 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:49:46 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: On 11/10/18 10:28 AM, Beth Myre via Unicode wrote: > Hi Everyone, > > Are we sure this is actually Yiddish?? To me it looks like it could be > German transliterated into the Yiddish/Hebrew alphabet. > > I can spend a little more time with it and put together some examples. > > Beth Is there really a difference?? In a very real sense, Yiddish *IS* a form of German (I'm told it's Middle High German, but I don't actually have knowledge about that), with a strong admixture of Hebrew and Russian and a few other languages, and which is usually written using Hebrew letters.? There's probably something like a continuum with "Yiddish" and "German" as ranges or points. Is the text *standard* German written with Hebrew letters?? I don't think so.? Let's see, on the next-to-last page, end of first paragraph, I see the phrase ?????????????? ????????????, which would transliterate to "oytorit?ten bekr?fting"?with umlauted "a", but "oy-" instead of "au-" at the beginning.? OK, I know in German "au" can be pronounced "oy-" sometimes (I think), but at least https://en.wiktionary.org/wiki/Autorit%C3%A4t implies that this isn't the usual/standard pronunciation (I make no claims as to expertise in German).? The text is littered with terms like ????, abbreviation for Hebrew ??? ???, "house of judgment" or legal court, pronounced in Yiddish "beisdin", or ??? (can't be German as it has no vowels!) meaning "legal decision," from Hebrew?Hebrew-derived words in Yiddish do not change their spelling, as a rule.? 
There are definitely German spelling features that are not found in later spellings, for example, double letters in German are written double in the Yiddish spelling too, which is quite unusual (you're used to letters in Hebrew never being silent or even geminate, but always having at least a semi-syllable sound between like letters, except in special cases, so it seems striking to see ???? for a simple two syllables). So I'm not sure if there's a *real* answer to your question, but it does look to me like this isn't "normal" German, at least.? And would it matter, anyway? ~mark From unicode at unicode.org Sat Nov 10 20:24:34 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 21:24:34 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> Message-ID: <22c4c73c-a6ef-7d0c-3b8c-2e6e3e21e57e@kli.org> Oh yeah, fun fact about this document that I linked at the outset: I found it like 10 years ago when researching something unrelated... it apparently is a ruling opposing an earlier announcement by another group of Rabbis, declaring it void.? And looking at the rabbis they say are supporting them in this decision, I see they mention Rabbi Joseph Rosen, chief Rabbi of "Wisloch".? And I think to myself, "How interesting.? I have a great-grandfather who was named Rabbi Joseph Rosen, chief Rabbi of a town called Swisloch" (with an S; presumably an error in the pamphlet.)? I checked with my father; the timing is about right, would have been shortly before he came to America.? The Internet moves in mysterious ways. ~mark From unicode at unicode.org Sun Nov 11 05:33:20 2018 From: unicode at unicode.org (Otto Stolz via Unicode) Date: Sun, 11 Nov 2018 12:33:20 +0100 Subject: Aleph-umlaut In-Reply-To: References: Message-ID: <632ae659-cf2e-65a5-64fd-cc94651c6f9f@uni-konstanz.de> Am 2018-11-09 um 13:42 schrieb Mark E. Shoulson via Unicode: > > Noticed something really fascinating in an old pamphlet I was reading > really interesting, thanks! > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > To me, this link delivers an empty document. Please check the spelling of the URL. Best wishes, ?? Otto From unicode at unicode.org Sun Nov 11 00:03:15 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sun, 11 Nov 2018 01:03:15 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: Hi Mark, This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include *beit din* or *p'sak* in an English text. Here's a paragraph from page 22: [image: Paragraph.jpg] I (re-)transliterated it, and it reads: Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: a)? That's just German. 
Something like - We know that we can't expect repentance or insight from the other party, and that they will disregard the consequences of these rabbinical reports because: a)? I only know a little Yiddish (one semester a long time ago), but I think Yiddish word order would be very different. Also, 'we are' would be 'mir zaynen' instead of 'wir sind,' 'and' would be 'un' instead of 'und,' etc. Beth On Sat, Nov 10, 2018 at 8:51 PM Mark E. Shoulson via Unicode < unicode at unicode.org> wrote: > On 11/10/18 10:28 AM, Beth Myre via Unicode wrote: > > Hi Everyone, > > > > Are we sure this is actually Yiddish? To me it looks like it could be > > German transliterated into the Yiddish/Hebrew alphabet. > > > > I can spend a little more time with it and put together some examples. > > > > Beth > > Is there really a difference? In a very real sense, Yiddish *IS* a form > of German (I'm told it's Middle High German, but I don't actually have > knowledge about that), with a strong admixture of Hebrew and Russian and > a few other languages, and which is usually written using Hebrew > letters. There's probably something like a continuum with "Yiddish" and > "German" as ranges or points. > > Is the text *standard* German written with Hebrew letters? I don't > think so. Let's see, on the next-to-last page, end of first paragraph, > I see the phrase ?????????????? ????????????, which would transliterate > to "oytorit?ten bekr?fting"?with umlauted "a", but "oy-" instead of > "au-" at the beginning. OK, I know in German "au" can be pronounced > "oy-" sometimes (I think), but at least > https://en.wiktionary.org/wiki/Autorit%C3%A4t implies that this isn't > the usual/standard pronunciation (I make no claims as to expertise in > German). The text is littered with terms like ????, abbreviation for > Hebrew ??? ???, "house of judgment" or legal court, pronounced in > Yiddish "beisdin", or ??? (can't be German as it has no vowels!) meaning > "legal decision," from Hebrew?Hebrew-derived words in Yiddish do not > change their spelling, as a rule. There are definitely German spelling > features that are not found in later spellings, for example, double > letters in German are written double in the Yiddish spelling too, which > is quite unusual (you're used to letters in Hebrew never being silent or > even geminate, but always having at least a semi-syllable sound between > like letters, except in special cases, so it seems striking to see ???? > for a simple two syllables). > > So I'm not sure if there's a *real* answer to your question, but it does > look to me like this isn't "normal" German, at least. And would it > matter, anyway? > > ~mark > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Paragraph.jpg Type: image/jpeg Size: 181354 bytes Desc: not available URL: From unicode at unicode.org Sun Nov 11 13:42:53 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 11:42:53 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <1805fb55-9e34-c7ac-f5f0-ced99cd3b163@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Nov 11 14:32:29 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 11 Nov 2018 21:32:29 +0100 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: > On 11 Nov 2018, at 07:03, Beth Myre via Unicode wrote: > > Hi Mark, > > This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. > > Here's a paragraph from page 22: Actually page 21. > > > > I (re-)transliterated it, and it reads: Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : From unicode at unicode.org Sun Nov 11 15:16:01 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 13:16:01 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 11 15:37:10 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 11 Nov 2018 22:37:10 +0100 Subject: Aleph-umlaut In-Reply-To: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: <0532C015-564D-4451-9101-44F75DA535E8@telia.com> > On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: > > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >> >>> On 11 Nov 2018, at 07:03, Beth Myre via Unicode >>> wrote: >>> >>> Hi Mark, >>> >>> This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. >>> >>> Here's a paragraph from page 22: >>> >> Actually page 21. 
>> >> >>> >>> >>> I (re-)transliterated it, and it reads: >>> >> Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: >> >> >>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >>> >> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : > > This agrees rather well with Beth's retranslation. > Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these pronunciations are spelled in German (with a few outliers like "izt" for "ist", where the "s" isn't voiced in German). There's also a clear convention of using "kh" for "ch" (as in English "loch" but also for other pronunciation of the German "ch"). The one apparent mismatch is "ge- gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can stand for phonetic "p". "Parthey" might be how Germans could have written "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" alternated with "ei", as in my last name). So, perhaps it's closer than it looks, superficially. > From context, "Reue" is by far the best match for "Reye" and seems to match a tendency elsewhere in the sample where the transliteration, if pronounced as German, would result in a shifted quality for the vowels (making them sound more Yiddish, for lack of a better description). > > "absch?ttelen" - here the second "e" would not be part of Standard German orthography. It's either an artifact of the transcription system or possibly reflects that the writer is familiar with a different spelling convention (to my eyes the spelling "abshittelen" looks somehow more Yiddish, but I'm really not familiar enough with that language). > > But still, the text is unquestionably intended to be in German. One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. The German translation it gives: Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: And in English: Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: >From the original Hebrew script, in case someone wants to try out more possibilities: ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? 
: From unicode at unicode.org Sun Nov 11 17:00:06 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sun, 11 Nov 2018 15:00:06 -0800 Subject: Aleph-umlaut In-Reply-To: <0532C015-564D-4451-9101-44F75DA535E8@telia.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: On 11/11/2018 1:37 PM, Hans ?berg wrote: >> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: >> >> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>>> On 11 Nov 2018, at 07:03, Beth Myre via Unicode >>>> wrote: >>>> >>>> Hi Mark, >>>> >>>> This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. >>>> >>>> Here's a paragraph from page 22: >>>> >>> Actually page 21. >>> >>> >>>> >>>> >>>> I (re-)transliterated it, and it reads: >>>> >>> Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: >>> >>> >>>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >>>> >>> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : >> This agrees rather well with Beth's retranslation. >> Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these pronunciations are spelled in German (with a few outliers like "izt" for "ist", where the "s" isn't voiced in German). There's also a clear convention of using "kh" for "ch" (as in English "loch" but also for other pronunciation of the German "ch"). The one apparent mismatch is "ge- gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can stand for phonetic "p". "Parthey" might be how Germans could have written "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" alternated with "ei", as in my last name). So, perhaps it's closer than it looks, superficially. >> From context, "Reue" is by far the best match for "Reye" and seems to match a tendency elsewhere in the sample where the transliteration, if pronounced as German, would result in a shifted quality for the vowels (making them sound more Yiddish, for lack of a better description). >> >> "absch?ttelen" - here the second "e" would not be part of Standard German orthography. It's either an artifact of the transcription system or possibly reflects that the writer is familiar with a different spelling convention (to my eyes the spelling "abshittelen" looks somehow more Yiddish, but I'm really not familiar enough with that language). >> >> But still, the text is unquestionably intended to be in German. > One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. 
The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. Well, the OCR does a much better job than the "translation". > The German translation it gives: > Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: This is simply utter nonsense and does not even begin to correlate with the transliteration. > And in English: > Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: In fact, the English translation makes somewhat more sense. For example, "Gegenpartei" in many legal contexts (which this sample isn't, by the way) can in fact be translated as "injured party", which in turn might correlate with an "injured side" as rendered. However "Seite der Verletzten" makes no sense in this context, unless there's a Hebrew word that accidentally matches and got picked up. (I'm suspicious that some of the auto translation does in fact work like many real translations which often are not direct, but involve an intermediate language - simply because it's not possible to find sufficient translators between random pairs of languages.). > > From the original Hebrew script, in case someone wants to try out more possibilities: > ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : > > I don't know what that will tell you. You have a rendering that produces coherent text which closely matches a phonetic transliteration. What else do you hope to learn? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 11 17:55:10 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 12 Nov 2018 00:55:10 +0100 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: <93724E33-20B4-4725-938F-EF6494CFF901@telia.com> > On 12 Nov 2018, at 00:00, Asmus Freytag (c) wrote: > > On 11/11/2018 1:37 PM, Hans ?berg wrote: >>> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode >>> wrote: >>> >>> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>> >> One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. > > Well, the OCR does a much better job than the "translation". Not so surprising, but it did not have a literal OCR. An OCR can improve transliteration by guessing the language to fill in partial recognition, so there is a fallacy already there. 
>> The German translation it gives: >> Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: > > This is simply utter nonsense and does not even begin to correlate with the transliteration. > >> And in English: >> Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: > > In fact, the English translation makes somewhat more sense. For example, "Gegenpartei" in many legal contexts (which this sample isn't, by the way) can in fact be translated as "injured party", which in turn might correlate with an "injured side" as rendered. However "Seite der Verletzten" makes no sense in this context, unless there's a Hebrew word that accidentally matches and got picked up. > (I'm suspicious that some of the auto translation does in fact work like many real translations which often are not direct, but involve an intermediate language - simply because it's not possible to find sufficient translators between random pairs of languages.). Google translation uses AI by comparing texts in both languages, the Rosetta stone method. Therefore, there is a poor result for languages where there are less available texts to compare with. Sometimes it can be better than dictionaries if it concerns more modern terms. But in other cases, it may just be gibberish. >> From the original Hebrew script, in case someone wants to try out more possibilities: >> ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : >> > I don't know what that will tell you. You have a rendering that produces coherent text which closely matches a phonetic transliteration. What else do you hope to learn? It is up to whoever likes to try (FYI). From unicode at unicode.org Sun Nov 11 18:12:27 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:12:27 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <44cf1a26-e53b-564f-1ad8-8aaa50bc8f03@kli.org> On 11/11/18 3:32 PM, Hans ?berg via Unicode wrote: > Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: > >> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: > vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : Yeah, you have to be careful of auto-transliterating, if that's what you're using for this transliteration.? The third word is definitely not "auns"; the alef at the beginning is a "shtumer-alef", a *silent* letter used in Yiddish a little like a mater lectionis, now that I think about it: it's a nominal (but void) consonant used as a place-holder to hold the vowel? 
(Hebrew allows words to start with a vocalic vav, only when it's used as a conjunction, but Yiddish does not, generally.? Nor a vocalic yod or double-yod or vav-yod diphthong.)? Interesting that you have "*zya dya" there (those are silent as well; the words are just "zi di"); it looks like elsewhere in the document they spell it with a more precise transliteration, strictly using AYIN for "e", not ALEF as here. ~mark From unicode at unicode.org Sun Nov 11 18:20:08 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:20:08 -0500 Subject: Aleph-umlaut In-Reply-To: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: On 11/11/18 4:16 PM, Asmus Freytag via Unicode wrote: > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >> >>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : > > > This agrees rather well with Beth's retranslation. > > Mapping "z" to "s",? "f" to "v" and "v" to "w" would match the way > these pronunciations are spelled in German (with a few outliers like > "izt" for "ist", where the "s" isn't voiced in German). There's also a > clear convention of using "kh" for "ch" (as in English "loch" but also > for other pronunciation of the German "ch"). The one apparent mismatch > is "ge- gefarthey" for "Gegenpartei". Presumably what is > transliterated as "f" can stand for phonetic "p". "Parthey" might be > how Germans could have written "Partei" in earlier centuries (when > "th" was commonly used for "t" and "ey" alternated with "ei", as in my > last name).? So, perhaps it's closer than it looks, superficially. > I think that really IS a "p"; elsewhere in the document they seem to be quite careful to put a RAFE on top of the PEH when it means "f", and not using a DAGESH to mark "p".? There definitely does seem to be usage of TET-HEH for "th"; in the Hebrew text at the beginning it talks about the ?????? community?took me a bit to work out that was an abbreviation for "Orthodox". > From context, "Reue" is by far the best match for "Reye" and seems to > match a tendency elsewhere in the sample where the transliteration, if > pronounced as German, would result in a shifted quality for the vowels > (making them sound more Yiddish, for lack of a better description). > That word is hard to read in the original, hence the "?" in the transliteration.? It isn't clear if it's YOD YOD or YOD VAV and the VAV is missing its body (the head looks different than it should if it were a YOD).? Which would match your "Reue" fairly well?except that they generally use AYIN for "e", not "YOD". > > "absch?ttelen" - here the second "e" would not be part of Standard > German orthography. It's either an artifact of the transcription > system or possibly reflects that the writer is familiar with a > different spelling convention (to my eyes the spelling "abshittelen" > looks somehow more Yiddish, but I'm really not familiar enough with > that language). > The ? 
is, of course, not in the text in the original; it's just "i".? German ? wound up as "i" in Yiddish, in most cases. ~mark From unicode at unicode.org Sun Nov 11 18:24:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:24:11 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: <7ab3c46d-dd95-2a75-9687-f29145c59b8f@kli.org> On 11/11/18 6:00 PM, Asmus Freytag (c) via Unicode wrote: > On 11/11/2018 1:37 PM, Hans ?berg wrote: >>> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: >>> >>> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>> One should not rely too much these autotranslation tools, but it may >>> be quicker using some OCR program and then correct by hand, than >>> entering it all by hand. The setup did not admit transliterating >>> Hebrew script directly into German. It seems that the translator >>> program recognizes it as Yiddish, though it might be as a result of >>> an assumption it makes. > > > Well, the OCR does a much better job than the "translation". > Agreed: > > >> The German translation it gives: >> Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: > > > This is simply utter nonsense and does not even begin to correlate > with the transliteration. > Yeah, that looks like word salad even to me and my tiny knowledge of German.? The first words are definitely "Wir sind," for example. > >> And in English: >> Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: > > > In fact, the English translation makes somewhat more sense. For > example, "Gegenpartei" in many legal contexts (which this sample > isn't, by the way) can in fact be translated as "injured party", which > in turn might correlate with an "injured side" as rendered. However > "Seite der Verletzten" makes no sense in this context, unless there's > a Hebrew word that accidentally matches and got picked up. > The pamphlet seems to be referring to forming some sort of sub-community or group as a "gegenpartei," I think. The actual content of the work is not a deep mystery, really. ~mark > (I'm suspicious that some of the auto translation does in fact work > like many real translations which often are not direct, but involve an > intermediate language - simply because it's not possible to find > sufficient translators between random pairs of languages.). > >> >From the original Hebrew script, in case someone wants to try out more possibilities: >> ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : >> >> > I don't know what that will tell you. You have a rendering that > produces coherent text which closely matches a phonetic > transliteration. What else do you hope to learn? > > A./ > From unicode at unicode.org Sun Nov 11 18:05:52 2018 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:05:52 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <8471fd97-447a-74c8-0bb0-5d04f206b90d@kli.org> On 11/11/18 1:03 AM, Beth Myre via Unicode wrote: > Hi Mark, > > This is a really cool find, and it's interesting that you might have a > relative mentioned in it.? After looking at it more, I'm more > convinced that it's German written in Hebrew letters, not Yiddish.? I > think that explains the umlauts. Since the text is about Jewish > subjects, it also includes Hebrew words like you mentioned, just like > we would include /beit din/ or /p'sak/ in an English text. Again, I'm not so sure there's really a difference.? Yiddish *IS* Judeo-German.? That's what it's called.? Do you prefer to think of it as German?? OK with me, but it's more a matter of taste than fact. > > Here's a paragraph from page 22: > > Paragraph.jpg > > I (re-)transliterated it, and it reads: > > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei?weder > Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen > dieser rabbinischen Gutachten von sich absch?ttelen werden mit der > Motivierung, dass: > Are you sure you're not embellishing a bit?? I note you have ?, and yet the text clearly says "abshitellen".? The ? sound did not survive into later Yiddish, usually becoming "i", and the ? sound apparently didn't either... but is still there at this particular time and place. > I only know a little Yiddish (one semester a long time ago), but I > think Yiddish word order would be very different.? Also, 'we are' > would be 'mir zaynen' instead of 'wir sind,' 'and' would be 'un' > instead of 'und,' etc. > Yiddish "and" is now spelled "un" (alef-vav-finalnun), but I have seen it spelled alef-vav-nun-geresh, indicating the elision of the final -d in older texts.? It would not surprise me at all if some dialects preserved the -d, in spelling anyway, longer than others.? "Mir zaynen" is definitely "normal" Yiddish so far as I know... but how far do I know? What is this argument over anyway?? "You claim that this animal is a mutt, but I tell you it is clearly a dog of mixed breed!" ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Paragraph.jpg Type: image/jpeg Size: 181354 bytes Desc: not available URL: From unicode at unicode.org Sun Nov 11 19:28:38 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 17:28:38 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Nov 11 22:28:01 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sun, 11 Nov 2018 23:28:01 -0500 Subject: Aleph-umlaut In-Reply-To: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: Hi All, I wanted to clarify how I got this: *Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass:* As a (non-native) German speaker who knows the Hebrew alphabet, I looked at the text, and then wrote the text contained in it using conventional German spelling. I spelled absch?tteln wrong. I didn't change the word order or vocabulary. The translation into English was also my own. The spelling of the word 'Reue' surprised me and one of the letters looked odd, so I put a question mark after it. I wasn't transliterating letter-for-letter, which wouldn't be possible because certain letters written next to each other produce specific sounds. For example, the Hebrew letters yud-yud make the German sound 'ei,' and the letters vav-vav make the German sound 'w.' The Hebrew alphabet just provides different material to work with than the Latin alphabet. Speaking of, it will soon be Chanukah/Hanukkah/Hanukah! :) The transliteration created by a computer program in one of the previous emails makes the text look more Yiddish-y than it is, probably because it was expecting Yiddish. It also made several clear errors. A few examples, some of which Mark mentioned: - The Hebrew letter aleph was always transliterated as 'a.' However, whenever it had the small vowel symbol that looks like a 'T' underneath it, it should have been an 'o.' And in several locations it's a 'shtumer aleph' (a.k.a. silent aleph) that's basically just carrying the letter used for 'u,' so it shouldn't be included at all. - It was inconsistent in how it transliterated the Hebrew letter 'yud,' sometimes making it an 'i' but more often a 'y.' The 'y' makes it look like Yiddish, but they're both valid. It's also used in other parts of the text for the German 'j.' - It skipped the 'n' in 'Gegenpartei,' although it's definitely present in the text. There's also an 'h' after the 't,' so the word is basically spelled "Gegenparthei." - It missed the difference between the 'f' sound and the 'p' sound, which is represented in the text by the presence or absence of a small line over the same Hebrew letter. Mark, you asked why I brought up the question of whether this is Yiddish or German. They're two separate but related languages, and I thought this text was really interesting because it turned out not to be what I was expecting. I'm not a scholar, and I didn't realize that anyone ever wrote in German using Hebrew letters. It's a struggle for me to understand Yiddish and my Hebrew is limited. Being able to understand entire paragraphs written in Hebrew letters is a rare treat for me. Beth On Sun, Nov 11, 2018 at 8:31 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 11/11/2018 4:20 PM, Mark E. 
Shoulson via Unicode wrote: > > On 11/11/18 4:16 PM, Asmus Freytag via Unicode wrote: > > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: > > > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), > noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser > rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, > dass: > > vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , > nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer > rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , > dass : > > > > This agrees rather well with Beth's retranslation. > > Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these > pronunciations are spelled in German (with a few outliers like "izt" for > "ist", where the "s" isn't voiced in German). There's also a clear > convention of using "kh" for "ch" (as in English "loch" but also for other > pronunciation of the German "ch"). The one apparent mismatch is "ge- > gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can > stand for phonetic "p". "Parthey" might be how Germans could have written > "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" > alternated with "ei", as in my last name). So, perhaps it's closer than it > looks, superficially. > > I think that really IS a "p"; elsewhere in the document they seem to be > quite careful to put a RAFE on top of the PEH when it means "f", and not > using a DAGESH to mark "p". There definitely does seem to be usage of > TET-HEH for "th"; in the Hebrew text at the beginning it talks about the > ?????? community?took me a bit to work out that was an abbreviation for > "Orthodox". > > From context, "Reue" is by far the best match for "Reye" and seems to > match a tendency elsewhere in the sample where the transliteration, if > pronounced as German, would result in a shifted quality for the vowels > (making them sound more Yiddish, for lack of a better description). > > That word is hard to read in the original, hence the "?" in the > transliteration. It isn't clear if it's YOD YOD or YOD VAV and the VAV is > missing its body (the head looks different than it should if it were a > YOD). Which would match your "Reue" fairly well?except that they generally > use AYIN for "e", not "YOD". > > > "absch?ttelen" - here the second "e" would not be part of Standard German > orthography. It's either an artifact of the transcription system or > possibly reflects that the writer is familiar with a different spelling > convention (to my eyes the spelling "abshittelen" looks somehow more > Yiddish, but I'm really not familiar enough with that language). > > The ? is, of course, not in the text in the original; it's just "i". > German ? wound up as "i" in Yiddish, in most cases. > > > I agree with Beth that the text reads like a transcription of a standard > German text, not like a transcription of Yiddish, small infidelities in > vowel/consonant renderings notwithstanding. These are either because the > transcription conventions deliberately make some substitutions (presumably > there's no Hebrew letter that would directly match an "?", so they picked > "i") or because the writer, while trying to capture standard German in this > instance, is aware of a different orthography. The result, before Beth > tweaked it, would resemble a bit a phonetic transcription of someone > speaking standard German with a Yiddish accent. 
The fact that there are no > differences in grammar and the phrasing is absolutely natural for written > German is what confirms the identification as German, rather than Yiddish > text. > > Just because Yiddish is closely related to German doesn't mean that you > can simply write the former with standard German phonetics and have it > match a text in standard German to the point where there's no distinction. > I think the sample is long enough and involved enough to give quite decent > confidence in discriminating between these two Germanic languages. Grammar, > phrasing and word choice are in that sense much better indicators than pure > spelling; just as people trying to assume some foreign accent will give > themselves away by faithfully maintaining the underlying structure of the > language - that even works if the "accent" includes a few selected bits of > "foreign" word order or grammar. In those artificial examples, there's > rarely the kind of subtle mistake that a true non-native will make. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Nov 12 19:48:39 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 12 Nov 2018 20:48:39 -0500 Subject: Aleph-umlaut In-Reply-To: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: You know, you're right (as is Beth), and I don't know why I'm arguing the point.? It's something I've been working on: I shouldn't defend a position JUST because it's _my_ position, and yet that's just what I did. So, yes, it certainly does seem essentially German.? I couldn't say why they chose to write this part in German, or why they chose to transcribe it in Hebrew letters, really.? I assumed Yiddish probably because of the context and the alphabet used, but there's no reason for it not to be German.? Now, the pamphlet originated from Kloizenberg, i.e. https://en.wikipedia.org/wiki/Cluj-Napoca which is in Romania, but German was probably enough of a lingua franca (after all, Yiddish developed from it for that reason).? And the text being basically German would explain the aleph-umlaut which was the start of all this, though it doesn't so much need an "explanation": it's interesting enough that it's _there_.? Also interesting that no other umlauted letters were considered distinct enough to be transcribed so (or else they just happened not to show up).? There are probably mildly interesting things (depending on your interests) to be gleaned from studying how the transliterations, how they seemed to use ? for word-final "e" in "die" in some places but ? in others, etc. Anyway, still interesting, I thought. ~mark On 11/11/18 8:28 PM, Asmus Freytag via Unicode wrote: > > I agree with Beth that the text reads like a transcription of a > standard German text, not like a transcription of Yiddish, small > infidelities in vowel/consonant renderings notwithstanding. These are > either because the transcription conventions deliberately make some > substitutions (presumably there's no Hebrew letter that would directly > match an "?", so they picked "i") or because the writer, while trying > to capture standard German in this instance, is aware of a different > orthography. 
The result, before Beth tweaked it, would resemble a bit > a phonetic transcription of someone speaking standard German with a > Yiddish accent. The fact that there are no differences in grammar and > the phrasing is absolutely natural for written German is what confirms > the identification as German, rather than Yiddish text. > > Just because Yiddish is closely related to German doesn't mean that > you can simply write the former with standard German phonetics and > have it match a text in standard German to the point where there's no > distinction. I think the sample is long enough and involved enough to > give quite decent confidence in discriminating between these two > Germanic languages. Grammar, phrasing and word choice are in that > sense much better indicators than pure spelling; just as people trying > to assume some foreign accent will give themselves away by faithfully > maintaining the underlying structure of the language - that even works > if the "accent" includes a few selected bits of "foreign" word order > or grammar. In those artificial examples, there's rarely the kind of > subtle mistake that a true non-native will make. > > A./ > From unicode at unicode.org Mon Nov 12 20:59:53 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 12 Nov 2018 18:59:53 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: <93370887-36fd-db7e-fa94-aced566a8fa2@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Nov 20 14:57:57 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 20 Nov 2018 20:57:57 +0000 (GMT) Subject: The encoding of the Welsh flag Message-ID: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> In Unicode? Technical Standard #51 Unicode Emoji there is the encoding for the Welsh flag. This is in the section http://www.unicode.org/reports/tr51/#Sample_Valid_Emoji_Tag_Sequences In the Status section near the start of the document is the following. quote A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. end quote My questions are as follows please. Is that encoding for the Welsh flag included in both The Unicode Standard and ISO/IEC 10646 or is it only encoded in The Unicode Standard or is it in neither The Unicode Standard nor ISO/IEC 10646? Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? William Overington Tuesday 20 November 2018 From unicode at unicode.org Tue Nov 20 15:50:25 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 20 Nov 2018 13:50:25 -0800 Subject: The encoding of the Welsh flag In-Reply-To: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> Message-ID: <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> On 11/20/2018 12:57 PM, William_J_G Overington via Unicode wrote: > quote > > A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. > > end quote > > My questions are as follows please. 
> > Is that encoding for the Welsh flag included > > in both The Unicode Standard and ISO/IEC 10646 > > or is it only encoded in The Unicode Standard > > or is it in neither The Unicode Standard nor ISO/IEC 10646? Neither. A flag emoji is represented via a character sequence -- in this particular case by an emoji tag sequence, as specified in UTS #51. The representation of flag emoji via emoji tag sequences is *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. If you find that hard to understand, consider another example. The spelling of the word "emoji" as the sequence of Unicode characters <0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. Neither standard specifies English spelling rules; nor does either standard specify flag emoji "spelling rules". > > Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? One declares conformance to UTS #51 and declares the version of emoji that one's application supports -- including the RGI (recommended for general interchange) list of emoji one has input and display support for. If the declaration states support for the flags of England, Scotland, and Wales, then one must do so via the specified emoji tag sequences. Your interoperability derives from that. --Ken From unicode at unicode.org Wed Nov 21 10:00:36 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 21 Nov 2018 16:00:36 +0000 (GMT) Subject: The encoding of the Welsh flag In-Reply-To: <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> Message-ID: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Ken Whistler wrote as follows. > A flag emoji is represented via a character sequence -- in this particular case by an emoji tag sequence, as specified in UTS #51. > The representation of flag emoji via emoji tag sequences is *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. > If you find that hard to understand, consider another example. The spelling of the word "emoji" as the sequence of Unicode characters <0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. Neither standard specifies English spelling rules; nor does either standard specify flag emoji "spelling rules". It seems to me that the two examples are fundamentally different each from the other. The word emoji can be looked up in a dictionary and there one can find the sequence of glyphs that one needs to express that particular word. https://en.oxforddictionaries.com/definition/emoji If one then wishes to find the encoding of those glyphs, such that that particular word can become encoded as text characters in a message in an electronic system in an interoperable format, one can look in either The Unicode Standard or The ISO/IEC 10646 Standard and find code numbers. As the two standards are in synchronization one may, as I understand it, look in either. The Welsh flag can be looked up in a list of flags and the desired glyph can be found. 
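The sequence in question is defined by UTS #51 rather than by either of the two standards: the sample valid emoji tag sequence for the flag of Wales is U+1F3F4 WAVING BLACK FLAG followed by tag characters spelling the region code gbwls and a terminating CANCEL TAG. A minimal sketch in Python 3, assuming only that one wants to see the code points and the UTF-8 bytes (the variable name is invented for illustration):

    # Emoji tag sequence for the flag of Wales, as listed among the sample
    # valid emoji tag sequences in UTS #51.
    WALES_FLAG = (
        "\U0001F3F4"   # U+1F3F4 WAVING BLACK FLAG (the tag base)
        "\U000E0067"   # U+E0067 TAG LATIN SMALL LETTER G
        "\U000E0062"   # U+E0062 TAG LATIN SMALL LETTER B
        "\U000E0077"   # U+E0077 TAG LATIN SMALL LETTER W
        "\U000E006C"   # U+E006C TAG LATIN SMALL LETTER L
        "\U000E0073"   # U+E0073 TAG LATIN SMALL LETTER S
        "\U000E007F"   # U+E007F CANCEL TAG (terminates the tag sequence)
    )
    # The sequence is ordinary Unicode text: every code point is an encoded
    # character, so any conformant UTF-8 implementation will carry it whether
    # or not the receiving software renders it as a single flag glyph.
    print([hex(ord(c)) for c in WALES_FLAG])
    print(WALES_FLAG.encode("utf-8"))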
If one then wishes to find the encoding of that glyph, such that the glyph for that particular flag can become encoded as text characters in a message in an electronic system in an interoperable manner, then, as far as I am aware, that encoding cannot at this time be found in an International Standard. Also, whereas there are many languages there is only one collection of flags, as flags are intended to be mutually distinguishable from any other flag. WJGO >> Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? > One declares conformance to UTS #51 and declares the version of emoji that one's application supports -- including the RGI (recommended for general interchange) list of emoji one has input and display support for. If the declaration states support for the flags of England, Scotland, and Wales, then one must do so via the specified emoji tag sequences. Your interoperability derives from that. Yet the interoperability does not derive from an International Standard. Widening the discussion somewhat, are the encodings that are formed for glyphs, such as for Astronaut, that are not using tag characters yet are using a sequence of characters including one or more ZWJ characters, listed in both The Unicode Standard and The ISO/IEC 10646 Standard? It seems to me that tag sequences offer great possibilities for encoding, in effect a vast additional encoding space, yet for those encodings to be able to be used interoperably I opine they need to be listed in an International Standard, the International Standard in which they are listed may, but need not, be The ISO/IEC 10646 Standard. William Overington Wednesday 21 November 2018 From unicode at unicode.org Wed Nov 21 10:31:32 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 21 Nov 2018 08:31:32 -0800 Subject: The encoding of the Welsh flag In-Reply-To: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: On 11/21/2018 8:00 AM, William_J_G Overington via Unicode wrote: > Yet the interoperability does not derive from an International Standard. The interoperability that enabled your mail to be delivered to me derives in part from the MIME standard (RFC 2045 et seq.) which is not an International Standard, but is instead maintained by the Networking Working Group of IETF. The interoperability that enabled me to read the content of your mail derives from the HTML standard, which is not an International Standard, but is instead maintained by the W3C (a consortium). The interoperability of any flag emoji embedded in that content derives from Unicode Technical Standard #51, which is not an International Standard, but is instead maintained by the Unicode Consortium. These standards are all widely used *internationally*, but they are not an International Standard, which is effectively a moniker claimed by ISO for itself and its standards. But in this day and age, expecting all technology, including technology related to computational processing, distribution, interchange, and rendering of text, to wait around for any related standard to be canonized as an International Standard is just silly. The world of technology does not work that way, and frankly, folks should be damn glad that it doesn't.
--Ken From unicode at unicode.org Wed Nov 21 11:38:42 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Wed, 21 Nov 2018 17:38:42 +0000 Subject: The encoding of the Welsh flag In-Reply-To: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: What really annoys me about this is that there is no flag for Northern Ireland. The folks at CLDR did not think to ask either the UK or the Irish representatives to SC2 about this. Yes, there is no "official flag" for Northern Ireland. But there is one _universally_ used in sport, and that should have been made into an emoji at the same time when flags for Scotland, Wales, and England were made. And it still should. Michael Everson From unicode at unicode.org Wed Nov 21 12:00:56 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 21 Nov 2018 10:00:56 -0800 Subject: The encoding of the Welsh flag In-Reply-To: References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Michael, On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote: > What really annoys me about this is that there is no flag for Northern Ireland. The folks at CLDR did not think to ask either the UK or the Irish representatives to SC2 about this. Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non sequitur. If you or Andrew West or anyone else is interested in pursuing an emoji tag sequence for an emoji flag for Northern Ireland, then that should be done by submitting a proposal, with justification, to the Emoji Subcommittee, which *does* have jurisdiction. https://unicode.org/emoji/proposals.html See in particular, Section M of the selection criteria. --Ken From unicode at unicode.org Wed Nov 21 12:50:59 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 21 Nov 2018 19:50:59 +0100 Subject: The encoding of the Welsh flag In-Reply-To: <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Message-ID: We have gotten requests for this, but the stumbling block is the lack of an official N. Ireland document describing what the official flag is and should look like. "However, whilst England (St George's Cross) Scotland (St Andrew's Cross) and Wales (The Dragon) have individual regional flags, the Flags Institute in London confirms that Northern Ireland has no official regional flag." https://www.newsletter.co.uk/news/new-northern-ireland-flag-should-be-created-says-lord-kilclooney-1-5753950 Should the N. Irish decide on a flag, I don't foresee any problem adding it. Mark On Wed, Nov 21, 2018 at 7:04 PM Ken Whistler via Unicode < unicode at unicode.org> wrote: > Michael, > > On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote: > > What really annoys me about this is that there is no flag for Northern > Ireland. The folks at CLDR did not think to ask either the UK or the Irish > representatives to SC2 about this.
> > Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non > sequitur. > > If you or Andrew West or anyone else is interested in pursuing an emoji > tag sequence for an emoji flag for Northern Ireland, then that should be > done by submitting a proposal, with justification, to the Emoji > Subcommittee, which *does* have jurisdiction. > > https://unicode.org/emoji/proposals.html > > See in particular, Section M of the selection criteria. > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 04:12:16 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 12:12:16 +0200 Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers? In-Reply-To: References: Message-ID: On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ?? wrote: > > > That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use? > > One could parse for certain sequences, putting characters into a number of broad categories. Very approximately: > > junk ~= [[:cn:][:cs:][:co:]]+ > whitespace ~= [[:z:][:c:]-junk]+ > syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators > identifiers ~= [all-else]+ > > UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the focus there is on immutability. > > So an implementation could choose to follow that course, rather than the more narrowly defined identifiers in http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, one can conform to the Default Identifiers but declare a profile that expands the allowable characters. One could take a Swiftian approach, for example... Thank you and sorry about my slow reply. Why is excluding junk important? > On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode wrote: >> >> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote: >> > Considering that ruling out too much can be a problem later, but just >> > treating anything above ASCII as opaque hasn't caused trouble (that I >> > know of) for HTML other than compatibility issues with XML's stricter >> > stance, why should a programming language, if it opts to support >> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the >> > complexity of UAX #31 instead of allowing everything above ASCII in >> > identifiers? In other words, what problem does making a programming >> > language conform to UAX #31 solve? >> >> After refreshing my memory of XML history, I realize that mentioning >> XML does not helpfully illustrate my question despite the mention of >> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please >> ignore the XML part. >> >> Trying to rephrase my question more clearly: >> >> Let's assume that we are designing a computer-parseable syntax where >> tokens consisting of user-chosen characters can't occur next to each >> other and, instead, always have some syntax-reserved characters >> between them. That is, I'm talking about syntaxes that look like this >> (could be e.g. Java): >> >> ab.cd(); >> >> Here, ab and cd are tokens with user-chosen characters whereas space >> (the indent), period, parenthesis and the semicolon are >> syntax-reserved. We know that ab and cd are distinct tokens, because >> there is a period between them, and we know the opening parethesis >> ends the cd token. 
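The quoted broad-category sketch maps fairly directly onto General_Category values. A minimal sketch in Python, assuming the standard library's unicodedata module and an invented function name; it approximates the quoted categories and is not a conforming UAX #31 implementation:

    import unicodedata

    def broad_class(ch):
        # junk        ~ Cn (unassigned), Cs (surrogates), Co (private use)
        # whitespace  ~ remaining separators (Z*) and controls/format (C*)
        # syntax      ~ symbols (S*) and punctuation (P*)
        # identifiers ~ everything else
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Cs", "Co"):
            return "junk"
        if cat[0] in ("Z", "C"):
            return "whitespace"
        if cat[0] in ("S", "P"):
            return "syntax"
        return "identifier"

    # The user-chosen tokens in "ab.cd();" come out as runs of "identifier"
    # code points separated by "syntax" code points.
    print([(ch, broad_class(ch)) for ch in "ab.cd();"])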
>> >> To illustrate what I'm explicitly _not_ talking about, I'm not talking >> about a syntax like this: >> >> ????? >> >> Here ?? and ?? are user-named variable names and ? is a user-named >> operator and the distinction between different kinds of user-named >> tokens has to be known somehow in order to be able to tell that there >> are three distinct tokens: ??, ?, and ??. >> >> My question is: >> >> When designing a syntax where tokens with the user-chosen characters >> can't occur next to each other without some syntax-reserved characters >> between them, what advantages are there from limiting the user-chosen >> characters according to UAX #31 as opposed to treating any character >> that is not a syntax-reserved character as a character that can occur >> in user-named tokens? >> >> I understand that taking the latter approach allows users to mint >> tokens that on some aesthetic measure don't make sense (e.g. minting >> tokens that consist of glyphless code points), but why is it important >> to prescribe that this is prohibited as opposed to just letting users >> choose not to mint tokens that are inconvenient for them to work with >> given the behavior that their plain text editor gives to various >> characters? That is, why is conforming to UAX #31 worth the risk of >> prohibiting the use of characters that some users might want to use? >> The introduction of XID after ID and the introduction of Extended >> Hashtag Identifiers after XID is indicative of over-restriction having >> been a problem. >> >> Limiting user-minted tokens to UAX #31 does not appear to be necessary >> for security purposes considering that HTML and CSS exist in a >> particularly adversarial environment and get away with taking the >> approach that any character that isn't a syntax-reserved character is >> collected as part of a user-minted identifier. (Informally, both treat >> non-ASCII characters the same as an ASCII underscore. HTML even treats >> non-whitespace, non-U+0000 ASCII controls that way.) >> >> -- >> Henri Sivonen >> hsivonen at hsivonen.fi >> https://hsivonen.fi/ >> > -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu Nov 22 04:27:31 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 12:27:31 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ?? wrote: > > * The Python 3.3 model mentions the disadvantages of memory usage >> cliffs but doesn't mention the associated perfomance cliffs. It would >> be good to also mention that when a string manipulation causes the >> storage to expand or contract, there's a performance impact that's not >> apparent from the nature of the operation if the programmer's >> intuition works on the assumption that the programmer is dealing with >> UTF-32. >> > > The focus was on immutable string models, but I didn't make that clear. > Added some text. > Thanks. > * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM >> text node storage in Gecko, (I believe but am not 100% sure) V8 and, >> optionally, HotSpot >> ( >> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A >> ). >> That is, text has UTF-16 semantics, but if the high half of every code >> unit in a string is zero, only the lower half is stored. 
This has >> properties analogous to the Python 3.3 model, except non-BMP doesn't >> expand to UTF-32 but uses UTF-16 surrogate pairs. >> > > Thanks, will add. > V8 source code shows it has a OneByteString storage option: https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494 . From hearsay, I'm convinced that it means Latin1, but I've failed to find a clear quotable statement from a V8 developer to that affect. > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers >> have a different type in the type system than byte buffers. To go from >> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data >> has been tagged as valid UTF-8, the validity is trusted completely so >> that iteration by code point does not have "else" branches for >> malformed sequences. If data that the type system indicates to be >> valid UTF-8 wasn't actually valid, it would be nasal demon time. The >> language has a default "safe" side and an opt-in "unsafe" side. The >> unsafe side is for performing low-level operations in a way where the >> responsibility of upholding invariants is moved from the compiler to >> the programmer. It's impossible to violate the UTF-8 validity >> invariant using the safe part of the language. >> > > Added a quote based on this; please check if it is ok. > Looks accurate. Thanks. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 05:08:30 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 13:08:30 +0200 Subject: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences Message-ID: Context: https://github.com/whatwg/encoding/issues/115 Unicode Security Considerations say: "3.6.2 Some Output For All Input Character encoding conversion must also not simply skip an illegal input byte sequence. Instead, it must stop with an error or substitute a replacement character (such as U+FFFD ( ? ) REPLACEMENT CHARACTER) or an escape sequence in the output. (See also Section 3.5 Deletion of Code Points.) It is important to do this not only for byte sequences that encode characters, but also for unrecognized or "empty" state-change sequences. For example: [...] ISO-2022 shift sequences without text characters before the next shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants require at least one character in a text segment between shift sequences. Security software written to the formal specification may not detect malicious text (for example, "delete" with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)." (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input) The WHATWG Encoding Standard bakes this requirement by the means of "ISO-2022-JP output flag" (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its ISO-2022-JP decoder algorithm (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder). encoding_rs (https://github.com/hsivonen/encoding_rs) implements the WHATWG spec. 
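To make the "shift sequence without text characters before the next shift sequence" case concrete, here is a small sketch of mine in Python; the bundled codec is used only to produce the bytes, and the example relies on the encoder returning to the ASCII state at the end of its output, as ISO-2022-JP requires:

    # Two independently valid ISO-2022-JP encoder outputs whose concatenation
    # contains one escape sequence immediately followed by another.
    first = "\u65e5\u672c".encode("iso2022_jp")   # ESC $ B ... ESC ( B
    second = "\u8a9e".encode("iso2022_jp")        # ESC $ B ... ESC ( B

    combined = first + second
    # The shift back to ASCII at the end of `first` is immediately followed
    # by the shift to JIS X 0208 at the start of `second`, i.e. an "empty"
    # segment between two escape sequences.
    assert b"\x1b(B\x1b$B" in combined

    # Whether a decoder maps that empty segment to U+FFFD (per the WHATWG
    # ISO-2022-JP output flag) or silently accepts it is the behavior under
    # discussion; Python's own codec is used here only as a byte generator.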
After Gecko switched to encoding_rs from an implementation that didn't implement this U+FFFD generation behavior (uconv), a bug has been logged in the context of decoding Japanese email in Thunderbird: https://bugzilla.mozilla.org/show_bug.cgi?id=1508136 Ken Lunde also recalls seeing such email: https://github.com/whatwg/encoding/issues/115#issuecomment-440661403 The root problem seems to be that the requirement gives ISO-2022-JP the unusual and surprising property that concatenating two ISO-2022-JP outputs from a conforming encoder can result in a byte sequence that is non-conforming as input to a ISO-2022-JP decoder. Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape sequence is immediately followed by another ISO-2022-JP escape sequence. Chrome and Safari do, but their implementations of ISO-2022-JP aren't independent of each other. Moreover, Chrome's decoder implementations generally are informed by the Encoding Standard (though the ISO-2022-JP decoder specifically might not be yet), and I suspect that Safari's implementation (ICU) is either informed by Unicode Security Considerations or vice versa. The example given as rationale in Unicode Security Considerations, obfuscating the ASCII string "delete", could be accomplished by alternating between the ASCII and Roman states to that every other character is in the ASCII state and the rest of the Roman state. Is the requirement to generate U+FFFD when there is no content between ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII transitions or useless transitions between ASCII and Roman are not also required to generate U+FFFD? Would it even be feasible (in terms of interop with legacy encoders) to make useless transitions between ASCII and Roman generate U+FFFD? -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu Nov 22 05:24:49 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 22 Nov 2018 12:24:49 +0100 Subject: Unicode String Models In-Reply-To: References: Message-ID: Thanks for the review! In case you're interested, I'd also welcome feedback on Locale Identifiers Mark On Thu, Nov 22, 2018 at 11:27 AM Henri Sivonen wrote: > On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ?? wrote: > >> >> * The Python 3.3 model mentions the disadvantages of memory usage >>> cliffs but doesn't mention the associated perfomance cliffs. It would >>> be good to also mention that when a string manipulation causes the >>> storage to expand or contract, there's a performance impact that's not >>> apparent from the nature of the operation if the programmer's >>> intuition works on the assumption that the programmer is dealing with >>> UTF-32. >>> >> >> The focus was on immutable string models, but I didn't make that clear. >> Added some text. >> > > Thanks. > > >> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM >>> text node storage in Gecko, (I believe but am not 100% sure) V8 and, >>> optionally, HotSpot >>> ( >>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A >>> ). >>> That is, text has UTF-16 semantics, but if the high half of every code >>> unit in a string is zero, only the lower half is stored. This has >>> properties analogous to the Python 3.3 model, except non-BMP doesn't >>> expand to UTF-32 but uses UTF-16 surrogate pairs. >>> >> >> Thanks, will add. 
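A toy illustration of the width decision in that model (my own sketch in Python, not how SpiderMonkey, V8, or HotSpot actually implement it):

    def storage_width(s: str) -> str:
        # UTF-16/Latin1 model: if the high byte of every UTF-16 code unit is
        # zero, one byte per unit suffices; otherwise fall back to two bytes.
        units = s.encode("utf-16-le")
        high_bytes = units[1::2]
        return "one-byte (Latin-1 range)" if not any(high_bytes) else "two-byte (UTF-16)"

    print(storage_width("caf\u00e9"))         # one-byte: U+00E9 fits in the low byte
    print(storage_width("caf\u00e9 \u20ac"))  # two-byte: U+20AC needs the high byte
    print(storage_width("\U0001F600"))        # two-byte: non-BMP becomes a surrogate pair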
>> > > V8 source code shows it has a OneByteString storage option: > https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494 > . From hearsay, I'm convinced that it means Latin1, but I've failed to find > a clear quotable statement from a V8 developer to that affect. > > >> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers >>> have a different type in the type system than byte buffers. To go from >>> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data >>> has been tagged as valid UTF-8, the validity is trusted completely so >>> that iteration by code point does not have "else" branches for >>> malformed sequences. If data that the type system indicates to be >>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The >>> language has a default "safe" side and an opt-in "unsafe" side. The >>> unsafe side is for performing low-level operations in a way where the >>> responsibility of upholding invariants is moved from the compiler to >>> the programmer. It's impossible to violate the UTF-8 validity >>> invariant using the safe part of the language. >>> >> >> Added a quote based on this; please check if it is ok. >> > > Looks accurate. Thanks. > > -- > Henri Sivonen > hsivonen at hsivonen.fi > https://hsivonen.fi/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 09:27:09 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 22 Nov 2018 16:27:09 +0100 (CET) Subject: The encoding of the Welsh flag In-Reply-To: References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Message-ID: <1201142793.143999.1542900429179@ox.hosteurope.de> Mark Davis ??: > > We have gotten requests for this, but the stumbling block is the lack of an > official N. Ireland document describing what the official flag is and > should look like. Such documents are lacking for several of the RIS flag emojis as well, though, e.g. for ???? from ISO 3166-1 code `UM` (United States Outlying Islands), resulting in unknown or duplicate flags, hence confusion. The solution there would have been to exclude codes for dependent territories becoming RGI emojis. ISO 3166 provides that property. The fundamental problem of flag emojis, however, is that the most requested ones are those that have no appropriate ISO code element, simply because the people requesting them need them for representing their strive for independence from another entity, or for supranational communities. From unicode at unicode.org Thu Nov 22 03:23:11 2018 From: unicode at unicode.org (- - via Unicode) Date: Thu, 22 Nov 2018 04:23:11 -0500 (EST) Subject: Compatibility Casefold Equivalence Message-ID: <1251703928.316122.1542878591512@email.ionos.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 13:18:48 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 22 Nov 2018 12:18:48 -0700 Subject: The encoding of the Welsh flag Message-ID: <20BCF7D39DF643869175869F333C7DC1@DougEwell> Ken Whistler replied to Michael Everson: >> What really annoys me about this is that there is no flag for >> Northern Ireland. The folks at CLDR did not think to ask either the >> UK or the Irish representatives to SC2 about this. [...] 
> If you or Andrew West or anyone else is interested in pursuing an
> emoji tag sequence for an emoji flag for Northern Ireland, then that
> should be done by submitting a proposal, with justification, to the
> Emoji Subcommittee, which *does* have jurisdiction.

There is, of course, an encoding for the flag of Northern Ireland:

1F3F4 E0067 E0062 E006E E0069 E0072 E007F

where the tag characters are "gbnir" followed by TAG CANCEL.

What I suspect Michael means is that this sequence is not RGI, or "recommended for general interchange," a status which applies for flag emoji only to England, Scotland, and Wales, and not to any of the thousands of other subdivisions worldwide. The terminology currently in UTS #51 is definitely an improvement over early drafts, which explicitly labeled such sequences "not recommended," but it still leads practically everyone, evidently including Michael, to believe the sequences are invalid or non-existent.

I would certainly like to use the flag of Colorado, whose visual appearance is very much standardized, but the vicious circle of vendor support and UTS #51 categorization means no system will offer glyph support, and some systems may even reject it as invalid.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Thu Nov 22 13:29:29 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 22 Nov 2018 12:29:29 -0700
Subject: The encoding of the Welsh flag
Message-ID: 

Christoph Päper wrote:

>> We have gotten requests for this, but the stumbling block is the lack
>> of an official N. Ireland document describing what the official flag
>> is and should look like.
>
> Such documents are lacking for several of the RIS flag emojis as well,
> though, e.g. for 🇺🇲 from ISO 3166-1 code `UM` (United States Outlying
> Islands), resulting in unknown or duplicate flags, hence confusion.
> The solution there would have been to exclude codes for dependent
> territories becoming RGI emojis. ISO 3166 provides that property.

That's neither the problem nor the solution, IMHO. Even for RIS sequences, you have no guarantee of exactly how the flag will be depicted. For flags that have been recently changed, you might get the old or the new. For UM, you might get the US flag or one of the unofficially adopted flags. For Northern Ireland (if it were RGI-blessed), you might get either the Ulster Banner or St. Patrick's Saltire.

This situation is described, and explicitly so for the UM flags, in Annex B of UTS #51 under "Caveats."

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Thu Nov 22 13:58:51 2018
From: unicode at unicode.org (Carl via Unicode)
Date: Thu, 22 Nov 2018 14:58:51 -0500 (EST)
Subject: Compatibility Casefold Equivalence
In-Reply-To: <1251703928.316122.1542878591512@email.ionos.com>
References: <1251703928.316122.1542878591512@email.ionos.com>
Message-ID: <1626926067.211518.1542916731686@email.ionos.com>

(It looks like my HTML email got scrubbed, sorry for the double post)

Hi,

In Chapter 3 Section 13, the Unicode spec defines D146:

"A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"

I am trying to understand the "if and only if" part of this. Specifically, why is the outermost NFKD necessary? Could it also be an NFKC normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?

My use case is that I am trying to store user-provided tags in a database.
I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146. However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).

Thanks,
Carl

From unicode at unicode.org Sat Nov 24 16:33:15 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sat, 24 Nov 2018 14:33:15 -0800
Subject: Compatibility Casefold Equivalence
In-Reply-To: <1626926067.211518.1542916731686@email.ionos.com>
References: <1251703928.316122.1542878591512@email.ionos.com> <1626926067.211518.1542916731686@email.ionos.com>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue Nov 27 01:46:06 2018
From: unicode at unicode.org (Carl via Unicode)
Date: Tue, 27 Nov 2018 02:46:06 -0500 (EST)
Subject: Compatibility Casefold Equivalence
In-Reply-To: 
References: <1251703928.316122.1542878591512@email.ionos.com> <1626926067.211518.1542916731686@email.ionos.com>
Message-ID: <2125255467.459496.1543304766546@email.ionos.com>

Thanks for the reply. Responses inline:

> On November 24, 2018 at 5:33 PM Asmus Freytag via Unicode wrote:
>
> On 11/22/2018 11:58 AM, Carl via Unicode wrote:
> > (It looks like my HTML email got scrubbed, sorry for the double post)
> >
> > Hi,
> >
> > In Chapter 3 Section 13, the Unicode spec defines D146:
> >
> > "A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"
> >
> > I am trying to understand the "if and only if" part of this. Specifically, why is the outermost NFKD necessary? Could it also be an NFKC normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?
> >
> > My use case is that I am trying to store user-provided tags in a database. I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146. However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).
>
> Carl,
>
> you may find that some of the complications are limited to a small number of code points. In particular, classical (polytonic) Greek has some gnarly behavior wrt case; and some compatibility characters have odd edge cases.

I suspected that the number of edge cases would be small, but I lack a way of enumerating them. (i.e. I don't know what I don't know)

> I'm personally not a fan of allowing every single Unicode code point in things like usernames (or other types of identifiers). Especially, if including some code points makes the "general case" that much more complex, my personal recommendation would be to simply disallow / reject a small set of troublesome characters; especially if they aren't part of some widespread modern orthography.
>
> While Unicode is about being able to digitally represent all written text, identifiers don't follow the same rules. The main reason why people often allow "anything" is because it's easy in terms of specification. Sometimes, you may not have control over what to accept; for example if tags are generated from headers in a document, it would require some transform to handle disallowed code points.
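One possible shape for such a transform, as a sketch of mine in Python; the deny-list below is purely hypothetical and would need to be chosen per application, and it substitutes rather than silently deletes so the result still signals that something was removed:

    import unicodedata

    # Hypothetical deny-list: unassigned, surrogate, private-use and control
    # code points.
    _DENIED_CATEGORIES = {"Cn", "Cs", "Co", "Cc"}

    def transform_tag(text: str, replacement: str = "\ufffd") -> str:
        # Substitute disallowed code points instead of dropping them.
        return "".join(
            ch if unicodedata.category(ch) not in _DENIED_CATEGORIES else replacement
            for ch in text
        )

    print(transform_tag("tag\x00name"))  # 'tag\ufffdname'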
> > The identifiers doc was what I had originally planned on using, but some of the rules there are too much. For example, IIUC variation selectors are not allowed (scrubbed?), which prevents use of some emoji sequences. Also, the ID_Start and XID_Start properties are too strict (since I'm not using this in a programming language or otherwise secure environment), as they forbid leading numbers. Hashtags are close to what I want, but again, they specify a leading "#". Really the problem for me is that I don't know what liberties I can take with restricting/allowing certain characters. Being too restrictive might be culturally insensitive, but being too lax might open the system for abuse. Would it be overkill to render the tag text to a picture, hash the picture, and store that instead? It seems like it would force visually identical strings to the same set of bytes. > Case is also only one of the types of duplication you may encounter. In many South and South East Asian scripts you may encounter cases where two sequences of characters, while different, will normally render identical. Arabic also has instances of that. Finally, you may ask yourself whether your system should treat simplified and traditional Chinese ideographs as separate or as a variant not unlike the way you treat case. > > Ideally I would like the same kind of matching as my browser does when I press Ctrl-F. If simplified and traditional Chinese match, that's probably good enough. > About storing your tag data: you can obviously store them as NFC, if you like: in that case, you will have to run the operations both on the stored and on the new tag. > > > Finally, there are some cases where you can tell that two string are identical without actually carrying out the full set of operations: > > > Y = X > > > NFC(Y) = NFC(X) > > > and so on. (If these conditions are true, the full condition above must also be true). For example, let's apply > > NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) > > > on both sides of > > > NFC(Y) = NFC(X) > > > First: > > > NFD(NFC(Y)) = NFD(NFC(X)) > > > Because the two sides are equal, applying toCaseFold results in equal strings, and so on all the way to the outer NFKD. As a minor followup, TR 15 section 7 says: "NFKC(NFKD(x)) == NFKC(x)" which implies that the outer NFKD can be replaced: NFKC(toCasefold(NFKD(toCasefold(NFD(X))))) > > > In other words, you can stop the comparison at any point where the two sides are equal. From that point on, the outer operations cannot add anything. That's a good point. In my case, since one side of the equation will be stored in a DB, I believe I need to do the full transform. That said, It would be useful for in-memory comparisons. > > > A./
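For what it's worth, here is a minimal sketch of the D146 key with the early-exit idea from above; Python's unicodedata.normalize and str.casefold are used as stand-ins for the normalization forms and toCasefold, and treating str.casefold as equivalent to toCasefold in every corner case is an assumption:

    import unicodedata

    def compat_caseless_key(s: str) -> str:
        # D146: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
        s = unicodedata.normalize("NFD", s)
        s = s.casefold()
        s = unicodedata.normalize("NFKD", s)
        s = s.casefold()
        return unicodedata.normalize("NFKD", s)
        # Per the follow-up above, wrapping this result in NFKC gives an
        # equally valid and more compact stored key.

    def compat_caseless_match(x: str, y: str) -> bool:
        # Cheap early exit: if the inputs (or their NFC forms) are already
        # equal, the full condition must also hold.
        if x == y or unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y):
            return True
        return compat_caseless_key(x) == compat_caseless_key(y)

    # compat_caseless_match("\ufb01ne", "FINE") -> True (U+FB01 ligature fi)
    # For deduplicating stored tags, one option is to store NFC(tag) for
    # display alongside compat_caseless_key(tag) as the uniqueness key.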