From unicode at unicode.org Thu Nov 1 02:20:51 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:20:51 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> Message-ID: <20181101072051.38cc6a8d@JRWUBU2> On Wed, 31 Oct 2018 14:57:37 -0700 Asmus Freytag via Unicode wrote: > On 10/31/2018 10:18 AM, Marcel Schneider via Unicode wrote: >> Sad that Arabic ? and ? are still missing. > How about all the other sets of native digits? They might not be in natural use this way! Also, there is the possibility of non-spacing superscript digits, as in Devanagari, though they are chiefly not used for counting. But why limit consideration to digits? But what about oxidation states, which use spacing superscript Roman numerals - I couldn't find superscript capital 'V'. Richard. From unicode at unicode.org Thu Nov 1 02:33:28 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 01 Nov 2018 08:33:28 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: (Ken Whistler via Unicode's message of "Wed, 31 Oct 2018 12:14:36 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: <86lg6djlpz.fsf_-_@mimuw.edu.pl> On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote: > On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote: >> >> but we don't have an agreement that reproducing all variations in >> manuscripts is in scope. > > In fact, I would say that in the UTC, at least, we have an agreement > that that clearly is out of scope! > > Trying to represent all aspects of text in manuscripts, including > handwriting conventions, as plain text is hopeless. There is no > principled line to draw there before you get into arbitrary > calligraphic conventions. Your statements are perfect examples of "attacking a straw man": Straw Man (Fallacy Of Extension): attacking an exaggerated or caricatured version of your opponent's position. http://www.don-lindsay-archive.org/skeptic/arguments.html https://en.wikipedia.org/wiki/Straw_man https://en.wikipedia.org/wiki/The_Art_of_Being_Right Perhaps you are joking? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Nov 1 02:46:40 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:46:40 +0000 Subject: use vs mention (was: second attempt) In-Reply-To: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> Message-ID: <20181101074640.2866a022@JRWUBU2> On Wed, 31 Oct 2018 23:35:06 +0100 Piotr Karocki via Unicode wrote: > These are only examples of changes in meaning with or , > not all of these examples can really exist - but, then, another > question: can we know what author means? And as carbon and iodine > cannot exist, then of course CI should be interpreted as carbon on > first oxidation? 
Are you sure about the non-existence? Some pretty weird chemical species exist in interstellar space. Richard. From unicode at unicode.org Thu Nov 1 02:52:09 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 07:52:09 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> Message-ID: <20181101075209.5ffbba7d@JRWUBU2> On Wed, 31 Oct 2018 11:35:19 -0700 Asmus Freytag via Unicode wrote: > On the other hand, I'm a firm believer in applying certain styling > attributes to things like e-mail or discussion papers. Well-placed > emphasis can make such texts more readable (without requiring that > they pay attention to all other facets of "fine typography".) Unfortunately, your emails are extremely hard to read in plain text. It is even difficult to tell who wrote what. Richard. From unicode at unicode.org Thu Nov 1 08:43:08 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 06:43:08 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101075209.5ffbba7d@JRWUBU2> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> <20181101075209.5ffbba7d@JRWUBU2> Message-ID: <97890362-7550-2e43-2266-a41853b89ba7@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 10:43:21 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 08:43:21 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <86lg6djlpz.fsf_-_@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 12:23:05 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 01 Nov 2018 18:23:05 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: (Asmus Freytag via Unicode's message of "Thu, 1 Nov 2018 08:43:21 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> Message-ID: <86d0roiufa.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 12:33 AM, Janusz S. Bie? via Unicode wrote: > > On Wed, Oct 31 2018 at 12:14 -0700, Ken Whistler via Unicode wrote: > > On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote: > > > but we don't have an agreement that reproducing all variations in > manuscripts is in scope. > > > In fact, I would say that in the UTC, at least, we have an agreement > that that clearly is out of scope! > > Trying to represent all aspects of text in manuscripts, including > handwriting conventions, as plain text is hopeless. There is no > principled line to draw there before you get into arbitrary > calligraphic conventions. > > > Your statements are perfect examples of "attacking a straw man": > > > Perhaps you are joking? 
> Not sure which of us you were suggesting as the jokester here.
>
> I don't think it's a joke to recognize that there is a continuum here
> and that there is no line that can be drawn which is based on
> straightforward principles. This is a pattern that keeps surfacing the
> deeper you look at character coding questions.

Looks like you completely missed my point. Nobody ever claimed that
reproducing all variations in manuscripts is in scope for Unicode, so whom
do you want to convince that it is not?

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Thu Nov  1 12:39:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 18:39:16 +0100
Subject: UCA unnecessary collation weight 0000
Message-ID:

I just remarked that there's absolutely NO utility for the collation weight
0000 anywhere in the algorithm.

For example, UTR #10, section 3.3.1 gives a collation element:

  [.0000.0021.0002]

for COMBINING GRAVE ACCENT. However, it can also be simply:

  [.0021.0002]

for a simple reason: secondary and tertiary weights are necessarily LOWER
than any primary weight (for conformance reasons):

  any tertiary weight < any secondary weight < any primary weight

(the set of all weights for all levels is fully partitioned into disjoint
intervals, one per level, each interval containing all the weights of its
level; the intervals are ordered by decreasing level, and weights increase
within each interval).

This also means that we never need to handle 0000 weights when creating
sort keys from multiple collation elements: we can easily detect that
[.0021.0002] given above starts with the secondary weight 0021 and not a
primary weight.

Likewise, we don't need any level separator 0000 in the sort key.

This allows more interesting optimizations, and a reduction of the length
of sort keys. What this means is that we can safely implement the UCA using
basic substitutions (e.g. with a function like "string:gsub(map)" in Lua,
which uses a "map" to map source (binary) strings or regexps into target
(binary) strings).

For a level-3 collation, you then need only 3 calls to "string:gsub()" to
compute any collation:

- the first ":gsub(mapNormalize)" decomposes a source text into collation
  elements and can perform reordering to enforce a normalized order
  (possibly tuned for the tailored locale) using basic regexps;

- the second ":gsub(mapTertiary)" substitutes each collation element by its
  "intermediary" collation element plus its tertiary weight;

- the third ":gsub(mapSecondary)" substitutes each "intermediary" collation
  element by its primary weight plus its secondary weight.

The "intermediary" collation elements are just like source text, except
that the differences at the later (tertiary, then secondary) levels are
eliminated, i.e. each source collation element string is replaced by the
collation element string that has the smallest collation weights at those
levels. They just have to be encoded so that their code units are HIGHER
than any of the weights split off at those later levels.

How to do that:

- reserve the weight range between .0000 (yes! not just .0001) and .001E
  for the last (tertiary) level, and make sure that all intermediary
  collation elements use only code units of .0020 or above (this means
  that they can remain encoded in their existing UTF form!);

- reserve the weight .001F for the case where you don't want to use
  secondary differences (like letter case) and instead demote them to
  tertiary differences.
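A minimal sketch of the weight layout just described, written in Python for
brevity (the message itself frames the implementation in terms of Lua's
gsub); the range bounds and helper names below are illustrative assumptions
taken from the figures above, not part of any existing library:

```python
# Disjoint per-level weight ranges, following the layout sketched above
# (illustrative values, not DUCET values):
#   tertiary  weights : 0x0000 .. 0x000F  (0x001F reserved for demoted case)
#   secondary weights : 0x0010 .. 0x001E
#   primary   weights : 0x0020 and above
SECONDARY_MIN = 0x0010
PRIMARY_MIN = 0x0020

def weight_level(w: int) -> int:
    """Infer the collation level of a weight from its value alone."""
    if w >= PRIMARY_MIN:
        return 1
    if w >= SECONDARY_MIN:
        return 2
    return 3

def is_ignorable(element_weights, level: int) -> bool:
    """True if the element contributes nothing at or above the given level."""
    return weight_level(element_weights[0]) > level

# With disjoint ranges, a purely secondary element (e.g. a remapped
# combining accent) can be stored as [secondary, tertiary] with no
# 0x0000 placeholder: the first weight already reveals its level.
accent = [0x0011, 0x0002]            # hypothetical remapped weights
assert weight_level(accent[0]) == 2  # starts at the secondary level
assert is_ignorable(accent, 1) and not is_ignorable(accent, 2)
```

The point of the disjoint ranges is simply that the level of any weight can
be recovered from its value, so no placeholder is needed to mark levels.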
This will be used in the second mapping to decompose source collation
elements into "intermediary collation elements" plus a tertiary weight. You
may then decide to leave the tertiary weights in the substituted string, or,
because "gsub()" finds matches from left to right, to accumulate the
tertiary weights into a separate buffer, so that the substitution itself
still returns a valid UTF string containing only "intermediary collation
elements" (with all tertiary differences erased).

You can repeat the process with the next gsub() to return the primary
collation elements (still in UTF form), and separately the secondary
weights (also accumulated in a separate buffer).

Now there remain only 3 strings:

- one contains only the primary collation elements (still in UTF form, but
  using code units always higher than or equal to 0020);
- another one contains only secondary weights (between MINSECONDARYWEIGHT
  and 001F);
- another one contains only tertiary weights (between 0000 and
  MINSECONDARYWEIGHT-1).

For the rest I will assume that MINSECONDARYWEIGHT is 0010, so:

* primary weights are encoded with one or more code units in [0020..]
  (multiple code units are possible if you reserve some of these code units
  to be prefixes of longer sequences);
* secondary weights are encoded with one or more code units in [0010..001E]
  (same remark about multiple code units if you need them);
* tertiary weights are encoded with one or more code units in [0000..001F]
  (same remark about multiple code units if you need them).

The last gsub() will only reorder the primary collation elements to remap
them into a suitable binary order (it will be a simple bijective
permutation, except that the target does not have to use multiple code
units, but a single one, when there are contractions). It's always possible
to make this permutation generate integers higher than 0020. The resulting
weights can remain encodable with UTF-8 as if they were source text.

And to return the sort key, all you need is to concatenate:

* the string containing all primary weights encoded with code units in
  [0020..], then
* the string containing secondary weights encoded with code units in
  [0010..001E], then
* the string containing tertiary weights encoded with code units in
  [0000..001F].

You don't need to insert ANY [0000] as a level separator in the final sort
key, because each concatenated part of the final sort key respects the
well-formedness constraint WF2 of the UCA.

You may choose not to use tertiary weights encoded with [0000] code units
if you want the final string containing the sort key to be null-terminated.

In summary:

* There's no longer any special role given in the UCA to [0000]. More
  compaction is possible for storing the mapping of source collation
  element strings (in their original UTF encoding) to strings of collation
  weights (themselves still encoded with a UTF!).
* Any tailored collation (except those requiring preprocessing that applies
  specific reorderings, possibly performed with one or more regexp
  substitutions applied in a defined order) is specified by just one map
  per collation level, containing source UTF strings (or regexps) to
  replace by their mapped strings of collation weights.
* You are free to choose the UTF used for the source string and for the
  collation weights (these UTFs may be different, or may both be UTF-8).
  If you use a conforming UTF, the only code units you cannot use are those
  in [D800..DFFF], reserved for surrogates.
* Normal string library packages can be used to implement the UCA, even
  those that can only work with texts encoded in a valid UTF.
* Given that the resulting sort keys are valid UTF, they are displayable:
  in many circumstances, the initial part of the string (containing primary
  weights only) will display as the normal UTF encoding of readable text;
  if there are additional secondary or tertiary weights after it, because
  they are represented using C0 controls, you may still display them using
  a notation like \xNN (you only need to escape '\' if it is present as a
  literal in the readable part of the sort key containing primary weights).

Note: Isolated surrogates found in a non-conforming source string need to
be preprocessed if you want to accept them in a collator:

- You can do that by preprocessing [0000] or [D800..DFFF] into [0000]
  followed by a single code unit in [0020..], so that they form a single
  collation element [0000][0020..]: use [0000][0020] as the collation
  element representing the source [0000], and just insert a single [0000]
  before any isolated surrogate, which you'll replace by a code unit in
  [0800..0FFF]. The result will be a conforming UTF string on which your
  collator will return valid UTF strings of weights.
- If you don't want to have any [0000] within sort keys, you can also
  preprocess the source string by re-encoding [0000] into [0001][0020],
  [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into
  [0001][0800..0FFF]. Here also the result will be a conforming UTF string
  on which your collator will return valid UTF strings of weights.

From unicode at unicode.org  Thu Nov  1 15:08:05 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Thu, 1 Nov 2018 13:08:05 -0700
Subject: UCA unnecessary collation weight 0000
In-Reply-To:
References:
Message-ID:

There are lots of ways to implement the UCA.

When you want fast string comparison, the zero weights are useful for
processing -- and you don't actually assemble a sort key.

People who want sort keys usually want them to be short, so you spend time
on compression. You probably also build sort keys as byte vectors, not
uint16 vectors (because byte vectors fit into more APIs and tend to be
shorter), like ICU does using the CLDR collation data file. The CLDR root
collation data file remunges all weights into fractional byte sequences,
and leaves gaps for tailoring.

markus

From unicode at unicode.org  Thu Nov  1 15:10:16 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 21:10:16 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To:
References:
Message-ID:

For example, Figure 3 in UTR #10 contains:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0002* 0002 0002
  2 Cab  | *0706* 06D9 06EE *0000* 0020 0020 *0020* *0000* *0008* 0002 0002
  3 cáb  | *0706* 06D9 06EE *0000* 0020 0020 *0021* 0020 *0000* 0002 0002 0002 0002
  4 dab  | *0712* 06D9 06EE *0000* 0020 0020 0020 *0000* 0002 0002 0002

The 0000 weights are never needed, even if any of the source strings
("cab", "Cab", "cáb", "dab") is followed by ANY other string, or if any
other string (higher than "b") replaces their final "b".

What is really important is to understand where the input text (after
initial transforms like reordering and expansion) is broken at specific
boundaries between collatable elements.
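A small sanity check of that claim on the four strings of the figure; this
is only a sketch in Python over this one example (it does not address the
unequal-length prefix case that the level separator is designed to
guarantee, nor the fast-comparison implementations mentioned above that
never assemble sort keys at all):

```python
# Sort keys from Figure 3 of UTR #10, as lists of 16-bit weights,
# exactly as shown above (with the 0000 level separators).
with_sep = {
    "cab": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002],
    "Cab": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0008, 0x0002, 0x0002],
    "cáb": [0x0706, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0021, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002, 0x0002],
    "dab": [0x0712, 0x06D9, 0x06EE, 0x0000, 0x0020, 0x0020, 0x0020,
            0x0000, 0x0002, 0x0002, 0x0002],
}

def strip_separators(key):
    """Drop the 0000 level separators (the weights under discussion)."""
    return [w for w in key if w != 0x0000]

without_sep = {s: strip_separators(k) for s, k in with_sep.items()}

# Sort keys are compared weight by weight; Python compares lists of
# integers lexicographically, which models exactly that.
order_with = sorted(with_sep, key=with_sep.get)
order_without = sorted(without_sep, key=without_sep.get)

print(order_with)     # ['cab', 'Cab', 'cáb', 'dab']
print(order_without)  # ['cab', 'Cab', 'cáb', 'dab']
assert order_with == order_without
```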
But the boundaries between the parts of the sort key can always be inferred
from the weights themselves, for example between 06EE and 0020, or between
0020 and 0002. So this can obviously be changed to just:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE 0020 0020 *0020* *0002* 0002 0002
  2 Cab  | *0706* 06D9 06EE 0020 0020 *0020* *0008* 0002 0002
  3 cáb  | *0706* 06D9 06EE 0020 0020 *0021* 0020 0002 0002 0002 0002
  4 dab  | *0712* 06D9 06EE 0020 0020 0020 0002 0002 0002

As well (see the weights emphasized above):

* when the secondary weights at the end of the sort key are a trailing run
  of 0020 (the minimal secondary weight), you can suppress them from the
  collation key;
* when the tertiary weights at the end of the sort key are a trailing run
  of 0002 (the minimal tertiary weight), you can suppress them from the
  collation key.

This gives:

Figure 3. Comparison of Sort Keys

  String | Sort Key
  1 cab  | *0706* 06D9 06EE
  2 Cab  | *0706* 06D9 06EE *0008*
  3 cáb  | *0706* 06D9 06EE 0020 0020 *0021*
  4 dab  | *0712* 06D9 06EE

See the reduction!

On Thu, 1 Nov 2018 at 18:39, Philippe Verdy wrote:

> I just remarked that there's absolutely NO utility of the collation weight
> 0000 anywhere in the algorithm.
>
> For example in UTR #10, section 3.3.1 gives a collection element :
> [.0000.0021.0002]
> for COMBINING GRAVE ACCENT. However it can also be simply:
> [.0021.0002]
> for a simple reason: the secondary or tertiary weights are necessarily
> LOWER then any primary weight (for conformance reason):
> any tertiary weight < any secondary weight < any primary weight
> (the set of all weights for all levels is fully partitioned into disjoint
> intervals in the same order, each interval containing all its weights, so
> weights are sorted by decreasing level, then increasing weight in all cases)
>
> This also means that we never need to handle 0000 weights when creating
> sort keys from multiple collection elements, as we can easily detect that
> [.0021.0002] given above starts by a secondary weight 0021 and is not a
> primary weight.
>
> As well we don't need to use any level separator 0000 in the sort key.
>
> This allows more interesting optimizations, and reduction of length for
> sort keys.
> What this means is that we can safely implement UCA using basic
> substitions (e.g. with a function like "string:gsub(map)" in Lua which uses
> a "map" to map source (binary) strings or regexps,into target (binary)
> strings:
>
> For a level-3 collation, you just then need only 3 calls to
> "string:gsub()" to compute any collation:
>
> - the first ":gsub(mapNormalize)" can decompose a source text into
> collation elements and can perform reordering to enforce a normalized order
> (possibly tuned for the tailored locale) using basic regexps.
>
> - the second ":gsub(mapSecondary)" will substitute any collection
> elements by their "intermediary" collation elements+tertiary weight.
>
> - the third ":gsub(mapSecondary)" will substitute any "intermediary"
> collation element by their primary weight + secondary weight
>
> The "intermediary" collection elements are just like source text, except
> that higher level differences are eliminated, i.e.all source collation
> element string are replaced by the collection element string that have the
> smallest collation element weights. They must be just encoded so that they
> are HIGHER than any higher level weights.
>
> How to do that:
> - reserve the weight range between .0000 (yes!
not just .0001) and .001E > for the last (tertiary) weight, make sure that all other intermediary > collation elements will use only code units higher than .0020 (this means > that they can remain encoded in their existing UTF form!) > - reserve the weight .001F for the case where you don't want to use > secondary differences (like letter case) and them to tertiary differences. > > This will be used in the second mapping to decompose source collection > elements into "intermediary collation elements" + tertiary weight. you may > then decide to leave tertiary weights in the substitute string, or because > the "gsub()" finds match from left to right, to accumulate the tertiary > weights into a separate buffer, so that the subtitution itself will still > return a valid UTF string, containing only "intermediary collation > elements" (with all tertiary differences erased). > > You can repeat the process with the next gsub() to return the primary > collation elements" (still in UTF form), and separately the secondary > weights (also accumulable in a separate buffer). > > Now there remains only 3 strings: > - one contains only the primary collection elements (still in UTF-form, > but using code units always higher than or equal to 0020) > - another one contains only secondary weights (between MINSECONDARYWEIGHT > and 001F) > - another one contains only tertiary weights. (between 0000 and > MINSECONDARYWEIGHT-1) > > For the rest I will assume that MINSECONDARYWEIGHT is 0010, so > * primary weights are encoded with one or more code units in [0020..] > (multiple code units are possible if you reserve some of these code units > to be prefixes or longer sequences) > * secondary weights are encoded with one or more code units in > [0010..001E] (same remark about multiple code units if you need them) > * tertiary weights are encoded with one or more code units > in [0010..001F] (same remark about multiple code units if you need them) > > The last gsub() will only reorder the primary collection elements to remap > them in a suitable binary order (it will be a simple bijective permutation, > except that the target does not have to use multiple code units, but a > single one, when there are contractions). It's always possible to make this > permutation generate integers higher than 0020. The resulting weights can > remain encodable with UTF-8 as if it was source text. > > And to return the sort key, all you need is to concatenate > * the string containing all primary weights encoded with code units in > [0020..], then > * the string containing secondary weights encoded with code units in > [0010..001E], then > * the string containing tertiary weights encoded with code units in > [0000..001F]. > * you don't need to insert ANY [0000] as a level separator in the final > sort key, because each concatenated part in the final sort key respect the > wellformedness constraint WF2 of the UCA algorithm. > > You may choose to not use tertiary weights encoded with [0000] code units, > if you want the final string containing the sort key to be null-terminated. > > In summary: > * there's no longer any special role given in UCA for [0000]. More > compaction possible for storing the mapping of source collation element > strings (in their original UTF encoding) to strings of collation weights > (themselves still encodage with an UTF!). 
> * Any tailored collation (except those requiring preprocessing that may > apply specific reorderings, possibly made by using subtitution with one or > more regexps to apply, repeated in a defined order) is just specified by > one map per collation level, containing source UTF strings (or regexps) to > replace by their mapped string of collation weights. > * You are free to choose the UTF to use for the source string or for the > collation weight (these UTF may be different or may be both UTF-8. If you > use a conforming UTF, the only code units you cannot use are those in > [D800..DFFF], reserved for surrogates. > * Normal string library packages can be used to implement UCA, even those > that can only work with texts encoded with a valid UTF. > * Given that the resulting sort keys are valid UTF, they are displayable: > in many circonstances, the initial part of the string (containing primary > weights only) will display the normal UTF encoding of readable text; if > there are additional secondary or tertiary weights after it, because they > are represented using C0 controls, you may still display them using a > notation like \xNN (you only need to escape '\' if it is present as a > litteral in the readable part of the sort key containing primary weights). > > Note: Isolated surrogates found in a non-conforming source string need to > be preprocessed if you want to accept them in a collator: > - You can do that by preprocessing [0000] or [D800..DFFF], into [0000] > followed by only one codeunits in [0020..], so they form a single collation > element [0000][0020..]; use [0000][0020] as the collation element > representing the source [0000] and just insert a single [0000] before any > isolated surrogate you'll replace by a code unit in [0800..0FFF]. The > result will be a conforming UTF string on which your collator will return > valid UTF strings of weights. > - If you don't want to have any [0000] within sort keys, you can also > preprocess the source string by reencoding [0000] into [0001][0020], and > [0001] into [0001][0021], and isolated surrogates in [D800..DFFF] into > [0001][0800..0FFF]. Here also the result will be a conforming UTF string on > which your collator will return valid UTF strings of weights. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:13:46 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:13:46 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: I'm not speaking just about how collation keys will finally be stored (as uint16 or bytes, or sequences of bits with variable length); I'm just refering to the sequence of weights you generate. You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 weight, not even during processing, or un the DUCET table. Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > There are lots of ways to implement the UCA. > > When you want fast string comparison, the zero weights are useful for > processing -- and you don't actually assemble a sort key. > > People who want sort keys usually want them to be short, so you spend time > on compression. You probably also build sort keys as byte vectors not > uint16 vectors (because byte vectors fit into more APIs and tend to be > shorter), like ICU does using the CLDR collation data file. The CLDR root > collation data file remunges all weights into fractional byte sequences, > and leaves gaps for tailoring. 
> > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:31:15 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:31:15 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > When you want fast string comparison, the zero weights are useful for >> processing -- and you don't actually assemble a sort key. >> > And no, I absolutely no case where any 0000 weight is useful during processing, it does not distinguish any case, even for "fast" string comparison. Even if you don't build any sort key, may be you'll want to return 0000 it you query the weight for a specific collatable element, but this would be the same as querying if the collatable element is ignorable or not for a given specific level; this query just returns a false or true boolean, like this method of a Collator object: bool isIgnorable(int level, string collatable element) and you can also make this reliable for any collector: int getLevel(int weight); int getMinWeight(int level); int getWeightAt(string element, int level, int position); so you can use these two last functions to write the first one: bool isIgnorable(int level, string element) { return getLevel(getWeightAt(element, 0)) > getMinWeight(level); } That's enough you can write the fast comparison... What I said is not a complicate "compression" this is done on the fly, without any complex transform. All that counts is that any primary weight value is higher than any secondary weight, and any secondary weight is higher than a tertiary weight. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:34:05 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 13:34:05 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <86d0roiufa.fsf@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> Message-ID: <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:35:29 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:35:29 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: Le jeu. 1 nov. 2018 ? 21:31, Philippe Verdy a ?crit : > so you can use these two last functions to write the first one: > > bool isIgnorable(int level, string element) { > return getLevel(getWeightAt(element, 0)) > getMinWeight(level); > } > correction: return getWeightAt(element, 0) > getMinWeight(level); -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:42:02 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 1 Nov 2018 21:42:02 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209> Message-ID: <1bd96f61-d33a-258f-cd8e-9ab29db2bd92@orange.fr> On 01/11/2018 01:21, Asmus Freytag via Unicode wrote: > On 10/31/2018 3:37 PM, Marcel Schneider via Unicode wrote: >> On 31/10/2018 19:42, Asmus Freytag via Unicode wrote: [?] 
>>> It is a fallacy that all text output on a computer should match the convention >>> of "fine typography". >>> >>> Much that is written on computers represents an (unedited) first draft. Giving >>> such texts the appearance of texts, which in the day of hot metal typography, >>> was reserved for texts that were fully edited and in many cases intended for >>> posterity is doing a disservice to the reader. >>> >> The disconnect is in many people believing the user should be disabled to write >> [prevented from writing] Thank you for correcting. >> his or her language without disfiguring it by lack of decent keyboarding, and >> that such input should be considered standard for user input. Making such text >> usable for publishing needs extra work, that today many users cannot afford, >> while the mass of publishing has increased exponentially over the past decades. >> The result is garbage, following the rule of ?garbage in, garbage out.? > > No argument that there are some things that users cannot key in easily and that the common > fallbacks from the days of typewritten drafts are not really appropriate in many texts that > otherwise fall short of being "fine typography". The goal I wanted to reach by discussing and invalidating the biased and misused concept of ?fine typography? is that this thread could get rid of it, but I?m definitely unfortunate. It?s hard for you to understand that relegating abbreviation indicators into the realm of ?fine typography? recalls me what I got to hear (undisclosed for privacy) when asking that the French standard keyboard layouts (plural) support punctuation spacing with NARROW NO-BREAK SPACE, and that is closely related to the issue about social media that you pointed below. Don?t worry about users not being able to ?key in easily? what is needed for the digital representation of their language, as long as: 1. Unicode has encoded what is needed; 2. Unicode does not prohibit the use of the needed characters. The rest is up to keylayout designers. Keying in anything else is not an issue so far. > >> The real >> disservice to the reader is not to enable the inputting user to write his or her >> language correctly. A draft whose backbone is a string usable as-is for publishing >> is not a disservice, but a service to the reader, paying the reader due respect. >> Such a draft is also a service to the user, enabling him or her to streamline the >> workflow. Such streamlining brings monetary and reputational benefit to the user. > > I see a huge disconnect between "writing correctly" and "usable as-is for publishing". These > two things are not at all the same. > > Publishing involves making many choices that simply aren't necessary for more "rough & ready" > types of texts. Not every twitter or e-mail message needs to be "usable as-is for publishing", but > should allow "correctly written" text as far as possible. Not every message, especially not those whose readers expect a quick response. The reverse is true with new messages (tweets, thread lauchers, requests, invitations). As already discussed, there are several levels of correctness. We?re talking only about the accurate digital representation of human languages, which includes correct punctuation. E.g. in languages using letter apostrophe, hashtags made of a word including an apostrophe are broken when ASCII or punctuation apostrophe (close quote) is used, as we?ve been told. 
Supposedly part of this discussion would be streamlined if one could experience how easy it can be to type in one?s language?s accurate digital representation. But it?s better to be told what goes on, and what ?strawmen? we?re confused with, since, again, informed discussion brings advancement. > > When "desktop publishing" as it was called then, became available, too many people started to > obsess with form over content. You would get these beautifully laid out documents, the contents > of which barely warranted calling them a first draft. Typing in one?s language?s accurate digital representation is not being obsessed with form over content, provided that appropriate keyboarding is available. E.g. the punctuation apostrophe is on level 1 where the ASCII apostrophe is when digits are locked on level 1 on the French keyboard I?ve in use; else, digits are on level 3 where is also superscript e for ready input of most of the ordinals (except 1??/1??, 2?? for ranges, and plural with ?): 2??3??4??5??6??7??8??9??10??11??12?. Hopefully that demo makes clear what is intended. Users not needing accurate repsesentation in a given string are free to type in otherwise. The goal of this discussion is that Unicode allow accurate representation, not impose it. Actually Unicode is still imposing inaccurate representation to some languages due to TUS prohibiting the use of precomposed superscript letters in text representing human languages with standard orthography, which is what ?ordinary text? seems to boil down to. > >> That disconnect seems to originate from the time where the computer became a tool >> empowering the user to write in all of the world?s languages thanks to Unicode. > > No, this has nothing to do with Unicode / multi-script support. Why not? Accurate interoperable digital representation of French was totally impossible before version 3.0 of Unicode (bringing the *new* NARROW NO-BREAK SPACE), while before, the Standard was prevented to have such a character by misdefining the line-break property of U+2008 PUCTUATION SPACE, that has the right width and serves no purpose only because unlike related U+2007 FIGURE SPACE (but not U+2012 FIGURE DASH, mistakenly added to the list in my previous e-mail), it is not non-breakable. Useful punctuation spacing was dismissed as being too ?fine? a typography for being universally available and interoperable, while the opposite is true: It?s the only way of writing French without being at risk of conveying the impression of poor craftmanship (see below). >> The concept of ?fine typography? was then used to draw a borderline between what >> the user is supposed to input, and what he or she needs to get for publication. > > This same dividing line applies in English (or any of the other individual languages). Yes of course. The four lines above only intended to set the scene. AFAICS, the disconnect of an encoding standard designed for accuracy and interoperability, the use and the usefulness of which is intentionally throttled down in order to get non-accurate and non-interoperable digital representations of some languages, is unprecedented, and it originates from the time the Unicode Standard was set up. Spacing has been fixed, ordinal indicators are being fixed, and now, other abbreviation indicators still need fixing. 
>> In the same move, that concept was extended in a way that it should include the >> quality of the string, additionally to what _fine typography_ really is: fine >> tuning of the page layout, such as vertical justification, slight variations in >> the width of non-breakable spaces, and of course, discretionary ligatures. > > Certain elements of styling are also part of fine typography. In some cases, readying a "string" > for publication also means applying spelling conventions or grammatical conventions (for those > cases where there are ambiguities in the common language, or applying preferred word choices > or ways of formulating things that may be particular to individual publishers or types of publications. None of these is a reason not to be able to input abbreviation indicators in plain text. But for the rest, I cannot see that applying style guides? orthographies is part of fine typography, just of publishing. These parameters are at the discretion of the management. That does not preclude the input of superscript on a keyboard, and as a side note, the intake of publishers is mainly at least rich text or another markup convention, most currently TeX (for scientific publications). But Unicode promises accurate interoperable representation of all of the world?s languages in plain text. Hence, authors are advised that a good way to make TeX more human-readable is to use more Unicode. > > Using HYPHEN-MINUS instead of "EN DASH" or "HYPHEN" is perfectly OK for early stages of > drafting a text. Attempting to follow those and similar conventions during that phase forces > the author to pay attention to the wrong thing - his or her focus should be on the ideas and > the content, not the form of the document. There is some good point in that. But a close look at just these two conventions leads to significantly lessen the advantage of not using accurate punctuation in one?s drafts. 1. HYPHEN-MINUS vs EN DASH or, should be added, EM DASH: That is not possible in locales using no spacing around EM DASH. Right, SPACE, HYPHEN-MINUS, SPACE is easily replaced with SPACE, EN DASH, SPACE or any other dashing convention at a later stage. But not using a correct dash out of U+2013, U+2014 and U+2015 is not nearly useful if all these are on level 2 of three digit keys (1, 2, 3 or another range). Additionally that brings the advantage of being able to differenciate while thinking at the content. Nobody else can do that job later with a comparable efficiency. 2. HYPHEN-MINUS vs HYPHEN: That has much of a non-starter. As already discussed in detail on this List, HYPHEN is a useless duplicate encoding of HYPHEN-MINUS, which in almost all fonts has the glyph of HYPHEN and is used for the system hyphen from the automated hyphenation when a .docx is exported as a .pdf file. Using fonts designed otherwise requires either a special keyboard layout or weird replacements because the HYPHEN-MINUS in URLs and e-mail addresses must not be replaced. So using HYPHEN-MINUS everywhere a HYPHEN is intended is OK even in publishing. Only some fonts may need fixing (I don?t know more than a single one). > >> Producing a plain text string usable for publishing was then put out of reach >> of most common mortals, by using the lever of deficient keyboarding, but also >> supposedly by an ?encoding error? 
(scare quotes) in the line break property of >> U+2008 PUNCTUATION SPACE, that should be non-breakable like its siblings >> U+2007 FIGURE SPACE (still?as per UAX #14?recommended for use in numbers) and >> U+2012 FIGURE DASH to gain the narrow non-breaking space needed to space the [corrected, see above] >> triads in numbers using space as a group separator, and to space big punctuation >> in a Latin script using locale, where JTC1/SC2/WG2 had some meetings for the UCS: >> French. > > Those details should be handled in a post-processing phase for documents that are intended > for publication. Not at all, as already stated above. Making a mess of any text file that is not print-ready, is an insult to the reader. And any *French* text not spacing punctuations with NNBSP is at risk of ending up as a mess. > One of the big problem in current architectures is that things like "autocorrect" > which attempt to overcome the limitations of the current keyboards, That is another disconnect, already pointed out repeatedly. Current keyboards have no intrinsic ?limitations?, and referring to outdated keyboard layouts as a fatality is in disconnect with the reality, since all OS vendors offer facilities to complete, enhance or change the keyboard layout. > are applied at input time > only; and authors need to constantly interact with these helpers to make sure they don't mis- > fire. Correct; that is also where originated what was called ?the apostrophe catastrophe.? > Much text that is laboriously prepared this way, will not survive future revisions during > the editing process needed to get the *content* to publication quality. That only applies to files fed in an editing process. Many people are directly publishing out-of-the-keyboard, and that is where complete and readily available Unicode support matters most. Anything else can be made up by the rendering engine, as you already noted. The force of Unicode being interoperability and data exchange, I can see no technical reason not to type in Unicode on one?s keyboard, including abbrevation indicators of any kind. > > All because users have no convenient tool to "touch-up" these dashes, quotes, and spaces > in a later phase; at the same time they apply copy-editing, for example. Because once you are in a WYSIWYG environment, you cannot simply transfer the text to your text editor to apply regexes, and people need to write macros in VBA to get things done I figure out. Autocorrect is consistent with WYSIWYG. People not interested in seeing what they?re typing may wish to use LaTeX, where they can see it in another window. What I cannot see is why these important issues should preclude users from typing preformatted superscripts on their keyboard, be it via a ?superscript? dead key. Such a dead key is already standardized, but again, Karl Pentzlin?s proposal to encode the missing characters has been rejected, while in this thread we could see there is an interest for what could be called a UnicodeChem notation, a nearly plain text encoding of chemical elements, compounds and processes. > >> For everybody having beneath his or her hands a keyboard whose layout driver is >> programmed in a fully usable way, the disconnect implodes. 
At encoding and input >> levels (the only ones that are really on-topic in this thread) the sorcery called >> fine typography sums then up to nothing else than having the keyboard inserting >> fully diacriticized letters, right punctuation, accurate space characters, and >> superscript letters as ordinal indicators and abbreviation endings, depending >> on the requirements. > > In the days of typewritten manuscripts you had to follow certain conventions that allowed the > typesetter to select the intended symbols and styled letters. I'm not arguing that we should > return to where such fallbacks are used. And certainly not arguing that we should be using > ASCII fallbacks for letters with diacritics, such as "oe" for "?". > > But many issues around selecting the precise type of space or dash are not so much issues > of correct content but precisely issues of typography. That is right so far as the French national printing office recommends to use NBSP with the colon, while the industry widely uses NNBSP for colon, too, Philippe Verdy reported on this List. It also states that the same should be done for angle quotation marks, but does not so. Here is indeed matter for fine-tuning, but as stated above and below, NBSP does not work in every environment, even not in most of the most common ones where users are typing text. I still call a string publication ready where big punctuations are spaced with NNBSP uniformely. > > Some occupy an intermediate level, where it would be quite appropriate to apply them to > many automatically generated texts. (I am aware of your efforts in CLDR to that effect). Thank you for the occasion to invite everyone to join in and contribute to the oncoming surveys of Unicode?s Common Locale Data Repository. Much needs to be done in French and in many locales already present, even if the stress should naturally be on adding *new* locales still not in CLDR. > But I still believe that they have no place in content focused writing. That is only the effect of an error of perception, that is widely fueled by the deficient keyboard design not supporting automated punctuation spacing for French. See ticket in Trac. > >> Now was I talking about ?all text output on a computer?? No, I wasn?t. >> >> The computer is able to accept input of publishing-ready strings, since we have >> Unicode. Precluding the user from using the needed characters by setting up >> caveats and prohibitions in the Unicode Standard seems to me nothing else than >> an outdated operating mode. U+202F NARROW NO-BREAK SPACE, encoded in 1999 for >> Mongolian [1][2], has been readily ripped off by the French graphic industry. >> In 2014, TUS started mentioning its use in French [3]; in 2018, it put it on >> top [4]. >> That seems to me a striking example of how things encoded for other purposes >> are reused (or following a certain usage, ?abused?, ?hacked?, ?hijacked?) in >> locales like French. If it wasn?t an insult to minority languages, that >> language could be called, too, ?digitally disfavored? in a certain sense. >> >>> On the other hand, I'm a firm believer in applying certain styling attributes >>> to things like e-mail or discussion papers. Well-placed emphasis can make such >>> texts more readable (without requiring that they pay attention to all other >>> facets of "fine typography".) >> The parenthesized sidenote (that is probably the intended main content?) makes >> this paragraph wrong. I?d buy it if either the parenthesis is removed or if it >> comes after the following. 
> > Now you are copy-editing my e-mails. :) :) > > I don't read or write French on the level that I can evaluate your contention that the language > is digitally disadvantaged. It was heavily disadvantaged until U+202F?NARROW NO-BREAK SPACE was encoded and widely implemented. Implementation would have been speedy and straightforward if only it had been present from the beginning on, as U+2008 PUNCTUATION SPACE. Even the character name would have matched the purpose. Perhaps the Frenchmen implied were hindered in fixing that bug while being aware of its gravity. Then it was still disadvantaged by lack of ordinal indicators, but that is now fixed thanks to CLDR Technical Committee, past summer. Many thanks. Ultimately it is part of the languages using superscript as the abbreviation indicator, and not allowed by Unicode to use even the already encoded superscript letters. That was not fixed in CLDR for v34 because the browsers used to display the data, notably in the SurveyTool implemented as a web interface, still are not using decent fonts having Unicode conformant glyphs for all superscript letters and even digits as seen in some webmail interfaces. The resulting ransome note effect made it impossible to responsively back the use of those letters in natural languages as abbreviation indicators, because unlike phonetics using these letters in isolation, natural languages may have abbreviation endings encompassing more than the final letter. For the abbreviation of Magister like on the Polish postcard, that is not a problem. > > To some extent, software will always reflect the biases of its creators, and in some subtle ways > these will end up in conflict with conventions in other languages. In some cases, conventions > applied by human typesetters cannot easily be duplicated by software that cannot recognize > the meaning of the text, Very good point. That is exactly the reason why the author should be enabled to take full control over his or her text, and that is best and most universally done by correctly programming the layout driver of the keyboard used. > and in some cases we have seen languages abandoning these > conventions in recent reforms in favor of a set of rules that are a bit more "mechanistic" > if you will. > > In German, it used to be necessary to understand the word division to know whether or not > to apply a ligature. Some of the rules for combining words into compounds were changed > and that may have made that process more regular as well. That is a fine step forward for good typography. > > But still, forcing all users to become typesetters was one of the wrong turns taken during the > early development of publishing on computers. I don?t think so at all. Users were not ?forced? to do anything. If the autocorrect facilities helping over the deficient keyboarding were not welcome, they could easily be turned off. And professional typesetters always remained active, turning to the computer in the wake. I?ve experienced myself being able thanks to Microsoft?s word processor to do professionally looking typesetting. (As I was responsible for the content anyway, it didn?t make a difference.) But first I had to add some entries to Word?s autocorrect for tweaking the keyboard. > You seem to revel in knowing all the little > details in French usage, Not at all. That knowledge is a sheer necessity, and fortunately it is so narrow that you don?t need to know that much to digitally typeset French. But you need to know the relevant points. 
The fact that NARROW NO-BREAK SPACE is narrow doesn?t make it little, but it misleads people to classify it under ?fine typography?, even more in French where (as found in TUS, in French in the text) it?s called an ?espace fine ins?cable?. > but I bet not even all educated French people reach your level. Precisely on this point, perhaps not but that point is relevant mainly to those programming and documenting keyboard layouts. After that, punctuation spacing is automated on level 2 (just press Shift) and easily turned off by several means. I hope that will be welcome, as almost everyone in France is very careful to always space the big punctuation marks by the means available so far. And to always superscript the ordinal indicators and other abbreviation indicators, at least while handwriting. > > The best keyboard drivers won't help. Why do you see that they won?t help? > So the idea that every string is supposed to be > "publication-ready" remains a fallacy. However, there shouldn't be encoding obstacles > to creating publication-ready strings. (Whether created by copy-editors, typesetters, or > advanced tools that post-process draft texts). What I?d mainly like to see is that Unicode (supposing that you are writing on behalf of the Consortium) do not impose a division of the workflow. Everybody should be able to apply to any task the most appropriate process, no matter of how many parts it will consist. If a subset of end-users wish to input strings that won?t need to be modified in detail for publishing (except headings), Unicode is here to empower them to do so. Can that be taken for granted? > > If an Twitter message uses spaces around punctuation that are not the right width, who > cares; As pointed out in the paragraph of my previous e-mail just below, the main issue around punctuation spacing in French in non-justifying layout is not the width of the space characters, but their line-breaking property. Believe it or not, U+00A0 NO-BREAK SPACE is breakable in those environments, that are therefore messing around with spaced punctuation unless the space used is U+202F NARROW NO-BREAK SPACE. Or U+2007 FIGURE SPACE, but if we?re having to use an extra space character, we may as well pick the right one, given FIGURE SPACE is not fit for publishing, while NNBSP is. > but if your copy-editor can't prepare a manuscript for publication because of software > limitations, that's a different can of worms. My copy-editor is me. I wrote in my previous (perhaps too long, but couldn?t help) e-mail: ?Making such text usable for publishing needs extra work, that today many users cannot afford?, and: ?Such a draft is also a service to the user, enabling him or her to streamline the workflow. Such streamlining brings monetary and reputational benefit to the user.? The working scheme used with TeX or regexes is not interoperable, and the drafts are not all-purpose. A publishing-ready draft is in my opinion a plain text string that can be copy-pasted as-is ? or typed directly ? in a blog post composer form while being sure that all punctuation and punctuation spacing is fully operational. I don?t currently do this, but many people do, and are doing word processing where the same applies, given the autocorrect doesn?t use the up-to-date space and can hardly guess in every case what the user intends to type, you pointed out. > > A./ > >> With due respect, I need to add that the disconnect in that is visible only to >> French readers. Without NNBSP, punctuation ? 
la fran?aise in e-mails is messed >> up because even NBSP is ignored (I don?t know what exactly happens at backend; >> anyway at frontend it?s like a normal space in at least one e-mail client and >> in several if not all browsers, and if pasted in plain text from MS Word, it?s >> truly replaced with SP. All that makes e-mails harder to read. Correct spacing >> with punctuation in French is often considered ?fine-tuning?, but only if that >> punctuation spacing is not supported by the keyboard driver, and that?s still >> almost always the case, except on the updated version 1.1 of the b?po layout >> (and some personal prototypes not yet released). >> Best regards, Marcel From unicode at unicode.org Thu Nov 1 15:42:05 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:42:05 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: The 0000 is there in the UCA only because the DUCET is published in a format that uses it, but here also this format is useless: you never need any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET just needs to indicate what is the minimum weight assigned for every level (except the highest level where it is "implicitly" 0001, and not 0000). Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a ?crit : > There are lots of ways to implement the UCA. > > When you want fast string comparison, the zero weights are useful for > processing -- and you don't actually assemble a sort key. > > People who want sort keys usually want them to be short, so you spend time > on compression. You probably also build sort keys as byte vectors not > uint16 vectors (because byte vectors fit into more APIs and tend to be > shorter), like ICU does using the CLDR collation data file. The CLDR root > collation data file remunges all weights into fractional byte sequences, > and leaves gaps for tailoring. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 15:57:02 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 21:57:02 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: In summary, this step given in the algorithm is completely unneeded and can be dropped completely: *S3.2 *If L is not 1, append a *level separator* *Note:*The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter?in the absence of special features like contractions. For example: "abc" < "abcX" where "X" can be any character(s). Remove any reference to the "level separator" from the UCA. You never need it. As well this paragraph 7.3 Form Sort Keys *Step 3.* Construct a sort key for each collation element array by successively appending all non-zero weights from the collation element array. Figure 2 gives an example of the application of this step to one collation element array. Figure 2. Collation Element Array to Sort Key Collation Element ArraySort Key [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 can be written with this figure: Figure 2. 
Figure 2. Collation Element Array to Sort Key

  Collation Element Array:
    [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
  Sort Key:
    0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)

The parentheses mark the collation weights 0020 and 0002 that can be safely removed, provided they are respectively the minimum secondary weight and the minimum tertiary weight. Note that 0020 is still kept in two places, because those occurrences are followed by the higher weight 0021. This holds for any tailored collation, not just the DUCET.

Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy a écrit :

> The 0000 is there in the UCA only because the DUCET is published in a
> format that uses it, but there too the format is needless: you never need
> any [.0000] or [.0000.0000] in the DUCET table either. Instead, the DUCET
> just needs to indicate the minimum weight assigned for every level
> (except the highest level, where it is "implicitly" 0001, not 0000).
>
> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer a écrit :
>
>> There are lots of ways to implement the UCA.
>>
>> When you want fast string comparison, the zero weights are useful for
>> processing -- and you don't actually assemble a sort key.
>>
>> People who want sort keys usually want them to be short, so you spend time
>> on compression. You probably also build sort keys as byte vectors, not
>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>> collation data file remunges all weights into fractional byte sequences,
>> and leaves gaps for tailoring.
>>
>> markus

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Thu Nov 1 16:04:40 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 22:04:40 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To: 
References: 
Message-ID: 

So it should be clear in the UCA algorithm and in the DUCET data table that "0000" is NOT a valid weight. It is just a notational placeholder, written ".0000", indicating in the DUCET format that NO weight is assigned at that level, because the collation element is ALWAYS ignorable at this level. The DUCET could just as well have used the notation ".none", or simply dropped every ".0000" from the file (provided it contains a data entry specifying the minimum weight used for each level). The notation is only intended for humans editing the file, so that they don't need to wonder which level the first indicated weight belongs to, or remember the minimum weight for that level. But the DUCET table is actually generated by a machine and processed by machines.

Le jeu. 1 nov. 2018 à 21:57, Philippe Verdy a écrit :

> In summary, this step given in the algorithm is completely unneeded and
> can be dropped:
>
> *S3.2* If L is not 1, append a *level separator*.
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that when
> two strings of unequal length are compared, where the shorter string is a
> prefix of the longer string, the longer string is always sorted after the
> shorter, in the absence of special features like contractions. For example:
> "abc" < "abcX" where "X" can be any character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need it.
> > As well this paragraph > > 7.3 Form Sort Keys > > *Step 3.* Construct a sort key for each collation element array by > successively appending all non-zero weights from the collation element > array. Figure 2 gives an example of the application of this step to one > collation element array. > > Figure 2. Collation Element Array to Sort Key > > Collation Element ArraySort Key > [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 > 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 > > can be written with this figure: > > Figure 2. Collation Element Array to Sort Key > > Collation Element ArraySort Key > [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 > 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) > > The parentheses mark the collation weights 0020 and 0002 that can be > safely removed if they are respectively the minimum secondary weight and > minimum tertiary weight. > But note that 0020 is kept in two places as they are followed by a higher > weight 0021. This is general for any tailored collation (not just the > DUCET). > > Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a ?crit : > >> The 0000 is there in the UCA only because the DUCET is published in a >> format that uses it, but here also this format is useless: you never need >> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >> just needs to indicate what is the minimum weight assigned for every level >> (except the highest level where it is "implicitly" 0001, and not 0000). >> >> >> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >> ?crit : >> >>> There are lots of ways to implement the UCA. >>> >>> When you want fast string comparison, the zero weights are useful for >>> processing -- and you don't actually assemble a sort key. >>> >>> People who want sort keys usually want them to be short, so you spend >>> time on compression. You probably also build sort keys as byte vectors not >>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>> collation data file remunges all weights into fractional byte sequences, >>> and leaves gaps for tailoring. >>> >>> markus >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 16:30:23 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:30:23 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101213023.51380fa7@JRWUBU2> On Thu, 1 Nov 2018 22:04:40 +0100 Philippe Verdy via Unicode wrote: > The DUCET could have as well used the notation ".none", or > just dropped every ".0000" in its file (provided it contains a data > entry specifying what is the minimum weight used for each level). > This notation is only intended to be read by humans editing the file, > so they don't need to wonder what is the level of the first indicated > weight or remember what is the minimum weight for that level. > But the DUCET table is actually generated by a machine and processed > by machines. A fair few humans have tailored it by hand. Richard. 
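To make the two figures above concrete, here is a minimal Python sketch (not the normative UTS #10 algorithm, and not using real DUCET data) that builds the sort key of Figure 2 both with the 0000 level separators of step S3.2 and without them, trimming trailing runs of level-minimum weights as proposed above. The per-level minima (0001, 0020, 0002) are assumptions of the sketch, not values quoted from the DUCET.

# Collation element array of Figure 2, as (primary, secondary, tertiary) tuples;
# a 0 component means "ignorable at this level".
ELEMENTS = [(0x0706, 0x0020, 0x0002),
            (0x06D9, 0x0020, 0x0002),
            (0x0000, 0x0021, 0x0002),
            (0x06EE, 0x0020, 0x0002)]

def key_with_separators(elements):
    # UTS #10 step 3: append non-zero weights level by level,
    # with a 0000 separator before each level after the first.
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key

def key_without_separators(elements, minima=(0x0001, 0x0020, 0x0002)):
    # Variant discussed above: no separator; instead drop any trailing run of
    # the minimum weight of each level, so that a prefix simply yields a
    # shorter (hence smaller) key.
    key = []
    for level in range(3):
        weights = [ce[level] for ce in elements if ce[level] != 0]
        while weights and weights[-1] == minima[level]:
            weights.pop()
        key.extend(weights)
    return key

print(" ".join("%04X" % w for w in key_with_separators(ELEMENTS)))
# -> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
print(" ".join("%04X" % w for w in key_without_separators(ELEMENTS)))
# -> 0706 06D9 06EE 0020 0020 0021

This only illustrates the two key forms for one collation element array; whether dropping the separators and the trailing minima preserves the UCA order in every case is exactly what the rest of this thread debates.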
From unicode at unicode.org Thu Nov 1 16:32:01 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:32:01 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101213201.2a9a986d@JRWUBU2> On Thu, 1 Nov 2018 21:13:46 +0100 Philippe Verdy via Unicode wrote: > I'm not speaking just about how collation keys will finally be stored > (as uint16 or bytes, or sequences of bits with variable length); I'm > just refering to the sequence of weights you generate. > You absolutely NEVER need ANYWHERE in the UCA algorithm any 0000 > weight, not even during processing, or un the DUCET table. If you take the zero weights out, you have a different table structure to store, e.g. the CLDR fractional weight tables. Richard. From unicode at unicode.org Thu Nov 1 16:47:40 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:47:40 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181101214740.57853cc1@JRWUBU2> On Thu, 1 Nov 2018 18:39:16 +0100 Philippe Verdy via Unicode wrote: > What this means is that we can safely implement UCA using basic > substitions (e.g. with a function like "string:gsub(map)" in Lua > which uses a "map" to map source (binary) strings or regexps,into > target (binary) strings: > > For a level-3 collation, you just then need only 3 calls to > "string:gsub()" to compute any collation: > > - the first ":gsub(mapNormalize)" can decompose a source text into > collation elements and can perform reordering to enforce a normalized > order (possibly tuned for the tailored locale) using basic regexps. Are you sure of this? Will you publish the algorithm? Have you passed the official conformance tests? (Mind you, DUCET is a relatively easy UCA collation to implement successfully.) > - the second ":gsub(mapSecondary)" will substitute any collection > elements by their "intermediary" collation elements+tertiary weight. > > - the third ":gsub(mapSecondary)" will substitute any "intermediary" > collation element by their primary weight + secondary weight Richard. From unicode at unicode.org Thu Nov 1 16:56:06 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Nov 2018 21:56:06 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <86d0roiufa.fsf@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> Message-ID: <20181101215606.30dd6ced@JRWUBU2> On Thu, 01 Nov 2018 18:23:05 +0100 "Janusz S. Bie? via Unicode" wrote: > On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > > I don't think it's a joke to recognize that there is a continuum > > here and that there is no line that can be drawn which is based on > > straightforward principles. This is a pattern that keeps surfacing > > the deeper you look at character coding questions. > > Looks like you completely missed my point. Nobody ever claimed that > reproducing all variations in manuscripts is in scope of Unicode, so > whom do you want to convince that it is not? I think the counter-claim is that one will never be able to encode all the meaning-conveying distinctions of text in Unicode. Richard. 
From unicode at unicode.org Thu Nov 1 18:38:08 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 00:38:08 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: As well the step 2 of the algorithm speaks about a single "array" of collation elements. Actually it's best to create one separate array per level, and append weights for each level in the relevant array for that level. The steps S2.2 to S2.4 can do this, including for derived collation elements in section 10.1, or variable weighting in section 4. This also means that for fast string compares, the primary weights can be processed on the fly (without needing any buffering) is the primary weights are different between the two strings (including when one or both of the two strings ends, and the secondary weights or tertiary weights detected until then have not found any weight higher than the minimum weight value for each level). Otherwise: - the first secondary weight higher that the minimum secondary weght value, and all subsequent secondary weights must be buffered in a secondary buffer . - the first tertiary weight higher that the minimum secondary weght value, and all subsequent secondary weights must be buffered in a tertiary buffer. - and so on for higher levels (each buffer just needs to keep a counter, when it's first used, indicating how many weights were not buffered while processing and counting the primary weights, because all these weights were all equal to the minimum value for the relevant level) - these secondary/tertiary/etc. buffers will only be used once you reach the end of the two strings when processing the primary level and no difference was found: you'll start by comparing the initial counters in these buffers and the buffer that has the largest counter value is necessarily for the smaller compared string. If both counters are equal, then you start comparing the weights stored in each buffer, until one of the buffers ends before another (the shorter buffer is for the smaller compared string). If both weight buffers reach the end, you use the next pair of buffers built for the next level and process them with the same algorithm. Nowhere you'll ever need to consider any [.0000] weight which is just a notation in the format of the DUCET intended only to be readable by humans but never needed in any machine implementation. Now if you want to create sort keys this is similar except that you don"t have two strings to process and compare, all you want is to create separate arrays of weights for each level: each level can be encoded separately, the encoding must be made so that when you'll concatenate the encoded arrays, the first few encoded *bits* in the secondary or tertiary encodings cannot be larger or equal to the bits used by the encoding of the primary weights (this only limits how you'll encode the 1st weight in each array as its first encoding *bits* must be lower than the first bits used to encode any weight in previous levels). Nowhere you are required to encode weights exactly like their logical weight, this encoding is fully reversible and can use any suitable compression technics if needed. As long as you can safely detect when an encoding ends, because it encounters some bits (with lower values) used to start the encoding of one of the higher levels, the compression is safe. 
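As a concrete (and much simplified) sketch of that idea, assume single-byte weights no lower than 0x02 and a one-byte "mark" 0x01 that introduces each following level; the weight values below are made up for illustration. Because every weight byte compares above the mark, the concatenated key can be compared bytewise, a prefix still sorts first, and no 0000 weight ever has to exist. Real weights would of course need multi-byte codes and the compression described below.

def byte_key(per_level_weights):
    out = bytearray()
    for level, weights in enumerate(per_level_weights):
        if level:
            out.append(0x01)   # start-of-next-level mark, below any weight byte
        for w in weights:
            if not 0x02 <= w <= 0xFF:
                raise ValueError("this sketch assumes one-byte weights >= 0x02")
            out.append(w)
    return bytes(out)

# "abc" vs "abcX" with made-up weights: the prefix gets the smaller key,
# because the mark byte 0x01 compares below the longer key's next weight byte.
abc  = byte_key([[0x30, 0x32, 0x34],
                 [0x20, 0x20, 0x20],
                 [0x02, 0x02, 0x02]])
abcx = byte_key([[0x30, 0x32, 0x34, 0x36],
                 [0x20, 0x20, 0x20, 0x20],
                 [0x02, 0x02, 0x02, 0x02]])
assert abc < abcx   # bytewise comparison, no zero weight anywhere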
For each level, you can reserve only a single code used to "mark" the start of another higher level followed by some bits to indicate which level it is, then followed by the compressed code for the level made so that each weight is encoded by a code not starting by the reserved mark. That encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' (if the encoding must be readable as ASCII or UTF-8-based, and must not use any control or SPACE or isolated surrogate) and codes used to encode each weight must not start by a byte lower or equal to this mark. The binary or ASCII code units used to encode each weight must just be comparable, so that comparing codes is equivalent to compare weights represented by each code. As well, you are not required to store multiple "marks". This is just one of the possibilities to encode in the sort key which level is encoded after each "mark", and the marks are not necessarily the same before each level (their length may also vary depending on the level they are starting): these marks may be completely removed from the final encoding if the encoding/compression used allows discriminating the level used by all weights, encoded in separate sets of values. Typical compression technics are for example differencial, notably in secondary or higher levels, and run-legth encoded to skip sequences of weights all equal to the minimum weight. The code units used by the weigh encoding for each level may also need to avoid some forbidden values if needed (e.g. when encoding the weights to UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units reserved for or representing an isolate surrogate in U+D800..U+DFFF as this would create a string not conforming to any standard UTF). Once again this means that the sequence of logical weight will can sefely become a readable string, even suitable to be transmitted as plain-text using any UTF, and that compression is also possible in that case: you can create and store lot of sort keys even for very long texts However it is generally better to just encode sort keys only for a reasonnably discriminant part of the text, e.g. no sort key longer than 255 bytes (created from the start of the original texts): if you compare two sort keys and find that they are equal, and if both sort keys have this length of 255 bytes, then you'll compare the full original texts using the fast-compare algorithm: you don't need to store full sort keys in addition to the original texts. This can save lot of storage, provided that original texts are sufficiently discriminated by their start, and that cases where the sort keys were truncated to the limit of 255 bytes are exceptionnal. For short texts however, truncated sortkeys may save time at the price of a reasonnable storage cost (but sortkeys can be also encoded with roughly the same size as the original text: compression is modest for the encoded primary level. But compression is frequently very effective for higher levels where their smaller weight also have less possible variations of value, in a smaller set. Notably for the secondary level used to encode case differences, only 3 bits are enough per weight, and you just need to reserve the 3-bit value "000" as the "mark" for indicating the start of another higher level, while encoding secondary weights as "001" to "111". (This means that primary levels have to be encoded so that none of their encoded primary weights are starting with "000" marking the start of the secondary level. 
So primary weights can be encoded in patterns starting by "0001", "001", "01", or "1" and followed by other bits: this allows encoding them as readable UTF-8 if these characters are all different at primary level, excluding only the 16 first C0 controls which need to be preprocessed into escape sequences using the first permitted C0 control as an escape, and escaping that C0 control itself). The third level, started by the mark "00" and followed by the encoded weights indicating this is a tertiary level and not an higher level, will also be used to encode a small set of weights (in most locales, this is not more than 8 or 16, so you need only 3 or 4 bits to encode weights (using differential coding on 3-bits, you reserve "000" as the "mark" for the next higher level, then use "001" to "111" to encode differencial weights, the differencial weights being initially based on the minimum tertiary weight, you'll use the bit pattern "001" to encode the most frequent minimum tertiary weight, and patterns "01" to "11" plus additional bits to encode other positive or negative differences of tertiary weights, or to use run-length compression). Here also it is possible to map the patterns so that the encoded secondary weight will be readable valid UTF-8. The fourth level, started by the mark "000" can use the pattern "001" to encode the most frequent minimum quaternary weight, and patterns "010" to "011" followed by other bits to differentially encode the quaternary weights. Here again it is possible to create an encoding for quaternary weights that can use some run-length compression and can also be readable valid UTF-8! And so on. Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a ?crit : > So it should be clear in the UCA algorithm and in the DUCET datatable that > "0000" is NOT a valid weight > It is just a notational placeholder used as ".0000", only indicating in > the DUCET format that there's NO weight assigned at the indicated level, > because the collation element is ALWAYS ignorable at this level. > The DUCET could have as well used the notation ".none", or just dropped > every ".0000" in its file (provided it contains a data entry specifying > what is the minimum weight used for each level). This notation is only > intended to be read by humans editing the file, so they don't need to > wonder what is the level of the first indicated weight or remember what is > the minimum weight for that level. > But the DUCET table is actually generated by a machine and processed by > machines. > > > > Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a ?crit : > >> In summary, this step given in the algorithm is completely unneeded and >> can be dropped completely: >> >> *S3.2 *If L is not 1, append a *level >> separator* >> >> *Note:*The level separator is zero (0000), which is guaranteed to be >> lower than any weight in the resulting sort key. This guarantees that when >> two strings of unequal length are compared, where the shorter string is a >> prefix of the longer string, the longer string is always sorted after the >> shorter?in the absence of special features like contractions. For example: >> "abc" < "abcX" where "X" can be any character(s). >> >> Remove any reference to the "level separator" from the UCA. You never >> need it. >> >> As well this paragraph >> >> 7.3 Form Sort Keys >> >> *Step 3.* Construct a sort key for each collation element array by >> successively appending all non-zero weights from the collation element >> array. 
Figure 2 gives an example of the application of this step to one >> collation element array. >> >> Figure 2. Collation Element Array to Sort Key >> >> Collation Element ArraySort Key >> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706 >> 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002 >> >> can be written with this figure: >> >> Figure 2. Collation Element Array to Sort Key >> >> Collation Element ArraySort Key >> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >> >> The parentheses mark the collation weights 0020 and 0002 that can be >> safely removed if they are respectively the minimum secondary weight and >> minimum tertiary weight. >> But note that 0020 is kept in two places as they are followed by a higher >> weight 0021. This is general for any tailored collation (not just the >> DUCET). >> >> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >> ?crit : >> >>> The 0000 is there in the UCA only because the DUCET is published in a >>> format that uses it, but here also this format is useless: you never need >>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>> just needs to indicate what is the minimum weight assigned for every level >>> (except the highest level where it is "implicitly" 0001, and not 0000). >>> >>> >>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>> ?crit : >>> >>>> There are lots of ways to implement the UCA. >>>> >>>> When you want fast string comparison, the zero weights are useful for >>>> processing -- and you don't actually assemble a sort key. >>>> >>>> People who want sort keys usually want them to be short, so you spend >>>> time on compression. You probably also build sort keys as byte vectors not >>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>> collation data file remunges all weights into fractional byte sequences, >>>> and leaves gaps for tailoring. >>>> >>>> markus >>>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 1 21:45:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 02:45:27 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101215606.30dd6ced@JRWUBU2> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: Richard Wordingham responded to Janusz S. Bie?, >> ... Nobody ever claimed that reproducing all variations >> in manuscripts is in scope of Unicode, so whom do you want >> to convince that it is not? > > I think the counter-claim is that one will never be able > to encode all the meaning-conveying distinctions of text > in Unicode. I think that the general agreement is that Unicode plain text isn't intended for preserving stylistic differences.? The dilemma is that opinions differ as to what constitutes a stylistic difference. 
If there had been an "International Typewriter Usage Consortium" a hundred years ago which had issued an edict like "the underscore is placed on the keyboard for the explicit purpose of typing empty lines for 'fill-in-the-blank' forms, and must never be used by the typist to underline any other element of type", then that consortium would have been dictating how users perceive their own written symbols along with preventing users from establishing new conventions using existing symbols, experimenting, or innovating. Some people consider that Unicode is essentially doing the same kind of thing.? It's *that* perception which needs to be addressed, perhaps with FAQs and education, or with some kind of revisiting and rethinking.? Or both. From unicode at unicode.org Thu Nov 1 21:59:46 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 02:59:46 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Alphabetic script users write things the way they are spelled and spell things the way they are written.? The abbreviation in question as written consists of three recognizable symbols.? An "M", a superscript "r", and an equal sign (= two lines).? It can be printed, handwritten, or in fraktur; it will still consist of those same three recognizable symbols. We're supposed to be preserving the past, not editing it or revising it. From unicode at unicode.org Fri Nov 2 00:22:59 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 1 Nov 2018 22:22:59 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 00:44:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 06:44:35 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> (Asmus Freytag via Unicode's message of "Thu, 1 Nov 2018 13:34:05 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <923eca1e-53d3-ed49-58c6-fe0b7a5ac508@ix.netcom.com> Message-ID: <86r2g4uj7g.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 13:34 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 10:23 AM, Janusz S. Bie? via Unicode wrote: [...] > Looks like you completely missed my point. Nobody ever claimed that > reproducing all variations in manuscripts is in scope of Unicode, so > whom do you want to convince that it is not? 
> > Looks like you are missing my point about there being a continuum with > not clear lines that can be perfectly drawn a-priori. Why do you think so? There is nothing in my posts which can be used to support your claim. Perhaps you confused me with some other poster? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 01:05:06 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 07:05:06 +0100 Subject: mail attribution (was: A sign/abbreviation for "magister") References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> <20181101075209.5ffbba7d@JRWUBU2> <97890362-7550-2e43-2266-a41853b89ba7@ix.netcom.com> Message-ID: <865zxgui99.fsf@mimuw.edu.pl> On Thu, Nov 01 2018 at 6:43 -0700, Asmus Freytag via Unicode wrote: > On 11/1/2018 12:52 AM, Richard Wordingham via Unicode wrote: > > On Wed, 31 Oct 2018 11:35:19 -0700 > Asmus Freytag via Unicode wrote: [...] > Unfortunately, your emails are extremely hard to read in plain text. > It is even difficult to tell who wrote what. My previous mail is unfortunately an example. > > Not sure why that is. After they make the round trip, they look fine > to me. When displaying your HTML mail, Emacs Gnus doesn't show correctly the attributions. If I forget to edit it by hand when replying, we get the confusion like in my previous mail. I guess I should submit this as a bug or feature request to Emacs developers. Perhaps Richard Wordingham should do the same for the mail agent he uses. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 02:16:35 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 07:16:35 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: Asmus Freytag wrote, > Alphabetic script users' handwriting does not match > print in all features. Traditional German handwriting > used a line like a macron over the letter 'u' to > distinguish it from 'n'. Rendering this with a > u-macron in print would be the height of absurdity. If German text were displayed with a traditional German handwriting (cursive) font, then every "u" would display with a macron.? (Except the ones with umlauts.)? That's because the macron is part and parcel of the identity of the stylistic variant (cursive) of the letter, not because the addition of the macron makes a stylistic variation.? It would indeed be silly to encode such macrons in data derived from a traditional German handwriting specimen.? Hopefully most everyone here agrees with that. We all seem to accept that, for example, d = d = d = d. We all don't seem to agree that d # d?. Or that "Mr." # "Mr" # "M?" # "M??" # "M:r". 
From unicode at unicode.org Fri Nov 2 03:54:36 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Fri, 2 Nov 2018 08:54:36 +0000 (GMT) Subject: A sign/abbreviation for "magister" References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: On 2018-11-02, James Kass via Unicode wrote: > Alphabetic script users write things the way they are spelled and spell > things the way they are written.? The abbreviation in question as > written consists of three recognizable symbols.? An "M", a superscript > "r", and an equal sign (= two lines).? It can be printed, handwritten, That's not true. The squiggle under the r is a squiggle - it is a matter of interpretation (on which there was some discussion a hundred messages up-thread or so :) whether it was intended to be = . Just as it is a matter of interpretation whether the superscript and squiggle were deeply meaningful to the writer, or whether they were just a stylistic flourish for Mr. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Fri Nov 2 04:48:01 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 09:48:01 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: Julian Bradfield wrote, >> consists of three recognizable symbols.? An "M", a superscript >> "r", and an equal sign (= two lines).? It can be printed, handwritten, > > That's not true. The squiggle under the r is a squiggle - it is a > matter of interpretation (on which there was some discussion a hundred > messages up-thread or so :) whether it was intended to be = . I recall Asmus pointing out that the Z-like squiggle was likely a handwritten "=" and that there was some agreement to this, but didn't realize that it was in dispute.? FWIW, I agree that the squiggle which looks kind of like "?" is simply the cursive, stylistic variant of "=", especially when written quickly. > Just as it is a matter of interpretation whether the superscript and > squiggle were deeply meaningful to the writer, or whether they were > just a stylistic flourish for Mr. A third possibility is that the double-underlined superscript was a writing/spelling convention of the time for writing/spelling abbreviations. Even if someone produced contemporary Polish manuscripts abbreviating magister as "Mr", it could be argued that the two writers were simply using different conventions. 
From unicode at unicode.org Fri Nov 2 06:31:06 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 2 Nov 2018 11:31:06 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> Suppose someone found a hundred year old form from Poland which included a section for "sign your name" and "print your name" which had been filled out by a man with the typically Polish name of Bogus McCoy?? And he was a Magister, to boot!? And proud of it. If he signed the magister abbreviation using double-underlined superscript and likewise his surname *and* printed it the same way -- it might still be arguable as to whether it was a writing/spelling or a stylish distinction, I suppose. But if he signed using double-underlined superscripts and printed using baseline lower case Latin letters, *that* might be persuasive. Doesn't seem likely, though, does it? (Bogus?aw is a legitimate Polish masculine given name.? Its nickname is Bogus.? McCoy is not, however, a typical Polish surname.? The snarky combination of "Bogus McCoy" was irresistible to someone of my character and temperament.? "Bogus" is American slang for fake and "McCoy" connotes being genuine, as in "the real McCoy".) From unicode at unicode.org Fri Nov 2 07:09:51 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Fri, 2 Nov 2018 05:09:51 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> Message-ID: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 08:03:37 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 2 Nov 2018 14:03:37 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: You may not like the format of the data, but you are not bound to it. If you don't like the data format (eg you want [.0021.0002] instead of [.0000.0021.0002]), you can transform it however you want as long as you get the same answer, as it says here: http://unicode.org/reports/tr10/#Conformance ?The Unicode Collation Algorithm is a logical specification. Implementations are free to change any part of the algorithm as long as any two strings compared by the implementation are ordered the same as they would be by the algorithm as specified. Implementations may also use a different format for the data in the Default Unicode Collation Element Table. The sort key is a logical intermediate object: if an implementation produces the same results in comparison of strings, the sort keys can differ in format from what is specified in this document. (See Section 9, Implementation Notes.)? 
That is what is done, for example, in ICU's implementation. See http://demo.icu-project.org/icu-bin/collation.html and turn on "raw collation elements" and "sort keys" to see the transformed collation elements (from the DUCET + CLDR) and the resulting sort keys. a =>[29,05,_05] => 29 , 05 , 05 . a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . ? => A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . ? => Mark On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > As well the step 2 of the algorithm speaks about a single "array" of > collation elements. Actually it's best to create one separate array per > level, and append weights for each level in the relevant array for that > level. > The steps S2.2 to S2.4 can do this, including for derived collation > elements in section 10.1, or variable weighting in section 4. > > This also means that for fast string compares, the primary weights can be > processed on the fly (without needing any buffering) is the primary weights > are different between the two strings (including when one or both of the > two strings ends, and the secondary weights or tertiary weights detected > until then have not found any weight higher than the minimum weight value > for each level). > Otherwise: > - the first secondary weight higher that the minimum secondary weght > value, and all subsequent secondary weights must be buffered in a > secondary buffer . > - the first tertiary weight higher that the minimum secondary weght value, > and all subsequent secondary weights must be buffered in a tertiary buffer. > - and so on for higher levels (each buffer just needs to keep a counter, > when it's first used, indicating how many weights were not buffered while > processing and counting the primary weights, because all these weights were > all equal to the minimum value for the relevant level) > - these secondary/tertiary/etc. buffers will only be used once you reach > the end of the two strings when processing the primary level and no > difference was found: you'll start by comparing the initial counters in > these buffers and the buffer that has the largest counter value is > necessarily for the smaller compared string. If both counters are equal, > then you start comparing the weights stored in each buffer, until one of > the buffers ends before another (the shorter buffer is for the smaller > compared string). If both weight buffers reach the end, you use the next > pair of buffers built for the next level and process them with the same > algorithm. > > Nowhere you'll ever need to consider any [.0000] weight which is just a > notation in the format of the DUCET intended only to be readable by humans > but never needed in any machine implementation. > > Now if you want to create sort keys this is similar except that you don"t > have two strings to process and compare, all you want is to create separate > arrays of weights for each level: each level can be encoded separately, the > encoding must be made so that when you'll concatenate the encoded arrays, > the first few encoded *bits* in the secondary or tertiary encodings cannot > be larger or equal to the bits used by the encoding of the primary weights > (this only limits how you'll encode the 1st weight in each array as its > first encoding *bits* must be lower than the first bits used to encode any > weight in previous levels). 
> > Nowhere you are required to encode weights exactly like their logical > weight, this encoding is fully reversible and can use any suitable > compression technics if needed. As long as you can safely detect when an > encoding ends, because it encounters some bits (with lower values) used to > start the encoding of one of the higher levels, the compression is safe. > > For each level, you can reserve only a single code used to "mark" the > start of another higher level followed by some bits to indicate which level > it is, then followed by the compressed code for the level made so that each > weight is encoded by a code not starting by the reserved mark. That > encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' > (if the encoding must be readable as ASCII or UTF-8-based, and must not use > any control or SPACE or isolated surrogate) and codes used to encode each > weight must not start by a byte lower or equal to this mark. The binary or > ASCII code units used to encode each weight must just be comparable, so > that comparing codes is equivalent to compare weights represented by each > code. > > As well, you are not required to store multiple "marks". This is just one > of the possibilities to encode in the sort key which level is encoded after > each "mark", and the marks are not necessarily the same before each level > (their length may also vary depending on the level they are starting): > these marks may be completely removed from the final encoding if the > encoding/compression used allows discriminating the level used by all > weights, encoded in separate sets of values. > > Typical compression technics are for example differencial, notably in > secondary or higher levels, and run-legth encoded to skip sequences of > weights all equal to the minimum weight. > > The code units used by the weigh encoding for each level may also need to > avoid some forbidden values if needed (e.g. when encoding the weights to > UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units > reserved for or representing an isolate surrogate in U+D800..U+DFFF as this > would create a string not conforming to any standard UTF). > > Once again this means that the sequence of logical weight will can sefely > become a readable string, even suitable to be transmitted as plain-text > using any UTF, and that compression is also possible in that case: you can > create and store lot of sort keys even for very long texts > > However it is generally better to just encode sort keys only for a > reasonnably discriminant part of the text, e.g. no sort key longer than 255 > bytes (created from the start of the original texts): if you compare two > sort keys and find that they are equal, and if both sort keys have this > length of 255 bytes, then you'll compare the full original texts using the > fast-compare algorithm: you don't need to store full sort keys in addition > to the original texts. This can save lot of storage, provided that original > texts are sufficiently discriminated by their start, and that cases where > the sort keys were truncated to the limit of 255 bytes are exceptionnal. > > For short texts however, truncated sortkeys may save time at the price of > a reasonnable storage cost (but sortkeys can be also encoded with roughly > the same size as the original text: compression is modest for the encoded > primary level. 
But compression is frequently very effective for higher > levels where their smaller weight also have less possible variations of > value, in a smaller set. > > Notably for the secondary level used to encode case differences, only 3 > bits are enough per weight, and you just need to reserve the 3-bit value > "000" as the "mark" for indicating the start of another higher level, while > encoding secondary weights as "001" to "111". > > (This means that primary levels have to be encoded so that none of their > encoded primary weights are starting with "000" marking the start of the > secondary level. So primary weights can be encoded in patterns starting by > "0001", "001", "01", or "1" and followed by other bits: this allows > encoding them as readable UTF-8 if these characters are all different at > primary level, excluding only the 16 first C0 controls which need to be > preprocessed into escape sequences using the first permitted C0 control as > an escape, and escaping that C0 control itself). > > The third level, started by the mark "00" and followed by the encoded > weights indicating this is a tertiary level and not an higher level, will > also be used to encode a small set of weights (in most locales, this is not > more than 8 or 16, so you need only 3 or 4 bits to encode weights (using > differential coding on 3-bits, you reserve "000" as the "mark" for the next > higher level, then use "001" to "111" to encode differencial weights, the > differencial weights being initially based on the minimum tertiary weight, > you'll use the bit pattern "001" to encode the most frequent minimum > tertiary weight, and patterns "01" to "11" plus additional bits to encode > other positive or negative differences of tertiary weights, or to use > run-length compression). Here also it is possible to map the patterns so > that the encoded secondary weight will be readable valid UTF-8. > > The fourth level, started by the mark "000" can use the pattern "001" to > encode the most frequent minimum quaternary weight, and patterns "010" to > "011" followed by other bits to differentially encode the quaternary > weights. Here again it is possible to create an encoding for quaternary > weights that can use some run-length compression and can also be readable > valid UTF-8! > > And so on. > > > > > > > > > Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a ?crit : > >> So it should be clear in the UCA algorithm and in the DUCET datatable >> that "0000" is NOT a valid weight >> It is just a notational placeholder used as ".0000", only indicating in >> the DUCET format that there's NO weight assigned at the indicated level, >> because the collation element is ALWAYS ignorable at this level. >> The DUCET could have as well used the notation ".none", or just dropped >> every ".0000" in its file (provided it contains a data entry specifying >> what is the minimum weight used for each level). This notation is only >> intended to be read by humans editing the file, so they don't need to >> wonder what is the level of the first indicated weight or remember what is >> the minimum weight for that level. >> But the DUCET table is actually generated by a machine and processed by >> machines. >> >> >> >> Le jeu. 1 nov. 2018 ? 
21:57, Philippe Verdy a >> ?crit : >> >>> In summary, this step given in the algorithm is completely unneeded and >>> can be dropped completely: >>> >>> *S3.2 *If L is not 1, append a *level >>> separator* >>> >>> *Note:*The level separator is zero (0000), which is guaranteed to be >>> lower than any weight in the resulting sort key. This guarantees that when >>> two strings of unequal length are compared, where the shorter string is a >>> prefix of the longer string, the longer string is always sorted after the >>> shorter?in the absence of special features like contractions. For example: >>> "abc" < "abcX" where "X" can be any character(s). >>> >>> Remove any reference to the "level separator" from the UCA. You never >>> need it. >>> >>> As well this paragraph >>> >>> 7.3 Form Sort Keys >>> >>> *Step 3.* Construct a sort key for each collation element array by >>> successively appending all non-zero weights from the collation element >>> array. Figure 2 gives an example of the application of this step to one >>> collation element array. >>> >>> Figure 2. Collation Element Array to Sort Key >>> >>> Collation Element ArraySort Key >>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>> 0002 0002 0002 >>> >>> can be written with this figure: >>> >>> Figure 2. Collation Element Array to Sort Key >>> >>> Collation Element ArraySort Key >>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>> >>> The parentheses mark the collation weights 0020 and 0002 that can be >>> safely removed if they are respectively the minimum secondary weight and >>> minimum tertiary weight. >>> But note that 0020 is kept in two places as they are followed by a >>> higher weight 0021. This is general for any tailored collation (not just >>> the DUCET). >>> >>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>> ?crit : >>> >>>> The 0000 is there in the UCA only because the DUCET is published in a >>>> format that uses it, but here also this format is useless: you never need >>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>> just needs to indicate what is the minimum weight assigned for every level >>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>> ?crit : >>>> >>>>> There are lots of ways to implement the UCA. >>>>> >>>>> When you want fast string comparison, the zero weights are useful for >>>>> processing -- and you don't actually assemble a sort key. >>>>> >>>>> People who want sort keys usually want them to be short, so you spend >>>>> time on compression. You probably also build sort keys as byte vectors not >>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>>> collation data file remunges all weights into fractional byte sequences, >>>>> and leaves gaps for tailoring. >>>>> >>>>> markus >>>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Nov 2 08:44:25 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 2 Nov 2018 13:44:25 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: I write my 7?s and Z?s with a horizontal line through them. ? is encoded not for this purpose, but because Z and ? are distinct in orthographies for varieties of Tatar, Chechen, Karelian, and Mongolian. This is a contemporary writing convention but it does not argue for a new SEVEN WITH STROKE character or that I should use ? rather than Z when I write *?an?ibar. Michael Everson > On 2 Nov 2018, at 09:48, James Kass via Unicode wrote: > > A third possibility is that the double-underlined superscript was a writing/spelling convention of the time for writing/spelling abbreviations. From unicode at unicode.org Fri Nov 2 08:47:24 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 14:47:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181101215606.30dd6ced@JRWUBU2> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> Message-ID: <94d21f1f-7adc-c433-38ce-465383daca01@orange.fr> On 01/11/2018 22:56, Richard Wordingham via Unicode wrote: > On Thu, 01 Nov 2018 18:23:05 +0100 > "Janusz S. Bie? via Unicode" wrote: > >> On Thu, Nov 01 2018 at 8:43 -0700, Asmus Freytag via Unicode wrote: > >>> I don't think it's a joke to recognize that there is a continuum As a sidenote: I remember something called the "continuum bias" but turn out unable to retrieve a relevant page on the internet. >>> here and that there is no line that can be drawn which is based on >>> straightforward principles. This is a pattern that keeps surfacing >>> the deeper you look at character coding questions. >> >> Looks like you completely missed my point. Nobody ever claimed that >> reproducing all variations in manuscripts is in scope of Unicode, so >> whom do you want to convince that it is not? > > I think the counter-claim is that one will never be able to encode all > the meaning-conveying distinctions of text in Unicode. Much is already done using variation selectors, so I can easily figure out that UTC will allow one of the 200+ already encoded variation selectors to be defined as directing the rendering engine to add a double line below a superscript abbreviation indicator, and another one to add a single line, according to mainstream ordinal indicators having one or zero underlines depending on the typeface, and NUMERO SIGN showing currently two lines like the "Magister" abbreviation on the Polish postcard. Another option would be using the variation selector scheme to make any letter an abbreviation indicator needing appropriate display in superscript plus zero through two underlines. Personally I wouldn?t favor this scheme for Latin abbreviations, given using preformatted superscripts is most straightforward. 
Best regards, Marcel From unicode at unicode.org Fri Nov 2 08:54:19 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 14:54:19 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: It's not just a question of "I like it or not". But the fact that the standard makes the presence of 0000 required in some steps, and the requirement is in fact wrong: this is in fact NEVER required to create an equivalent collation order. these steps are completely unnecessary and should be removed. Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a ?crit : > You may not like the format of the data, but you are not bound to it. If > you don't like the data format (eg you want [.0021.0002] instead of > [.0000.0021.0002]), you can transform it however you want as long as you > get the same answer, as it says here: > > http://unicode.org/reports/tr10/#Conformance > ?The Unicode Collation Algorithm is a logical specification. > Implementations are free to change any part of the algorithm as long as any > two strings compared by the implementation are ordered the same as they > would be by the algorithm as specified. Implementations may also use a > different format for the data in the Default Unicode Collation Element > Table. The sort key is a logical intermediate object: if an implementation > produces the same results in comparison of strings, the sort keys can > differ in format from what is specified in this document. (See Section 9, > Implementation Notes.)? > > > That is what is done, for example, in ICU's implementation. See > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw > collation elements" and "sort keys" to see the transformed collation > elements (from the DUCET + CLDR) and the resulting sort keys. > > a =>[29,05,_05] => 29 , 05 , 05 . > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . > ? => > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . > ? => > > Mark > > > On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> As well the step 2 of the algorithm speaks about a single "array" of >> collation elements. Actually it's best to create one separate array per >> level, and append weights for each level in the relevant array for that >> level. >> The steps S2.2 to S2.4 can do this, including for derived collation >> elements in section 10.1, or variable weighting in section 4. >> >> This also means that for fast string compares, the primary weights can be >> processed on the fly (without needing any buffering) is the primary weights >> are different between the two strings (including when one or both of the >> two strings ends, and the secondary weights or tertiary weights detected >> until then have not found any weight higher than the minimum weight value >> for each level). >> Otherwise: >> - the first secondary weight higher that the minimum secondary weght >> value, and all subsequent secondary weights must be buffered in a >> secondary buffer . >> - the first tertiary weight higher that the minimum secondary weght >> value, and all subsequent secondary weights must be buffered in a tertiary >> buffer. >> - and so on for higher levels (each buffer just needs to keep a counter, >> when it's first used, indicating how many weights were not buffered while >> processing and counting the primary weights, because all these weights were >> all equal to the minimum value for the relevant level) >> - these secondary/tertiary/etc. 
buffers will only be used once you reach >> the end of the two strings when processing the primary level and no >> difference was found: you'll start by comparing the initial counters in >> these buffers and the buffer that has the largest counter value is >> necessarily for the smaller compared string. If both counters are equal, >> then you start comparing the weights stored in each buffer, until one of >> the buffers ends before another (the shorter buffer is for the smaller >> compared string). If both weight buffers reach the end, you use the next >> pair of buffers built for the next level and process them with the same >> algorithm. >> >> Nowhere you'll ever need to consider any [.0000] weight which is just a >> notation in the format of the DUCET intended only to be readable by humans >> but never needed in any machine implementation. >> >> Now if you want to create sort keys this is similar except that you don"t >> have two strings to process and compare, all you want is to create separate >> arrays of weights for each level: each level can be encoded separately, the >> encoding must be made so that when you'll concatenate the encoded arrays, >> the first few encoded *bits* in the secondary or tertiary encodings cannot >> be larger or equal to the bits used by the encoding of the primary weights >> (this only limits how you'll encode the 1st weight in each array as its >> first encoding *bits* must be lower than the first bits used to encode any >> weight in previous levels). >> >> Nowhere you are required to encode weights exactly like their logical >> weight, this encoding is fully reversible and can use any suitable >> compression technics if needed. As long as you can safely detect when an >> encoding ends, because it encounters some bits (with lower values) used to >> start the encoding of one of the higher levels, the compression is safe. >> >> For each level, you can reserve only a single code used to "mark" the >> start of another higher level followed by some bits to indicate which level >> it is, then followed by the compressed code for the level made so that each >> weight is encoded by a code not starting by the reserved mark. That >> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >> any control or SPACE or isolated surrogate) and codes used to encode each >> weight must not start by a byte lower or equal to this mark. The binary or >> ASCII code units used to encode each weight must just be comparable, so >> that comparing codes is equivalent to compare weights represented by each >> code. >> >> As well, you are not required to store multiple "marks". This is just one >> of the possibilities to encode in the sort key which level is encoded after >> each "mark", and the marks are not necessarily the same before each level >> (their length may also vary depending on the level they are starting): >> these marks may be completely removed from the final encoding if the >> encoding/compression used allows discriminating the level used by all >> weights, encoded in separate sets of values. >> >> Typical compression technics are for example differencial, notably in >> secondary or higher levels, and run-legth encoded to skip sequences of >> weights all equal to the minimum weight. >> >> The code units used by the weigh encoding for each level may also need to >> avoid some forbidden values if needed (e.g. 
when encoding the weights to >> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >> would create a string not conforming to any standard UTF). >> >> Once again this means that the sequence of logical weight will can sefely >> become a readable string, even suitable to be transmitted as plain-text >> using any UTF, and that compression is also possible in that case: you can >> create and store lot of sort keys even for very long texts >> >> However it is generally better to just encode sort keys only for a >> reasonnably discriminant part of the text, e.g. no sort key longer than 255 >> bytes (created from the start of the original texts): if you compare two >> sort keys and find that they are equal, and if both sort keys have this >> length of 255 bytes, then you'll compare the full original texts using the >> fast-compare algorithm: you don't need to store full sort keys in addition >> to the original texts. This can save lot of storage, provided that original >> texts are sufficiently discriminated by their start, and that cases where >> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >> >> For short texts however, truncated sortkeys may save time at the price of >> a reasonnable storage cost (but sortkeys can be also encoded with roughly >> the same size as the original text: compression is modest for the encoded >> primary level. But compression is frequently very effective for higher >> levels where their smaller weight also have less possible variations of >> value, in a smaller set. >> >> Notably for the secondary level used to encode case differences, only 3 >> bits are enough per weight, and you just need to reserve the 3-bit value >> "000" as the "mark" for indicating the start of another higher level, while >> encoding secondary weights as "001" to "111". >> >> (This means that primary levels have to be encoded so that none of their >> encoded primary weights are starting with "000" marking the start of the >> secondary level. So primary weights can be encoded in patterns starting by >> "0001", "001", "01", or "1" and followed by other bits: this allows >> encoding them as readable UTF-8 if these characters are all different at >> primary level, excluding only the 16 first C0 controls which need to be >> preprocessed into escape sequences using the first permitted C0 control as >> an escape, and escaping that C0 control itself). >> >> The third level, started by the mark "00" and followed by the encoded >> weights indicating this is a tertiary level and not an higher level, will >> also be used to encode a small set of weights (in most locales, this is not >> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >> differential coding on 3-bits, you reserve "000" as the "mark" for the next >> higher level, then use "001" to "111" to encode differencial weights, the >> differencial weights being initially based on the minimum tertiary weight, >> you'll use the bit pattern "001" to encode the most frequent minimum >> tertiary weight, and patterns "01" to "11" plus additional bits to encode >> other positive or negative differences of tertiary weights, or to use >> run-length compression). Here also it is possible to map the patterns so >> that the encoded secondary weight will be readable valid UTF-8. 
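To make the scheme sketched in the last few paragraphs concrete, here is a toy Python version: one weight array per level, a single low "mark" value instead of a 16-bit 0000 separator, and the trailing run of minimum weights trimmed from each level. The weights and minimums below are invented for the example (they are not DUCET values), and no claim is made that this matches ICU's actual fractional-weight format; it only illustrates that a key built this way still compares correctly with plain bytewise comparison. Whether trimming trailing minimum weights is order-preserving for a full tailored table depends on the well-formedness conditions in UTS #10, so treat it as an illustration of the argument, not a proven optimization.

    MIN_WEIGHT = {1: 0x02, 2: 0x20, 3: 0x02}  # assumed per-level minimum weights
    LEVEL_MARK = 0x01                         # one byte, lower than every real weight

    def sort_key(levels):
        """levels: {level: [weights]} -> bytes; no 0000 separator anywhere."""
        key = bytearray()
        for level in sorted(levels):
            weights = list(levels[level])
            # Trim only the *trailing* run of minimum weights; a minimum weight
            # followed by a higher one (the 0020 before 0021 case) is kept.
            while weights and weights[-1] == MIN_WEIGHT[level]:
                weights.pop()
            if level > 1:
                key.append(LEVEL_MARK)
            key.extend(weights)
        return bytes(key)

    # "ab" is a primary-level prefix of "abX"; the level mark (0x01) is lower
    # than any primary weight, so the shorter string still sorts first.
    k_ab  = sort_key({1: [0x50, 0x52],       2: [0x20, 0x20],       3: [0x02, 0x02]})
    k_abX = sort_key({1: [0x50, 0x52, 0x54], 2: [0x20, 0x20, 0x20], 3: [0x02, 0x02, 0x02]})
    assert k_ab < k_abX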
>> >> The fourth level, started by the mark "000" can use the pattern "001" to >> encode the most frequent minimum quaternary weight, and patterns "010" to >> "011" followed by other bits to differentially encode the quaternary >> weights. Here again it is possible to create an encoding for quaternary >> weights that can use some run-length compression and can also be readable >> valid UTF-8! >> >> And so on. >> >> >> >> >> >> >> >> >> Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a >> ?crit : >> >>> So it should be clear in the UCA algorithm and in the DUCET datatable >>> that "0000" is NOT a valid weight >>> It is just a notational placeholder used as ".0000", only indicating in >>> the DUCET format that there's NO weight assigned at the indicated level, >>> because the collation element is ALWAYS ignorable at this level. >>> The DUCET could have as well used the notation ".none", or just dropped >>> every ".0000" in its file (provided it contains a data entry specifying >>> what is the minimum weight used for each level). This notation is only >>> intended to be read by humans editing the file, so they don't need to >>> wonder what is the level of the first indicated weight or remember what is >>> the minimum weight for that level. >>> But the DUCET table is actually generated by a machine and processed by >>> machines. >>> >>> >>> >>> Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a >>> ?crit : >>> >>>> In summary, this step given in the algorithm is completely unneeded and >>>> can be dropped completely: >>>> >>>> *S3.2 *If L is not 1, append a *level >>>> separator* >>>> >>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>> lower than any weight in the resulting sort key. This guarantees that when >>>> two strings of unequal length are compared, where the shorter string is a >>>> prefix of the longer string, the longer string is always sorted after the >>>> shorter?in the absence of special features like contractions. For example: >>>> "abc" < "abcX" where "X" can be any character(s). >>>> >>>> Remove any reference to the "level separator" from the UCA. You never >>>> need it. >>>> >>>> As well this paragraph >>>> >>>> 7.3 Form Sort Keys >>>> >>>> *Step 3.* Construct a sort key for each collation element array by >>>> successively appending all non-zero weights from the collation element >>>> array. Figure 2 gives an example of the application of this step to one >>>> collation element array. >>>> >>>> Figure 2. Collation Element Array to Sort Key >>>> >>>> Collation Element ArraySort Key >>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>> 0002 0002 0002 >>>> >>>> can be written with this figure: >>>> >>>> Figure 2. Collation Element Array to Sort Key >>>> >>>> Collation Element ArraySort Key >>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>> >>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>> safely removed if they are respectively the minimum secondary weight and >>>> minimum tertiary weight. >>>> But note that 0020 is kept in two places as they are followed by a >>>> higher weight 0021. This is general for any tailored collation (not just >>>> the DUCET). >>>> >>>> Le jeu. 1 nov. 2018 ? 
21:42, Philippe Verdy a >>>> ?crit : >>>> >>>>> The 0000 is there in the UCA only because the DUCET is published in a >>>>> format that uses it, but here also this format is useless: you never need >>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>> just needs to indicate what is the minimum weight assigned for every level >>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>> >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>>> ?crit : >>>>> >>>>>> There are lots of ways to implement the UCA. >>>>>> >>>>>> When you want fast string comparison, the zero weights are useful for >>>>>> processing -- and you don't actually assemble a sort key. >>>>>> >>>>>> People who want sort keys usually want them to be short, so you spend >>>>>> time on compression. You probably also build sort keys as byte vectors not >>>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be >>>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root >>>>>> collation data file remunges all weights into fractional byte sequences, >>>>>> and leaves gaps for tailoring. >>>>>> >>>>>> markus >>>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 09:23:39 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 2 Nov 2018 15:23:39 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: The table is the way it is because it is easier to process (and comprehend) when the first field is always the primary weight, second is always the secondary, etc. Go ahead and transform the input DUCET files as you see fit. The "should be removed" is your personal preference. Unless we hear strong demand otherwise from major implementers, people have better things to do than change their parsers to suit your preference. Mark On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote: > It's not just a question of "I like it or not". But the fact that the > standard makes the presence of 0000 required in some steps, and the > requirement is in fact wrong: this is in fact NEVER required to create an > equivalent collation order. these steps are completely unnecessary and > should be removed. > > Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a ?crit : > >> You may not like the format of the data, but you are not bound to it. If >> you don't like the data format (eg you want [.0021.0002] instead of >> [.0000.0021.0002]), you can transform it however you want as long as you >> get the same answer, as it says here: >> >> http://unicode.org/reports/tr10/#Conformance >> ?The Unicode Collation Algorithm is a logical specification. >> Implementations are free to change any part of the algorithm as long as any >> two strings compared by the implementation are ordered the same as they >> would be by the algorithm as specified. Implementations may also use a >> different format for the data in the Default Unicode Collation Element >> Table. The sort key is a logical intermediate object: if an implementation >> produces the same results in comparison of strings, the sort keys can >> differ in format from what is specified in this document. (See Section 9, >> Implementation Notes.)? >> >> >> That is what is done, for example, in ICU's implementation. 
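The conformance paragraph quoted just above lends itself to a small self-check. The Python sketch below (collation elements borrowed from the Figure 2 example cited elsewhere in this thread) builds keys in three formats: with the 0000 level separator exactly as in UTS #10, with an arbitrary low mark value instead, and with no separator at all. All three give the same order for these samples; the last variant works here only because every secondary weight in the table is smaller than every primary weight. It is a toy, not ICU's implementation.

    # Collation elements as (primary, secondary, tertiary); 0 = ignorable at that level.
    samples = {
        "s1": [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002)],
        "s2": [(0x0706, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002),
               (0x06D9, 0x0020, 0x0002)],
        "s3": [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002),
               (0x06EE, 0x0020, 0x0002)],
    }

    def key(ces, separator):
        """Append the non-zero weights level by level, putting `separator`
        between levels; pass None to omit the separator entirely."""
        out = []
        for level in range(3):
            if level and separator is not None:
                out.append(separator)
            out.extend(ce[level] for ce in ces if ce[level])
        return tuple(out)

    orders = [sorted(samples, key=lambda s: key(samples[s], sep))
              for sep in (0x0000, 0x0001, None)]
    assert orders[0] == orders[1] == orders[2]  # same order, as the text requires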
See >> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw >> collation elements" and "sort keys" to see the transformed collation >> elements (from the DUCET + CLDR) and the resulting sort keys. >> >> a =>[29,05,_05] => 29 , 05 , 05 . >> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . >> ? => >> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . >> ? => >> >> Mark >> >> >> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < >> unicode at unicode.org> wrote: >> >>> As well the step 2 of the algorithm speaks about a single "array" of >>> collation elements. Actually it's best to create one separate array per >>> level, and append weights for each level in the relevant array for that >>> level. >>> The steps S2.2 to S2.4 can do this, including for derived collation >>> elements in section 10.1, or variable weighting in section 4. >>> >>> This also means that for fast string compares, the primary weights can >>> be processed on the fly (without needing any buffering) is the primary >>> weights are different between the two strings (including when one or both >>> of the two strings ends, and the secondary weights or tertiary weights >>> detected until then have not found any weight higher than the minimum >>> weight value for each level). >>> Otherwise: >>> - the first secondary weight higher that the minimum secondary weght >>> value, and all subsequent secondary weights must be buffered in a >>> secondary buffer . >>> - the first tertiary weight higher that the minimum secondary weght >>> value, and all subsequent secondary weights must be buffered in a tertiary >>> buffer. >>> - and so on for higher levels (each buffer just needs to keep a counter, >>> when it's first used, indicating how many weights were not buffered while >>> processing and counting the primary weights, because all these weights were >>> all equal to the minimum value for the relevant level) >>> - these secondary/tertiary/etc. buffers will only be used once you reach >>> the end of the two strings when processing the primary level and no >>> difference was found: you'll start by comparing the initial counters in >>> these buffers and the buffer that has the largest counter value is >>> necessarily for the smaller compared string. If both counters are equal, >>> then you start comparing the weights stored in each buffer, until one of >>> the buffers ends before another (the shorter buffer is for the smaller >>> compared string). If both weight buffers reach the end, you use the next >>> pair of buffers built for the next level and process them with the same >>> algorithm. >>> >>> Nowhere you'll ever need to consider any [.0000] weight which is just a >>> notation in the format of the DUCET intended only to be readable by humans >>> but never needed in any machine implementation. >>> >>> Now if you want to create sort keys this is similar except that you >>> don"t have two strings to process and compare, all you want is to create >>> separate arrays of weights for each level: each level can be encoded >>> separately, the encoding must be made so that when you'll concatenate the >>> encoded arrays, the first few encoded *bits* in the secondary or tertiary >>> encodings cannot be larger or equal to the bits used by the encoding of the >>> primary weights (this only limits how you'll encode the 1st weight in each >>> array as its first encoding *bits* must be lower than the first bits used >>> to encode any weight in previous levels). 
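Here is a compressed Python sketch of the fast-compare idea described a little above: weights are consumed level by level straight from the collation element arrays, no sort key is materialized, and running out of weights is handled directly instead of through a 0000 separator. It deliberately simplifies the buffering scheme (it walks the arrays once per level rather than buffering lower levels during a single primary pass), so it shows the ordering logic only, not the single-pass optimization.

    from itertools import zip_longest

    def compare(ces1, ces2):
        """Compare two collation element arrays level by level.
        Each element is (primary, secondary, tertiary); 0 = ignorable."""
        for level in range(3):
            w1 = (ce[level] for ce in ces1 if ce[level])
            w2 = (ce[level] for ce in ces2 if ce[level])
            for a, b in zip_longest(w1, w2, fillvalue=-1):
                # fillvalue -1 plays the role the level separator plays in a
                # sort key: the side that ran out of weights sorts first.
                if a != b:
                    return -1 if a < b else 1
        return 0

    A = [(0x0706, 0x0020, 0x0002), (0x06D9, 0x0020, 0x0002)]
    B = A + [(0x06EE, 0x0020, 0x0002)]
    assert compare(A, B) == -1 and compare(B, A) == 1 and compare(A, A) == 0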
>>> >>> Nowhere you are required to encode weights exactly like their logical >>> weight, this encoding is fully reversible and can use any suitable >>> compression technics if needed. As long as you can safely detect when an >>> encoding ends, because it encounters some bits (with lower values) used to >>> start the encoding of one of the higher levels, the compression is safe. >>> >>> For each level, you can reserve only a single code used to "mark" the >>> start of another higher level followed by some bits to indicate which level >>> it is, then followed by the compressed code for the level made so that each >>> weight is encoded by a code not starting by the reserved mark. That >>> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >>> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >>> any control or SPACE or isolated surrogate) and codes used to encode each >>> weight must not start by a byte lower or equal to this mark. The binary or >>> ASCII code units used to encode each weight must just be comparable, so >>> that comparing codes is equivalent to compare weights represented by each >>> code. >>> >>> As well, you are not required to store multiple "marks". This is just >>> one of the possibilities to encode in the sort key which level is encoded >>> after each "mark", and the marks are not necessarily the same before each >>> level (their length may also vary depending on the level they are >>> starting): these marks may be completely removed from the final encoding if >>> the encoding/compression used allows discriminating the level used by all >>> weights, encoded in separate sets of values. >>> >>> Typical compression technics are for example differencial, notably in >>> secondary or higher levels, and run-legth encoded to skip sequences of >>> weights all equal to the minimum weight. >>> >>> The code units used by the weigh encoding for each level may also need >>> to avoid some forbidden values if needed (e.g. when encoding the weights to >>> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >>> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >>> would create a string not conforming to any standard UTF). >>> >>> Once again this means that the sequence of logical weight will can >>> sefely become a readable string, even suitable to be transmitted as >>> plain-text using any UTF, and that compression is also possible in that >>> case: you can create and store lot of sort keys even for very long texts >>> >>> However it is generally better to just encode sort keys only for a >>> reasonnably discriminant part of the text, e.g. no sort key longer than 255 >>> bytes (created from the start of the original texts): if you compare two >>> sort keys and find that they are equal, and if both sort keys have this >>> length of 255 bytes, then you'll compare the full original texts using the >>> fast-compare algorithm: you don't need to store full sort keys in addition >>> to the original texts. This can save lot of storage, provided that original >>> texts are sufficiently discriminated by their start, and that cases where >>> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >>> >>> For short texts however, truncated sortkeys may save time at the price >>> of a reasonnable storage cost (but sortkeys can be also encoded with >>> roughly the same size as the original text: compression is modest for the >>> encoded primary level. 
But compression is frequently very effective for >>> higher levels where their smaller weight also have less possible variations >>> of value, in a smaller set. >>> >>> Notably for the secondary level used to encode case differences, only 3 >>> bits are enough per weight, and you just need to reserve the 3-bit value >>> "000" as the "mark" for indicating the start of another higher level, while >>> encoding secondary weights as "001" to "111". >>> >>> (This means that primary levels have to be encoded so that none of their >>> encoded primary weights are starting with "000" marking the start of the >>> secondary level. So primary weights can be encoded in patterns starting by >>> "0001", "001", "01", or "1" and followed by other bits: this allows >>> encoding them as readable UTF-8 if these characters are all different at >>> primary level, excluding only the 16 first C0 controls which need to be >>> preprocessed into escape sequences using the first permitted C0 control as >>> an escape, and escaping that C0 control itself). >>> >>> The third level, started by the mark "00" and followed by the encoded >>> weights indicating this is a tertiary level and not an higher level, will >>> also be used to encode a small set of weights (in most locales, this is not >>> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >>> differential coding on 3-bits, you reserve "000" as the "mark" for the next >>> higher level, then use "001" to "111" to encode differencial weights, the >>> differencial weights being initially based on the minimum tertiary weight, >>> you'll use the bit pattern "001" to encode the most frequent minimum >>> tertiary weight, and patterns "01" to "11" plus additional bits to encode >>> other positive or negative differences of tertiary weights, or to use >>> run-length compression). Here also it is possible to map the patterns so >>> that the encoded secondary weight will be readable valid UTF-8. >>> >>> The fourth level, started by the mark "000" can use the pattern "001" to >>> encode the most frequent minimum quaternary weight, and patterns "010" to >>> "011" followed by other bits to differentially encode the quaternary >>> weights. Here again it is possible to create an encoding for quaternary >>> weights that can use some run-length compression and can also be readable >>> valid UTF-8! >>> >>> And so on. >>> >>> >>> >>> >>> >>> >>> >>> >>> Le jeu. 1 nov. 2018 ? 22:04, Philippe Verdy a >>> ?crit : >>> >>>> So it should be clear in the UCA algorithm and in the DUCET datatable >>>> that "0000" is NOT a valid weight >>>> It is just a notational placeholder used as ".0000", only indicating in >>>> the DUCET format that there's NO weight assigned at the indicated level, >>>> because the collation element is ALWAYS ignorable at this level. >>>> The DUCET could have as well used the notation ".none", or just dropped >>>> every ".0000" in its file (provided it contains a data entry specifying >>>> what is the minimum weight used for each level). This notation is only >>>> intended to be read by humans editing the file, so they don't need to >>>> wonder what is the level of the first indicated weight or remember what is >>>> the minimum weight for that level. >>>> But the DUCET table is actually generated by a machine and processed by >>>> machines. >>>> >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 
21:57, Philippe Verdy a >>>> ?crit : >>>> >>>>> In summary, this step given in the algorithm is completely unneeded >>>>> and can be dropped completely: >>>>> >>>>> *S3.2 *If L is not 1, append >>>>> a *level separator* >>>>> >>>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>>> lower than any weight in the resulting sort key. This guarantees that when >>>>> two strings of unequal length are compared, where the shorter string is a >>>>> prefix of the longer string, the longer string is always sorted after the >>>>> shorter?in the absence of special features like contractions. For example: >>>>> "abc" < "abcX" where "X" can be any character(s). >>>>> >>>>> Remove any reference to the "level separator" from the UCA. You never >>>>> need it. >>>>> >>>>> As well this paragraph >>>>> >>>>> 7.3 Form Sort Keys >>>>> >>>>> *Step 3.* Construct a sort key for each collation element array by >>>>> successively appending all non-zero weights from the collation element >>>>> array. Figure 2 gives an example of the application of this step to one >>>>> collation element array. >>>>> >>>>> Figure 2. Collation Element Array to Sort Key >>>>> >>>>> Collation Element ArraySort Key >>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>>> 0002 0002 0002 >>>>> >>>>> can be written with this figure: >>>>> >>>>> Figure 2. Collation Element Array to Sort Key >>>>> >>>>> Collation Element ArraySort Key >>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>>> >>>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>>> safely removed if they are respectively the minimum secondary weight and >>>>> minimum tertiary weight. >>>>> But note that 0020 is kept in two places as they are followed by a >>>>> higher weight 0021. This is general for any tailored collation (not just >>>>> the DUCET). >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>>>> ?crit : >>>>> >>>>>> The 0000 is there in the UCA only because the DUCET is published in a >>>>>> format that uses it, but here also this format is useless: you never need >>>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>>> just needs to indicate what is the minimum weight assigned for every level >>>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>>> >>>>>> >>>>>> Le jeu. 1 nov. 2018 ? 21:08, Markus Scherer a >>>>>> ?crit : >>>>>> >>>>>>> There are lots of ways to implement the UCA. >>>>>>> >>>>>>> When you want fast string comparison, the zero weights are useful >>>>>>> for processing -- and you don't actually assemble a sort key. >>>>>>> >>>>>>> People who want sort keys usually want them to be short, so you >>>>>>> spend time on compression. You probably also build sort keys as byte >>>>>>> vectors not uint16 vectors (because byte vectors fit into more APIs and >>>>>>> tend to be shorter), like ICU does using the CLDR collation data file. The >>>>>>> CLDR root collation data file remunges all weights into fractional byte >>>>>>> sequences, and leaves gaps for tailoring. >>>>>>> >>>>>>> markus >>>>>>> >>>>>> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Nov 2 09:39:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 14:39:49 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <20181102143949.4165d666@JRWUBU2> On Fri, 2 Nov 2018 14:54:19 +0100 Philippe Verdy via Unicode wrote: > It's not just a question of "I like it or not". But the fact that the > standard makes the presence of 0000 required in some steps, and the > requirement is in fact wrong: this is in fact NEVER required to > create an equivalent collation order. these steps are completely > unnecessary and should be removed. > > Le ven. 2 nov. 2018 ? 14:03, Mark Davis ?? a > ?crit : > > > You may not like the format of the data, but you are not bound to > > it. If you don't like the data format (eg you want [.0021.0002] > > instead of [.0000.0021.0002]), you can transform it however you > > want as long as you get the same answer, as it says here: > > > > http://unicode.org/reports/tr10/#Conformance > > ?The Unicode Collation Algorithm is a logical specification. > > Implementations are free to change any part of the algorithm as > > long as any two strings compared by the implementation are ordered > > the same as they would be by the algorithm as specified. > > Implementations may also use a different format for the data in the > > Default Unicode Collation Element Table. The sort key is a logical > > intermediate object: if an implementation produces the same results > > in comparison of strings, the sort keys can differ in format from > > what is specified in this document. (See Section 9, Implementation > > Notes.)? Given the above paragraph, how does the standard force you to use a special 0000? Perhaps the wording of the standard can be changed to prevent your unhappy interpretation. > > That is what is done, for example, in ICU's implementation. See > > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw > > collation elements" and "sort keys" to see the transformed collation > > elements (from the DUCET + CLDR) and the resulting sort keys. > > > > a =>[29,05,_05] => 29 , 05 , 05 . > > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . > > ? => > > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . > > ? => As you can see, Mark does not come to the same conclusion as you, and nor do I. Richard. From unicode at unicode.org Fri Nov 2 10:04:13 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Fri, 2 Nov 2018 16:04:13 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> Message-ID: <20181102150413.r2mdgoulkoe46trq@angband.pl> On Fri, Nov 02, 2018 at 01:44:25PM +0000, Michael Everson via Unicode wrote: > I write my 7?s and Z?s with a horizontal line through them. ? is encoded > not for this purpose, but because Z and ? are distinct in orthographies > for varieties of Tatar, Chechen, Karelian, and Mongolian. This is a > contemporary writing convention but it does not argue for a new SEVEN WITH > STROKE character or that I should use ? rather than Z when I write > *?an?ibar. And that use conflicts with ? ? being an allograph of Polish ? ?, used especially when marks above cap height are unwanted or when readability is important (?? is too similar to ??). It also happened to be nicely renderable with Z^H- z^H- vs Z^H' z^H' on printers which had backspace. 
I unsuccessfully argued for such a variant on a "historical terminals" font: https://github.com/rbanffy/3270font/issues/19 But in either case the difference is purely visual rather than semantic. The latter still applies to _some_ uses of superscript, but not to the mgr. Meow! -- ??????? Have you heard of the Amber Road? For thousands of years, the ??????? Romans and co valued amber, hauled through the Europe over the ??????? mountains and along the Vistula, from Gda?sk. To where it came ??????? together with silk (judging by today's amber stalls). From unicode at unicode.org Fri Nov 2 10:10:21 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 16:10:21 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> Message-ID: <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> On 01/11/2018 16:43, Asmus Freytag via Unicode wrote: [quoted mail] > I don't think it's a joke to recognize that there is a continuum here and that > there is no line that can be drawn which is based on straightforward principles. [?] > In this case, there is no such framework that could help establish pragmatic > boundaries dividing the truly useful from the merely fanciful. I think the red line was always between the positive and the negative answer to the question whether a given graphic is relevant for legibility/readability of the plain text backbone. But humans can be trained to mentally disambiguate a mass of confusables, so the line vanishes and the continuum remains intact. On 02/11/2018 06:22, Asmus Freytag via Unicode wrote: > On 11/1/2018 7:59 PM, James Kass via Unicode wrote: >> >> Alphabetic script users write things the way they are spelled and spell things >> the way they are written. The abbreviation in question as written consists of >> three recognizable symbols. An "M", a superscript "r", and an equal sign >> (= two lines). It can be printed, handwritten, or in fraktur; it will still >> consist of those same three recognizable symbols. >> >> We're supposed to be preserving the past, not editing it or revising it. >> > Alphabetic script users' handwriting does not match print in all features. > Traditional German handwriting used a line like a macron over the letter 'u' > to distinguish it from 'n'. Rendering this with a u-macron in print would be > the height of absurdity. > > I feel similarly about the assertion that the "two lines" are something that > needs to be encoded, but only an expert would know for sure. Indeed it would be relevant to know whether it is mandatory in Polish, and I?m not an expert. But looking at several scripts using abbreviation indicators as superscript, i.e. Latin and Cyrillic (when using the Latin-script-written abbreviation of "Numero", given Cyrillic for "N" is "?", so it?s strictly speaking one single script, and two scripts using it), then we can easily see how single and double underlines are added or not depending on font design and on customary writing and display. E.g. 
the Romance feminine and masculine ordinal indicators have one or zero underlines, to such an extent that French typography specifies that the masculine ordinal indicator, despite being a superscript small o, is unfit to compose the French "numéro" abbreviation, which must not have an underline. Hence DEGREE SIGN is less bad than U+00BA. If applying the same to Polish, "Magister" is "Mʳ" and is straightforward to input when using a new French keyboard layout or an enhanced variant of any national Latin one having small superscripts on the Shift+Num level, or via a “superscript” dead key, mapped e.g. on Shift + AltGr/Option + E or any of the 26 letter keys as mnemonically convenient ("superscript" translates to French "exposant"); or “Compose” “^” [e] (where the ASCII circumflex or caret is repurposed for superscript compose sequences, while “circumflex accent” is active *after* LESS-THAN SIGN, consistently with the *new* convention for “inverted breve” using LEFT PARENTHESIS rather than "g)". These details are posted in this thread on this List rather than CLDR-USERS in order to make clear that typing superscript letters directly via the keyboard is easy, and therefore to propose it is not to harass the end-user. On 02/11/2018 13:09, Asmus Freytag via Unicode wrote: [quoted mail] […] > To transcribe the postcard would mean selecting the characters appropriate > for the printed equivalent of the text. As already suggested, selecting the variants can be done using variation selectors, provided the Standard has defined the intended use case. > > If the printed form had a standard way of superscripting letters with a > decoration below when used for abbreviations, As already pointed out, Latin script does not benefit from a consensus to use underline for superscript. E.g. Italian, Portuguese and Spanish do use underline for superscript, English and French do not. > then, and only then would we start discussing whether this decoration > needs to be encoded, or whether it is something a font can supply as part > of rendering the (sequence of) superscripted letters. I think the problem is not completely outlined, as long as the use of variation sequences is not mentioned. There is no "all" or "nothing" dilemma, given Unicode has the means of providing a standard way of representing calligraphic variations using variation selectors. E.g. the letter ENG is preferred in big lowercase form when writing Bambara, while other locales may like it in hooked uppercase. The Bambara Arial font allows to make sure it is the right glyph, and Arial in general follows the Bambara preference, but other fonts do not, while some of them have the Bambara-fit glyph inside but don't display it unless urged by an OpenType-supporting renderer, and appropriate settings turned on, e.g. on a locale identifier basis. > (Perhaps with the aid of markup identifying the sequence as abbreviation). That seems to me a regression, after the front has moved in favor of recognizing Latin script needs preformatted superscript. The use case is clear, as we have ª, º, and n° with degree sign, and so on as already detailed in long e-mails in this thread and elsewhere. There is no point in setting up or maintaining a Unicode policy stating otherwise, as such a policy would be inconsistent with long-lasting and extremely widespread practice. The main thing to fix is the font stack of user agents, that is finally everyone's computer. Alternatively web sites may wish to use web fonts.
In order to have superscripts displayed in a professional and civilized way, with no ransome note effect. In aUnicode conformant way, to say it shortly. > > All else is just applying visual hacks to simulate a specific appearance, > at the possible cost of obscuring the contents. As already pointed out, the hack here is to use a higher level protocol to simulate the effect of abbreviation indicator superscript. Using the latter is not ?obscuring?, but _clarifying_ ?the contents.? But I agree that adding combining diacritics to get the related underlines may obscure the content if unsupported (displaying as .notdef box). The concern about machine readability of the content is addressed by setting up equivalence classes and using DUCET discussed in the parallel thread. Best regards, Marcel From unicode at unicode.org Fri Nov 2 10:38:45 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 02 Nov 2018 08:38:45 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Do we have any other evidence of this usage, besides a single handwritten postcard? -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Nov 2 10:42:52 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 02 Nov 2018 08:42:52 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181102084252.665a7a7059d7ee80bb4d670165c8327d.5aa2c4d5b0.wbe@email03.godaddy.com> Michael Everson wrote: > I write my 7?s and Z?s with a horizontal line through them. ? is > encoded not for this purpose, but because Z and ? are distinct in > orthographies for varieties of Tatar, Chechen, Karelian, and > Mongolian. This is a contemporary writing convention but it does not > argue for a new SEVEN WITH STROKE character or that I should use ? > rather than Z when I write *?an?ibar. http://www.unicode.org/L2/L2018/18323-open-four.pdf -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Nov 2 11:20:00 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 17:20:00 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> (Asmus Freytag via Unicode's message of "Fri, 2 Nov 2018 05:09:51 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> Message-ID: <86ftwjpi33.fsf@mimuw.edu.pl> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: [...] > To transcribe the postcard would mean selecting the characters > appropriate for the printed equivalent of the text. You seem to make implicit assumptions which are not necessarily true. For me to transcribe the postcard would mean to answer the needs of the intended transcription users. > If the printed form had a standard way of superscripting letters with > a decoration below when used for abbreviations, then, and only then > would we start discussing whether this decoration needs to be encoded, > or whether it is something a font can supply as part of rendering the > (sequence of) superscripted letters. 
(Perhaps with the aid of markup > identifying the sequence as abbreviation). As I wrote already some time ago on the list, the alternative "encoding or using a specialized font" is wrong. These days texts are encoding for processing (in particular searching), rendering is just a kind of side-effect. On the other hand, whom do you mean by "we" and what do you mean by "encoding"? If I guess correctly what do you mean by these words then you are discussing an issue which was never raised by anybody (if I'm wrong, please quote the relevant post). Again is not clear for me whom you want to convince or inform. > All else is just applying visual hacks I don't mind hacks if they are useful and serve the intended purpose, even if they are visual :-) > to simulate a specific appearance, As I said above, the appearance is not necessarily of primary importance. > at the possible cost of obscuring the contents. It's for the users of the transcription to decide what is obscuring the text and what, to the contrary, makes the transcription more readable and useful. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 11:37:21 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 17:37:21 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: <55047cad-d1de-707a-70b7-fdf8fb17bbc3@orange.fr> On 31/10/2018 at 19:34, Asmus Freytag via Unicode wrote: > > On 10/31/2018 10:32 AM, Janusz S. Bie? via Unicode wrote: > > > > Let me remind what plain text is according to the Unicode glossary: > > > > Computer-encoded text that consists only of a sequence of code > > points from a given standard, with no other formatting or structural > > information. > > > > If you try to use this definition to decide what is and what is not a > > character, you get vicious circle. > > > > As mentioned already by others, there is no other generally accepted > > definition of plain text. Being among those who argued that the ?plain text? concept cannot?and therefore mustn?t?be used per se to disallow the use of a more or less restricted or extended set of characters in what is called ?ordinary text?, I?m ending up adding the following in case it might be of interest: > > This definition becomes tautological only when you try to invoke it in making > encoding decisions, that is, if you couple it with the statement that only > "elements of plain text" are ever encoded. I don?t think that Janusz S. Bie??s concern is about this definition being ?tautological?. AFAICS the Unicode definition of ?plain text? is quoted to back the assumption that it?s hard to use that concept to argue against the use of a given Unicode character in a given context, or to use it to kill a proposal for characters significant in natural languages. The reasoning is that the call not to use character X in plain text, while X is a legal Unicode character whose use is not discouraged for technical reasons, is like if ?ordinary people? (scarequoted derivative from ?ordinary text?) were told that X is not a Unicode character. That discourse is a ?vicious circle? in that there is no limit to it until Latin script is pulled down to plain ASCII. 
As already well known, diacritics are handled by the rendering system and don?t need to be displayed as such in the plain text backbone. I don?t believe that the same applies to other scripts, but these are often not considered when the encoding of Latin preformatted letters is fought, given superscripting seems to be proper to Latin, and originated from longlasting medieval practice and writing conventions. > > For that purpose, you need a number of other definitions of "plain text". > Including the definition that plain text is the "backbone" to which you apply > formatting and layout information. I personally believer that there are more > 2D notations where it's quite obvious to me that what is "placed" is a text > element. More like maps and music and less like a circuit diagram, where the > elements are less text like (I deliberately include symbols in the definition > of text, but not any random graphical line art). All two-dimensional notations here (outside the parenthetical) use higher-level protocols; maps and diagrams are often vector graphics. But Unicode strived to encode all needed plain text elements, such as symbols for maritime and wheather maps. Even arrows of many possible shapes, including 3D-looking ones, have been encoded. While freehand (rather than ?any random?) graphical art is out of scope, we have a lot of box drawing, used with appropriate fonts to draw e.g. layouts of keyboards above the relevant source code in plain text files (examples in XKB). As a sidenote: Box drawing while useful is unduly neglected on font level, even in the Code Charts where the advance width, usually half an em, is inconsistent between different sorts of elements belonging to the same block. > > Another definition of plain text is that which contains the "readable content" > of the text. As already discussed on this List, many documents in PDF have hard-to-read plain text backbones, even misleading Google Search, for the purpose of handling special glyphs (and, in some era, even special characters). > As we've discussed here, this definition has edge cases; some > content is traditionally left to styling. Many pre-Unicode traditions are found out there, that stay in use, partly for technical reasons (mainly by lack of updated keyboard layouts), partly for consistency with accustomed ways of doing. Being traditionally-left-to-styling is the more unconvincing. Even a letter that got to become LATIN SMALL LETTER O E (Unicode 1.0) was composed on typewriters using the half-backspace, and should be _left to styling_ when it was pulled out of the draft ISO/IEC 8859-1 by the fault of a Frenchman (name undisclosed for privacy). And we?ve been told on this List that the tradition using styling (a special font) to display the additional Latin letters used to write Bambara survived. > Example: some of the small words in > some Scandinavian languages are routinely italicized to disambiguate their > reading. Other languages use titlecase to achieve the same disambiguation. E.g. French titlecases the noun "Une" which means the "cover", not the undefined article, and German did the same when "Ein(e)" is a numeral, but today, other means, including italics, are more common. > Other languages use accents for this purpose - sometimes without > recognizing either the accented letter as part of the alphabet, or the accented > form as a dictionary entry. Talking about Dutch stressing acute, discussed earlier on this List. 
> Which nicely shows, that this level disambiguation > is intuitively viewed as less orthographic, something that applies to the cases > where italics are used for the same purpose. Another Unicode-conformant means of noting stress would be adding an emoji. :-| If stress is close to emotion, it could be represented in a similar way. Strictly speaking, that is off-topic in this thread, that is about representing abbreviations in a legible rather than merely decipherable way. In plain text. If stress is not represented, you still can read the sentence without stumbling. That is not always true when abbreviations are not superscripted. I remember an ASCII-only environment localized in French, where "no centre mess" is "num?ro du centre de messagerie", "dial number of the message platform". Being unfamiliar, I did stumble prior to completing and understanding the meaning: "n? centre mess." > > In some contexts (Western Math) the scope of readable content is different than > that of ordinary text. Therefore, this definition of "plain text" isn't universal. > In principle, you could argue that your definition of readable content should apply; > however, as a standard, Unicode will insist on limiting the encoding to text elements > required by some common, widely shared and reasonably agreed-upon definition of > plain text -- corresponding to a particular division between text elements and styling. > So far, we have ordinary text, math and phonetics, Thanks for clarification. Nevertheless, that partition of roles has something arbitrary as long as abbreviation indicators are excluded from the scope of ordinary text. That is, that policy is applied and promoted without being well designed. It implodes from the beginning on, given the feminine and masculine ordinal indicators pre-dated Unicode and are a living proof of the importance of preformatted superscripts. Instead of drawing the borderline between usages only, Unicode draw it between natural languages, stating that Italian, Portuguese and Spanish are entitled to use superscript ordinal indicators, whereas on the other hand, English and French are not. In the same vein, Italian, Portuguese and Spanish are granted the right of composing titles and some other abbreviations using preformatted superscript letters, as long as the set doesn?t exceed a and o, but other languages are not when using other or more letters, or when not being accustomed to underlining as an additional abbreviation indicator. Fortunately that is no longer true, so the point is actually to redact the relevant paragraphs in TUS, already for consistency with CLDR. Contributions are hopefully welcome. Best regards, Marcel From unicode at unicode.org Fri Nov 2 11:45:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 2 Nov 2018 17:45:58 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> Message-ID: Le ven. 2 nov. 2018 ? 
16:20, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > That seems to me a regression, after the front has moved in favor of > recognizing Latin script needs preformatted superscript. The use case is > clear, as we have ?, ?, and n? with degree sign, and so on as already > detailed in long e-mails in this thread and elsewhere. There is no point > in setting up or maintaining a Unicode policy stating otherwise, as such > a policy would be inconsistent with longlasting and extremely widespread > practice. > Using variation selectors is only appropriate for these existing (preencoded) superscript letters ? and ? so that they display the appropriate (underlined or not underlined) glyph. It is not a solution for creating superscripts on any letters and mark that it should be rendered as superscript (notably, the base letter to transform into superscript may also have its own combining diacritics, that must be encoded explicitly, and if you use the varaition selector, it should allow variation on the presence or absence of the underline (which must then be encoded explicitly as a combining character. So finally what we get with variation selectors is: and which is NOT canonically equivalent. Using a combining character avoids this caveat: and which ARE canonically equivalent. And this explicitly states the semantic (something that is lost if we are forced to use presentational superscripts in a higher level protocol like HTML/CSS for rich text format, and one just extracts the plain text; using collation will not help at all, except if collators are built with preprocessing that will first infer the presence of a to insert after each combining sequence of the plain-text enclosed in a italic style). There's little risk: if the is not mapped in fonts (or not recognized by text renderers to create synthetic superscript scripts from existing recognized clusters), it will render as a visible .notdef (tofu). But normally text renderers recognize the basic properties of characters in the UCD and can see that has a combining mark general property (it also knows that it has a 0 combinjing class, so canonical equivalences are not broken) to render a better symbols than the .notdef "tofu": it should better render a dotted circle. Even if this tofu or dotted circle is rendered, it still explicitly marks the presence of the abbreviation mark, so there's less confusion about what is preceding it (the combining sequence that was supposed to be superscripted). The can also have its own to select other styles when they are optional, such as adding underlines to the superscripted letter, or rendering the letter instead as underscript, or as a small baseline letter with a dot after it: this is still an explicit abbreviation mark, and the meaning of the plein text is still preserved: the variation selector is only suitable to alter the rendering of a cluster when it has effectively several variants and the default rendering is not universal, notably across font styles initially designed for specific markets with their own local preferences: the variation selector still allows the same fonts to map all known variants distinctly, independantly of the initial arbitrary choice of the default glyph used when the variation selector is missing). Even if fonts (or text renderers may map the to variable glyphs, this is purely stylictic, the semantic of the plain text is not lost because the is still there. 
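The canonical-equivalence point above can at least be checked with existing combining marks. The proposed abbreviation mark is not an encoded character, so the Python lines below only demonstrate the general mechanism it would rely on: two combining marks with different non-zero combining classes (acute above, ccc 230; double low line below, ccc 220) may be stored in either order and still normalize to the same sequence. A mark with combining class 0 would not reorder this way.

    import unicodedata as ud

    seq1 = "r\u0301\u0333"  # r + COMBINING ACUTE ACCENT + COMBINING DOUBLE LOW LINE
    seq2 = "r\u0333\u0301"  # the same marks in the opposite storage order

    assert ud.normalize("NFD", seq1) == ud.normalize("NFD", seq2)  # canonically equivalent
    print(ud.combining("\u0301"), ud.combining("\u0333"))          # 230 220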
There's no need of any rich-text to encode it (the rich -text styles are not explicitly encoding that a superscript is actually an abbreviation mark, so it cannot also allow variation like rendering an underscript, or a baseline small glyph with an added dot. Typically a used in an English style would render the letter (or cluster) before it as a "small" letter without any added dot. So I really think that is far better than: * using preencoded superscript letters (they don't map all the necessary repertoire of clusters where the abbreviation is needed, it now just covers Basic Latin, ten digits, plus and minus signs, and the dot or comma, plus a few other letters like stops; it's impossible to rencode the full Unicode repertoire and its allowed combining sequences or extended default grapheme clusters!), * or using variation selectors to make them appear as a superscript (does not work with all clusters containing other diacritics like accents), * or using rich-text styling (from which you cannot safely infer any semantic (there no warranty that Mr in HTML is actually an abbreviation of "Mister"; in HTML this is encoded elsewhere as Mr or Mr (the semantic of the abbreviation has to be looked a possible container element and the meaning of the abbreviation is to look inside its title attribute, so obviously this requires complex preprocessing before we can infer a plaintext version (suitable for example in plain-text searches where you don't want to match a mathematical object M, like a matrix, elevated to the power r, or a single plaintext M followed by a footnote call noted by the letter "r"). It solves all practical problems: legacy encoding using the preencoded superscript Latin letters (aka "modifier letters") should have never been used or needed (not even for IPA usage which could have used an explicit for its superscripted symbols, or for its distinctive "a" and "g"). We should not have needed to encode the variants for "a" and "g": these were old hacks that broke the Unicode character encoding model since the beginning. However only roundtrip compatibility with legacy non UCS charsets milited only for keeping the ordinal feminine or ordinal masculine mark, or the "Numero" cluster (actually made of two letters, the second one followed by an implicit abbreviation mark, but transformed in the legacy charset to be treated as a single unbreakable cluster containing only one symbol; even Unicode considers the abbreviated Numero as being only "compatibility equivalent" to the letter N followed by the masculine ordinal symbol, the latter being also only "compatibility equivalent" to a letter o with an implicit superscript, but also with an optional combining underline). All these superscripts in Unicode (as well as Mathematical "styled" letters, which were also completely unnecessary and will necessarily be incomplete for the intended usage) are now to be treated only as legacy practices, they should be deprecated in favor of the more semantic and logical character encoding model, deprecating complelely the legacy visual encoding. Only precombined characters, recognized by canonical equivalences are part of the standard and may be kept as "non"-legacy: they still fit in the logical encoding. 
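On the searchability point: the existing preformatted superscript letter has only a compatibility decomposition, not a canonical one, so whether a search treats "M" + U+02B3 and plain "Mr" as the same string depends on whether it applies compatibility (NFKC/NFKD) folding. The Python lines below show the current character properties; they say nothing about the proposed combining abbreviation mark, which does not exist in the standard today.

    import unicodedata as ud

    abbr = "M\u02B3"                           # M + MODIFIER LETTER SMALL R
    print(ud.normalize("NFD",  abbr) == abbr)  # True: no canonical decomposition
    print(ud.normalize("NFKD", abbr))          # 'Mr': compatibility mapping only
    print(ud.decomposition("\u02B3"))          # '<super> 0072'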
Likewise, the extended default grapheme clusters include the precomposed Hangul LVT and LV syllables, CGJ used before combining marks with a non-zero combining class, and variation selectors used only after base letters with combining class zero that start the extended default grapheme clusters.

Let's return to the root of the far better logical encoding, which remains the recommended practice. All the rest is legacy: some of it came from decisions taken to preserve roundtrip compatibility with legacy charsets (including prepended letters in Thai), so we have a few compatibility characters (which are not the recommended practice), but the rest was bad decisions made by Unicode and the ISO WG that break the logical character encoding model.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Fri Nov  2 11:46:42 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 2 Nov 2018 17:46:42 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86ftwjpi33.fsf@mimuw.edu.pl>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
 <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl>
 <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <86ftwjpi33.fsf@mimuw.edu.pl>
Message-ID: <72e3fabb-6b01-2b77-16c8-56e049ab2707@orange.fr>

On 02/11/2018 17:20, Janusz S. Bień via Unicode wrote:
> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote:
>
> [...]
>
>> To transcribe the postcard would mean selecting the characters
>> appropriate for the printed equivalent of the text.
>
> You seem to make implicit assumptions which are not necessarily
> true. For me to transcribe the postcard would mean to answer the needs
> of the intended transcription users.
>
>> If the printed form had a standard way of superscripting letters with
>> a decoration below when used for abbreviations, then, and only then,
>> would we start discussing whether this decoration needs to be encoded,
>> or whether it is something a font can supply as part of rendering the
>> (sequence of) superscripted letters. (Perhaps with the aid of markup
>> identifying the sequence as an abbreviation.)
>
> As I wrote already some time ago on the list, the alternative "encoding
> or using a specialized font" is wrong. These days texts are encoded for
> processing (in particular searching); rendering is just a kind of
> side-effect.

Indeed, not using MODIFIER LETTER SMALL R to encode the r in "Mʳ" would make it harder to retrieve the "Magister" abbreviation in a database. E.g., Bing Search, having less extended equivalence classes when I tested it on mathematical preformatted letters, was able to retrieve them precisely. Perhaps that is still the case.

Best regards,

Marcel

From unicode at unicode.org  Fri Nov  2 12:02:05 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 2 Nov 2018 18:02:05 +0100
Subject: UCA unnecessary collation weight 0000
In-Reply-To: 
References: 
Message-ID: 

I was replying not about the notational representation of the DUCET data table (using [.0000...] unnecessarily), but about the text of UTR#10 itself.
That text remains highly confusing, contains completely unnecessary steps, and just complicates things with absolutely no benefit at all by introducing confusion about these "0000".

UTR#10 still does not explicitly state that its use of "0000" does not mean it is a valid "weight"; it is a notation only (but the notation is used for TWO distinct purposes: one is to present the notation format used in the DUCET itself to show how collation elements are structured, the other is to mark the presence of a possible, but not always required, encoding of an explicit level separator when building sort keys). UTR#10 is still needlessly confusing.

Even the example tables can be made without using these "0000" (for example, tables showing how to build sort keys can present the list of weights split into separate columns, one column per level, without any "0000"). The implementation does not necessarily have to create a single buffer containing all weight values in a row; separate buffers for each level are far superior (and even more efficient, as this can save space in memory).

The step "S3.2" in the UCA algorithm should not even be there (it is written in favor of a specific implementation which is not even efficient or optimal); it complicates the algorithm with absolutely no benefit at all. You can ALWAYS remove it completely and still generate equivalent results.

On Fri, 2 Nov 2018 at 15:23, Mark Davis wrote:

> The table is the way it is because it is easier to process (and
> comprehend) when the first field is always the primary weight, second is
> always the secondary, etc.
>
> Go ahead and transform the input DUCET files as you see fit. The "should
> be removed" is your personal preference. Unless we hear strong demand
> otherwise from major implementers, people have better things to do than
> change their parsers to suit your preference.
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy wrote:
>
>> It's not just a question of "I like it or not", but the fact that the
>> standard makes the presence of 0000 required in some steps, and the
>> requirement is in fact wrong: this is in fact NEVER required to create an
>> equivalent collation order. These steps are completely unnecessary and
>> should be removed.
>>
>> On Fri, 2 Nov 2018 at 14:03, Mark Davis wrote:
>>
>>> You may not like the format of the data, but you are not bound to it. If
>>> you don't like the data format (e.g. you want [.0021.0002] instead of
>>> [.0000.0021.0002]), you can transform it however you want as long as you
>>> get the same answer, as it says here:
>>>
>>> http://unicode.org/reports/tr10/#Conformance
>>> "The Unicode Collation Algorithm is a logical specification.
>>> Implementations are free to change any part of the algorithm as long as any
>>> two strings compared by the implementation are ordered the same as they
>>> would be by the algorithm as specified. Implementations may also use a
>>> different format for the data in the Default Unicode Collation Element
>>> Table. The sort key is a logical intermediate object: if an implementation
>>> produces the same results in comparison of strings, the sort keys can
>>> differ in format from what is specified in this document. (See Section 9,
>>> Implementation Notes.)"
>>>
>>>
>>> That is what is done, for example, in ICU's implementation.
See >>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw >>> collation elements" and "sort keys" to see the transformed collation >>> elements (from the DUCET + CLDR) and the resulting sort keys. >>> >>> a =>[29,05,_05] => 29 , 05 , 05 . >>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 . >>> ? => >>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 . >>> ? => >>> >>> Mark >>> >>> >>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> As well the step 2 of the algorithm speaks about a single "array" of >>>> collation elements. Actually it's best to create one separate array per >>>> level, and append weights for each level in the relevant array for that >>>> level. >>>> The steps S2.2 to S2.4 can do this, including for derived collation >>>> elements in section 10.1, or variable weighting in section 4. >>>> >>>> This also means that for fast string compares, the primary weights can >>>> be processed on the fly (without needing any buffering) is the primary >>>> weights are different between the two strings (including when one or both >>>> of the two strings ends, and the secondary weights or tertiary weights >>>> detected until then have not found any weight higher than the minimum >>>> weight value for each level). >>>> Otherwise: >>>> - the first secondary weight higher that the minimum secondary weght >>>> value, and all subsequent secondary weights must be buffered in a >>>> secondary buffer . >>>> - the first tertiary weight higher that the minimum secondary weght >>>> value, and all subsequent secondary weights must be buffered in a tertiary >>>> buffer. >>>> - and so on for higher levels (each buffer just needs to keep a >>>> counter, when it's first used, indicating how many weights were not >>>> buffered while processing and counting the primary weights, because all >>>> these weights were all equal to the minimum value for the relevant level) >>>> - these secondary/tertiary/etc. buffers will only be used once you >>>> reach the end of the two strings when processing the primary level and no >>>> difference was found: you'll start by comparing the initial counters in >>>> these buffers and the buffer that has the largest counter value is >>>> necessarily for the smaller compared string. If both counters are equal, >>>> then you start comparing the weights stored in each buffer, until one of >>>> the buffers ends before another (the shorter buffer is for the smaller >>>> compared string). If both weight buffers reach the end, you use the next >>>> pair of buffers built for the next level and process them with the same >>>> algorithm. >>>> >>>> Nowhere you'll ever need to consider any [.0000] weight which is just a >>>> notation in the format of the DUCET intended only to be readable by humans >>>> but never needed in any machine implementation. 
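A deliberately simplified sketch of the level-by-level comparison idea in the message quoted above. It is not ICU and not the DUCET: collation elements are modelled as already-looked-up (primary, secondary, tertiary) tuples with invented weights, and the incremental buffering and counters described above are omitted; ignorable weights are simply skipped, so no 0000 placeholder is ever needed.

    def level_weights(ces, level):
        # All non-zero weights of one level, in order; ignorable weights are
        # dropped rather than represented by a 0000 placeholder.
        return [ce[level] for ce in ces if ce[level] != 0]

    def compare(ces_a, ces_b):
        # Compare two sequences of (primary, secondary, tertiary) collation
        # elements level by level; lists of ints compare lexicographically.
        for level in (0, 1, 2):
            wa, wb = level_weights(ces_a, level), level_weights(ces_b, level)
            if wa != wb:
                return -1 if wa < wb else 1
        return 0

    # Invented weights: a bare letter and the same letter with an accent tie
    # on the primary level and are decided on the secondary level.
    plain    = [(0x29, 0x05, 0x05)]
    accented = [(0x29, 0x05, 0x05), (0x00, 0x45, 0x05)]  # accent is primary-ignorable
    print(compare(plain, accented))   # -1: the accented form sorts after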
>>>> >>>> Now if you want to create sort keys this is similar except that you >>>> don"t have two strings to process and compare, all you want is to create >>>> separate arrays of weights for each level: each level can be encoded >>>> separately, the encoding must be made so that when you'll concatenate the >>>> encoded arrays, the first few encoded *bits* in the secondary or tertiary >>>> encodings cannot be larger or equal to the bits used by the encoding of the >>>> primary weights (this only limits how you'll encode the 1st weight in each >>>> array as its first encoding *bits* must be lower than the first bits used >>>> to encode any weight in previous levels). >>>> >>>> Nowhere you are required to encode weights exactly like their logical >>>> weight, this encoding is fully reversible and can use any suitable >>>> compression technics if needed. As long as you can safely detect when an >>>> encoding ends, because it encounters some bits (with lower values) used to >>>> start the encoding of one of the higher levels, the compression is safe. >>>> >>>> For each level, you can reserve only a single code used to "mark" the >>>> start of another higher level followed by some bits to indicate which level >>>> it is, then followed by the compressed code for the level made so that each >>>> weight is encoded by a code not starting by the reserved mark. That >>>> encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!' >>>> (if the encoding must be readable as ASCII or UTF-8-based, and must not use >>>> any control or SPACE or isolated surrogate) and codes used to encode each >>>> weight must not start by a byte lower or equal to this mark. The binary or >>>> ASCII code units used to encode each weight must just be comparable, so >>>> that comparing codes is equivalent to compare weights represented by each >>>> code. >>>> >>>> As well, you are not required to store multiple "marks". This is just >>>> one of the possibilities to encode in the sort key which level is encoded >>>> after each "mark", and the marks are not necessarily the same before each >>>> level (their length may also vary depending on the level they are >>>> starting): these marks may be completely removed from the final encoding if >>>> the encoding/compression used allows discriminating the level used by all >>>> weights, encoded in separate sets of values. >>>> >>>> Typical compression technics are for example differencial, notably in >>>> secondary or higher levels, and run-legth encoded to skip sequences of >>>> weights all equal to the minimum weight. >>>> >>>> The code units used by the weigh encoding for each level may also need >>>> to avoid some forbidden values if needed (e.g. when encoding the weights to >>>> UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units >>>> reserved for or representing an isolate surrogate in U+D800..U+DFFF as this >>>> would create a string not conforming to any standard UTF). >>>> >>>> Once again this means that the sequence of logical weight will can >>>> sefely become a readable string, even suitable to be transmitted as >>>> plain-text using any UTF, and that compression is also possible in that >>>> case: you can create and store lot of sort keys even for very long texts >>>> >>>> However it is generally better to just encode sort keys only for a >>>> reasonnably discriminant part of the text, e.g. 
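In the same spirit, a toy sketch of building a sort key as one tuple of weights per level rather than a single flat array. The minimum weights are assumed values chosen only for illustration, and whether trimming trailing minima is safe in general depends on the well-formedness of the table (see the discussion of level separators and compression in UTS #10, Section 9, cited later in this thread).

    MIN_WEIGHT = {1: 0x0020, 2: 0x0002}   # assumed secondary/tertiary minima

    def sort_key(ces):
        # One tuple of weights per level instead of a flat array with 0000
        # separators; trailing minimum weights on the higher levels are
        # trimmed, which is the optimisation argued for in the quoted message.
        key = []
        for level in (0, 1, 2):
            weights = [ce[level] for ce in ces if ce[level] != 0]
            minimum = MIN_WEIGHT.get(level)
            while minimum is not None and weights and weights[-1] == minimum:
                weights.pop()
            key.append(tuple(weights))
        return tuple(key)

    # Tuples compare lexicographically level by level, so a string still
    # sorts before any of its extensions, which is the property the 0000
    # level separator guarantees in the flat form of the sort key.
    abc  = [(0x0706, 0x20, 0x02), (0x06D9, 0x20, 0x02), (0x06EE, 0x20, 0x02)]
    abcX = abc + [(0x0800, 0x20, 0x02)]
    print(sort_key(abc) < sort_key(abcX))   # True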
no sort key longer than 255 >>>> bytes (created from the start of the original texts): if you compare two >>>> sort keys and find that they are equal, and if both sort keys have this >>>> length of 255 bytes, then you'll compare the full original texts using the >>>> fast-compare algorithm: you don't need to store full sort keys in addition >>>> to the original texts. This can save lot of storage, provided that original >>>> texts are sufficiently discriminated by their start, and that cases where >>>> the sort keys were truncated to the limit of 255 bytes are exceptionnal. >>>> >>>> For short texts however, truncated sortkeys may save time at the price >>>> of a reasonnable storage cost (but sortkeys can be also encoded with >>>> roughly the same size as the original text: compression is modest for the >>>> encoded primary level. But compression is frequently very effective for >>>> higher levels where their smaller weight also have less possible variations >>>> of value, in a smaller set. >>>> >>>> Notably for the secondary level used to encode case differences, only 3 >>>> bits are enough per weight, and you just need to reserve the 3-bit value >>>> "000" as the "mark" for indicating the start of another higher level, while >>>> encoding secondary weights as "001" to "111". >>>> >>>> (This means that primary levels have to be encoded so that none of >>>> their encoded primary weights are starting with "000" marking the start of >>>> the secondary level. So primary weights can be encoded in patterns starting >>>> by "0001", "001", "01", or "1" and followed by other bits: this allows >>>> encoding them as readable UTF-8 if these characters are all different at >>>> primary level, excluding only the 16 first C0 controls which need to be >>>> preprocessed into escape sequences using the first permitted C0 control as >>>> an escape, and escaping that C0 control itself). >>>> >>>> The third level, started by the mark "00" and followed by the encoded >>>> weights indicating this is a tertiary level and not an higher level, will >>>> also be used to encode a small set of weights (in most locales, this is not >>>> more than 8 or 16, so you need only 3 or 4 bits to encode weights (using >>>> differential coding on 3-bits, you reserve "000" as the "mark" for the next >>>> higher level, then use "001" to "111" to encode differencial weights, the >>>> differencial weights being initially based on the minimum tertiary weight, >>>> you'll use the bit pattern "001" to encode the most frequent minimum >>>> tertiary weight, and patterns "01" to "11" plus additional bits to encode >>>> other positive or negative differences of tertiary weights, or to use >>>> run-length compression). Here also it is possible to map the patterns so >>>> that the encoded secondary weight will be readable valid UTF-8. >>>> >>>> The fourth level, started by the mark "000" can use the pattern "001" >>>> to encode the most frequent minimum quaternary weight, and patterns "010" >>>> to "011" followed by other bits to differentially encode the quaternary >>>> weights. Here again it is possible to create an encoding for quaternary >>>> weights that can use some run-length compression and can also be readable >>>> valid UTF-8! >>>> >>>> And so on. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> Le jeu. 1 nov. 2018 ? 
22:04, Philippe Verdy a >>>> ?crit : >>>> >>>>> So it should be clear in the UCA algorithm and in the DUCET datatable >>>>> that "0000" is NOT a valid weight >>>>> It is just a notational placeholder used as ".0000", only indicating >>>>> in the DUCET format that there's NO weight assigned at the indicated level, >>>>> because the collation element is ALWAYS ignorable at this level. >>>>> The DUCET could have as well used the notation ".none", or just >>>>> dropped every ".0000" in its file (provided it contains a data entry >>>>> specifying what is the minimum weight used for each level). This notation >>>>> is only intended to be read by humans editing the file, so they don't need >>>>> to wonder what is the level of the first indicated weight or remember what >>>>> is the minimum weight for that level. >>>>> But the DUCET table is actually generated by a machine and processed >>>>> by machines. >>>>> >>>>> >>>>> >>>>> Le jeu. 1 nov. 2018 ? 21:57, Philippe Verdy a >>>>> ?crit : >>>>> >>>>>> In summary, this step given in the algorithm is completely unneeded >>>>>> and can be dropped completely: >>>>>> >>>>>> *S3.2 *If L is not 1, append >>>>>> a *level separator* >>>>>> >>>>>> *Note:*The level separator is zero (0000), which is guaranteed to be >>>>>> lower than any weight in the resulting sort key. This guarantees that when >>>>>> two strings of unequal length are compared, where the shorter string is a >>>>>> prefix of the longer string, the longer string is always sorted after the >>>>>> shorter?in the absence of special features like contractions. For example: >>>>>> "abc" < "abcX" where "X" can be any character(s). >>>>>> >>>>>> Remove any reference to the "level separator" from the UCA. You never >>>>>> need it. >>>>>> >>>>>> As well this paragraph >>>>>> >>>>>> 7.3 Form Sort Keys >>>>>> >>>>>> *Step 3.* Construct a sort key for each collation element array by >>>>>> successively appending all non-zero weights from the collation element >>>>>> array. Figure 2 gives an example of the application of this step to one >>>>>> collation element array. >>>>>> >>>>>> Figure 2. Collation Element Array to Sort Key >>>>>> >>>>>> Collation Element ArraySort Key >>>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], >>>>>> [.06EE.0020.0002] 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 >>>>>> 0002 0002 0002 >>>>>> >>>>>> can be written with this figure: >>>>>> >>>>>> Figure 2. Collation Element Array to Sort Key >>>>>> >>>>>> Collation Element ArraySort Key >>>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706 >>>>>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002) >>>>>> >>>>>> The parentheses mark the collation weights 0020 and 0002 that can be >>>>>> safely removed if they are respectively the minimum secondary weight and >>>>>> minimum tertiary weight. >>>>>> But note that 0020 is kept in two places as they are followed by a >>>>>> higher weight 0021. This is general for any tailored collation (not just >>>>>> the DUCET). >>>>>> >>>>>> Le jeu. 1 nov. 2018 ? 21:42, Philippe Verdy a >>>>>> ?crit : >>>>>> >>>>>>> The 0000 is there in the UCA only because the DUCET is published in >>>>>>> a format that uses it, but here also this format is useless: you never need >>>>>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET >>>>>>> just needs to indicate what is the minimum weight assigned for every level >>>>>>> (except the highest level where it is "implicitly" 0001, and not 0000). >>>>>>> >>>>>>> >>>>>>> Le jeu. 1 nov. 2018 ? 
21:08, Markus Scherer >>>>>>> a ?crit : >>>>>>> >>>>>>>> There are lots of ways to implement the UCA. >>>>>>>> >>>>>>>> When you want fast string comparison, the zero weights are useful >>>>>>>> for processing -- and you don't actually assemble a sort key. >>>>>>>> >>>>>>>> People who want sort keys usually want them to be short, so you >>>>>>>> spend time on compression. You probably also build sort keys as byte >>>>>>>> vectors not uint16 vectors (because byte vectors fit into more APIs and >>>>>>>> tend to be shorter), like ICU does using the CLDR collation data file. The >>>>>>>> CLDR root collation data file remunges all weights into fractional byte >>>>>>>> sequences, and leaves gaps for tailoring. >>>>>>>> >>>>>>>> markus >>>>>>>> >>>>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 12:34:20 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 02 Nov 2018 18:34:20 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> (Doug Ewell via Unicode's message of "Fri, 02 Nov 2018 08:38:45 -0700") References: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Message-ID: <86bm77o02r.fsf@mimuw.edu.pl> I have a feeling this discussion became too chaotic: about 90 posts in October and about 30 in November, all interesting but too many of them only loosely related to my original post. I propose to close the thread. I hope some time in the future to prepare a short summary (but first I would like to check some technical issues, so it will take some time). Thank you very much to all who contributed. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Nov 2 13:52:05 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 2 Nov 2018 19:52:05 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> Message-ID: <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: [quoted mail] > > Using variation selectors is only appropriate for these existing > (preencoded) superscript letters ? and ? so that they display the > appropriate (underlined or not underlined) glyph. And it is for forcing the display of DIGIT ZERO with a short stroke: 0030 FE00; short diagonal stroke form; # DIGIT ZERO https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt From that it becomes unclear why that isn?t applied to 4, 7, z and Z mentioned in this thread, to be displayed open or with a short bar. 
> It is not a solution for creating superscripts on any letters and > mark that it should be rendered as superscript (notably, the base > letter to transform into superscript may also have its own combining > diacritics, that must be encoded explicitly, and if you use the > varaition selector, it should allow variation on the presence or > absence of the underline (which must then be encoded explicitly as a > combining character. I totally agree that abbreviation indicating superscript should not be encoded using variation selectors, as already stated I don?t prefer it. > > So finally what we get with variation selectors is: variation selector, combining diacritic> and precombined with the diacritic, variation selector> which is NOT > canonically equivalent. That seems to me like a flaw in canonical equivalence. Variations must be canonically equivalent, and the variation selector position should be handled or parsed accordingly. Personally I?m unaware of this rule. > > Using a combining character avoids this caveat: combining diacritic, combining abbreviation mark> and precombined with the diacritic, combining abbreviation mark> which > ARE canonically equivalent. And this explicitly states the semantic > (something that is lost if we are forced to use presentational > superscripts in a higher level protocol like HTML/CSS for rich text > format, and one just extracts the plain text; using collation will > not help at all, except if collators are built with preprocessing > that will first infer the presence of a > to insert after each combining sequence of the plain-text enclosed in > a italic style). That exactly outlines my concern with calls for relegating superscript as an abbreviation indicator to higher level protocols like HTML/CSS. > > There's little risk: if the is not > mapped in fonts (or not recognized by text renderers to create > synthetic superscript scripts from existing recognized clusters), it > will render as a visible .notdef (tofu). But normally text renderers > recognize the basic properties of characters in the UCD and can see > that has a combining mark general > property (it also knows that it has a 0 combinjing class, so > canonical equivalences are not broken) to render a better symbols > than the .notdef "tofu": it should better render a dotted circle. > Even if this tofu or dotted circle is rendered, it still explicitly > marks the presence of the abbreviation mark, so there's less > confusion about what is preceding it (the combining sequence that was > supposed to be superscripted). The problem with the you are proposing is that it contradicts streamlined implementation as well as easy input of current abbreviations like ordinal indicators in French and, optionally, in English. Preformatted superscripts are already widely implemented, and coding of "4?" only needs two characters, input using only three fingers in two times (thumb on AltGr, press key E04 then E12) with an appropriately programmed layout driver. I?m afraid that the solution with would be much less straightforward. 
> > The can also have its own selector> to select other styles when they are optional, such as > adding underlines to the superscripted letter, or rendering the > letter instead as underscript, or as a small baseline letter with a > dot after it: this is still an explicit abbreviation mark, and the > meaning of the plein text is still preserved: the variation selector > is only suitable to alter the rendering of a cluster when it has > effectively several variants and the default rendering is not > universal, notably across font styles initially designed for specific > markets with their own local preferences: the variation selector > still allows the same fonts to map all known variants distinctly, > independantly of the initial arbitrary choice of the default glyph > used when the variation selector is missing). I don?t think German users would welcome being directed to input a plus a instead of a period. > > Even if fonts (or text renderers may map the mark> to variable glyphs, this is purely stylictic, the semantic of > the plain text is not lost because the > is still there. There's no need of any rich-text to encode it (the > rich -text styles are not explicitly encoding that a superscript is > actually an abbreviation mark, so it cannot also allow variation like > rendering an underscript, or a baseline small glyph with an added > dot. Typically a used in an English > style would render the letter (or cluster) before it as a "small" > letter without any added dot. The advantage of preformatted superscripts is that the English user can decide whether he or she wishes the ordinal indicators to be baseline or superscript, while being sure of stable rendering. > > So I really think that is far better > than: > > * using preencoded superscript letters (they don't map all the > necessary repertoire of clusters where the abbreviation is needed, > it now just covers Basic Latin, ten digits, plus and minus signs, and > the dot or comma, plus a few other letters like stops; As seen in this thread, preformatted superscripts are standardized and implemented to get combining diacritics, eg "??", "??". Encoding any more precomposed letters that can be represented as combining sequences is out of scope, and that is the reason why no accented letters will ever be encoded as preformatted superscripts. Correct display of the "S???" abbreviation for French "Soci?t?" ("Company") is already working in browsers, depending on the fonts present on the machine and set in the settings, unless a correct webfont is downloaded and installed ad hoc. > it's impossible to rencode the full Unicode repertoire and its allowed > combining sequences or extended default grapheme clusters!), This persistent and passionate refrain boils down, as already pointed by others and me in this thread, to a continuum bias strawman fight, (ie the refrain is repeated to fight a strawman constructed using the continuum bias, which consists in using a continuum to move someone?s position to an extreme position that is ultimately off-topic). 
> > * or using variation selectors to make them appear as a superscript > (does not work with all clusters containing other diacritics like > accents), > > * or using rich-text styling (from which you cannot safely > infer any semantic (there no warranty that Mr in HTML is > actually an abbreviation of "Mister"; in HTML this is encoded > elsewhere as Mr or > Mr (the semantic of the abbreviation has to > be looked a possible container element and the meaning of the > abbreviation is to look inside its title attribute, so obviously this > requires complex preprocessing before we can infer a plaintext > version (suitable for example in > plain-text searches where you don't want to match a mathematical > object M, like a matrix, elevated to the power r, or a single > plaintext M followed by a footnote call noted by the letter "r"). Indeed HTML is a powerful language to provide rich and meaningful content with many features, so that in comparison, plain text could seem unreadable because it contains all those abbreviations and symbols you need to know. By contrast, plain text in any natural language is to contain just enough information that it is readable for a native reader, and that is the purpose of Unicode. Therefore, dismissing superscript abbreviation indicators to higher level protocols is like looking at a language from outside and telling: ?These are abbreviations anyway, so you probably should also add tooltips for people to learn the meaning.? > > It solves all practical problems: legacy encoding using the > preencoded superscript Latin letters (aka "modifier letters") should > have never been used or needed (not even for IPA usage which could > have used an explicit for its > superscripted symbols, or for its distinctive "a" and "g"). We should > not have needed to encode the variants for "a" and "g": these were > old hacks that broke the Unicode character encoding model since the > beginning. The principle of Unicode is to encode anything that is semantically distinctive in plain text, so encoding IPA letters is totally OK. > However only roundtrip compatibility with legacy non UCS > charsets milited only for keeping the ordinal feminine or ordinal > masculine mark, or the "Numero" cluster (actually made of two > letters, the second one followed by an implicit abbreviation mark, > but transformed in the legacy charset to be treated as a single > unbreakable cluster containing only one symbol; even Unicode > considers the abbreviated Numero as being only "compatibility > equivalent" to the letter N followed by the masculine ordinal symbol, > the latter being also only "compatibility equivalent" to a letter o > with an implicit superscript, but also with an optional combining > underline). These pre-Unicode charsets are a proof that superscripts are required. > > All these superscripts in Unicode (as well as Mathematical "styled" > letters, which were also completely unnecessary and will necessarily > be incomplete for the intended usage) are now to be treated only as > legacy practices, they should be deprecated in favor of the more > semantic and logical character encoding model, deprecating complelely > the legacy visual encoding. Mathematicians like them, and even not being a mathematician, I feel that there are really a lot of styled ?alphabets to choose from?, as Ken Whistler advised on this list in 2015. What uncovered usages are you referring to? 
> > Only precombined characters, recognized by canonical equivalences are > part of the standard and may be kept as "non"-legacy: they still fit > in the logical encoding. As well the extended default grapheme > clusters include the precomposed Hangul LVT and LV syllables, and CGJ > used before combining marks with non-zero combining class, and > variation selectors used only after base letters with the zero > combining class and that start the extended default graphgeme > clusters. > > Let's return to the root of the far better logical encoding which > remains the recommended practice. All the rest is legacy (some of > them came from decision taken to preserve roundtrip compatibility > with legacy charsets, including prepended letters in Thai, and so we > have a few compatibility characters (which are not the recommended > practive), but the rest was bad decisions made by Unicode and ISO WG > to break the logical character encoding model. That criticism only applies to presentation forms, that Unicode was forced to take in at setup, and whose use Unicode ever discouraged, as seen also in this thread. So all languages using superscript to indicate abbreviations are still better served with preformatted superscript letters. The new turn is that many languages, eg Italian, Polish, Portuguese and Spanish, need variation sequences for single or double underscoring, which whill work with OpenType fonts having the appropriate glyph sets, while the variation selector is ignorable for most other machine processing purposes. Best regards, Marcel From unicode at unicode.org Fri Nov 2 15:10:30 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 20:10:30 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> References: <20181102083845.665a7a7059d7ee80bb4d670165c8327d.8c4ea08da3.wbe@email03.godaddy.com> Message-ID: <20181102201030.5d0fa3a6@JRWUBU2> On Fri, 02 Nov 2018 08:38:45 -0700 Doug Ewell via Unicode wrote: > Do we have any other evidence of this usage, besides a single > handwritten postcard? What, beyond some of us actually employing it ourselves? I'm sure I've seen 'William' abbreviated in print to 'W?' with some mark below, but I couldn't lay my hands on an example. Richard. From unicode at unicode.org Fri Nov 2 16:27:37 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 2 Nov 2018 14:27:37 -0700 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: Message-ID: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > I was replying not about the notational repreentation of the DUCET > data table (using [.0000...] unnecessarily) but about the text of > UTR#10 itself. Which remains highly confusive, and contains completely > unnecesary steps, and just complicates things with absoiluytely no > benefit at all by introducing confusion about these "0000". Sorry, Philippe, but the confusion that I am seeing introduced is what you are introducing to the unicode list in the course of this discussion. > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only No, it is explicitly a valid weight. And it is explicitly and normatively referred to in the specification of the algorithm. See UTS10-D8 (and subsequent definitions), which explicitly depend on a definition of "A collation weight whose value is zero." 
The entire statement of what are primary, secondary, tertiary, etc. collation elements depends on that definition. And see the tables in Section 3.2, which also depend on those definitions. > (but the notation is used for TWO distinct purposes: one is for > presenting the notation format used in the DUCET It is *not* just a notation format used in the DUCET -- it is part of the normative definitional structure of the algorithm, which then percolates down into further definitions and rules and the steps of the algorithm. > itself to present how collation elements are structured, the other one > is for marking the presence of a possible, but not always required, > encoding of an explicit level separator for encoding sort keys). That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It is not part of the *notation* for collation elements, but instead is a magic value chosen for the level separator precisely because zero values from the collation elements are removed during sort key construction, so that zero is then guaranteed to be a lower value than any remaining weight added to the sort key under construction. This part of the algorithm is not rocket science, by the way! > > UTR#10 is still needlessly confusive. O.k., if you think so, you then know what to do: https://www.unicode.org/review/pri385/ and https://www.unicode.org/reporting.html > Even the example tables can be made without using these "0000" (for > example in tables showing how to build sort keys, it can present the > list of weights splitted in separate columns, one column per level, > without any "0000". The implementation does not necessarily have to > create a buffer containing all weight values in a row, when separate > buffers for each level is far superior (and even more efficient as it > can save space in memory). The UCA doesn't *require* you to do anything particular in your own implementation, other than come up with the same results for string comparisons. That is clearly stated in the conformance clause of UTS #10. https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance > The step "S3.2" in the UCA algorithm should not even be there (it is > made in favor an specific implementation which is not even efficient > or optimal), That is a false statement. Step S3.2 is there to provide a clear statement of the algorithm, to guarantee correct results for string comparison. Section 9 of UTS #10 provides a whole lunch buffet of techniques that implementations can choose from to increase the efficiency of their implementations, as they deem appropriate. You are free to implement as you choose -- including techniques that do not require any level separators. You are, however, duly warned in: https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators that "While this technique is relatively easy to implement, it can interfere with other compression methods." > it complicates the algorithm with absoluytely no benefit at all); you > can ALWAYS remove it completely and this still generates equivalent > results. No you cannot ALWAYS remove it completely. Whether or not your implementation can do so, depends on what other techniques you may be using to increase performance, store shorter keys, or whatever else may be at stake in your optimization. If you don't like zeroes in collation, be my guest, and ignore them completely. Take them out of your tables, and don't use level separators. Just make sure you end up with conformant result for comparison of strings when you are done. 
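For reference, the sort-key step being discussed is short enough to transcribe directly. A sketch in Python of Section 7.3, Step 3 (including step S3.2), applied to the Figure 2 collation element array quoted earlier in the thread:

    def form_sort_key(ces):
        # UTS #10, Section 7.3, Step 3: for each level, append all non-zero
        # weights of that level; between levels, append the 0000 level
        # separator (step S3.2). Because 0000 is lower than every real
        # weight, a string always sorts before its extensions ("abc" < "abcX").
        key = []
        for level in range(3):
            if level != 0:
                key.append(0x0000)
            key.extend(ce[level] for ce in ces if ce[level] != 0)
        return key

    # The collation element array from Figure 2 of UTS #10:
    figure2 = [(0x0706, 0x0020, 0x0002),
               (0x06D9, 0x0020, 0x0002),
               (0x0000, 0x0021, 0x0002),
               (0x06EE, 0x0020, 0x0002)]

    print(" ".join(f"{w:04X}" for w in form_sort_key(figure2)))
    # -> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002

The output matches the sort key shown in Figure 2, with the two 0000 values acting only as level separators, never as weights taken from the table.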
And in the meantime, if you want to complain about the text of the specification of UTS #10, then provide carefully worded alternatives as suggestions for improvement to the text, rather than just endlessly ranting about how the standard is confusive because the collation weight 0000 is "unnecessary". --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Nov 2 17:32:29 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 2 Nov 2018 22:32:29 +0000 Subject: use vs mention (was: second attempt) In-Reply-To: <20181101074640.2866a022@JRWUBU2> References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> <20181101074640.2866a022@JRWUBU2> Message-ID: <20181102223229.2b593ffa@JRWUBU2> On Thu, 1 Nov 2018 07:46:40 +0000 Richard Wordingham via Unicode wrote: > On Wed, 31 Oct 2018 23:35:06 +0100 > Piotr Karocki via Unicode wrote: > > > These are only examples of changes in meaning with or , > > not all of these examples can really exist - but, then, another > > question: can we know what author means? And as carbon and iodine > > cannot exist, then of course CI should be interpreted as carbon on > > first oxidation? > > Are you sure about the non-existence? Some pretty weird > chemical species exist in interstellar space. It's not interstellar, but CI is the empirical formula for diiodoethyne and its isomer iodoiodanuidylethyne, and the CI? ion has Pubchem CID 59215341. Richard. From unicode at unicode.org Fri Nov 2 20:34:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 3 Nov 2018 01:34:58 +0000 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: <20181103013458.3e0a968d@JRWUBU2> On Fri, 2 Nov 2018 14:27:37 -0700 Ken Whistler via Unicode wrote: > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > UTR#10 still does not explicitly state that its use of "0000" does > > not mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and > normatively referred to in the specification of the algorithm. See > UTS10-D8 (and subsequent definitions), which explicitly depend on a > definition of "A collation weight whose value is zero." The entire > statement of what are primary, secondary, tertiary, etc. collation > elements depends on that definition. And see the tables in Section > 3.2, which also depend on those definitions. The definition is defective in that it doesn't handle 'large weight values' well. There is the anomaly that a mapping of collating element to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the table ill-formed. The fractional weight definitions for UCA eliminate this '0000' notion quite well, and I once expected the UCA to move to the CLDRCA (CLDR Collation Algorithm) fractional weight definition. The definition of the CLDRCA does a much better job of explaining 'large weight values'. It turns them from something exceptional to a normal part of its functioning. 
> > (but the notation is used for TWO distinct purposes: one is for > > presenting the notation format used in the DUCET > > It is *not* just a notation format used in the DUCET -- it is part of > the normative definitional structure of the algorithm, which then > percolates down into further definitions and rules and the steps of > the algorithm. It's not needed for the CLDRCA! The statement of the UCA algorithm does depend on its notation, but it can be recast to avoid these zero weights. Richard. From unicode at unicode.org Sat Nov 3 14:41:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 20:41:54 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: > [quoted mail] > > > > Using variation selectors is only appropriate for these existing > > (preencoded) superscript letters ? and ? so that they display the > > appropriate (underlined or not underlined) glyph. > > And it is for forcing the display of DIGIT ZERO with a short stroke: > 0030 FE00; short diagonal stroke form; # DIGIT ZERO > https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt > > From that it becomes unclear why that isn?t applied to 4, 7, z and Z > mentioned in this thread, to be displayed open or with a short bar. > > > It is not a solution for creating superscripts on any letters and > > mark that it should be rendered as superscript (notably, the base > > letter to transform into superscript may also have its own combining > > diacritics, that must be encoded explicitly, and if you use the > > varaition selector, it should allow variation on the presence or > > absence of the underline (which must then be encoded explicitly as a > > combining character. > > I totally agree that abbreviation indicating superscript should not be > encoded using variation selectors, as already stated I don?t prefer it. > > > > So finally what we get with variation selectors is: > variation selector, combining diacritic> and > precombined with the diacritic, variation selector> which is NOT > > canonically equivalent. > > That seems to me like a flaw in canonical equivalence. Variations must > be canonically equivalent, and the variation selector position should > be handled or parsed accordingly. Personally I?m unaware of this rule. > > > > Using a combining character avoids this caveat: > combining diacritic, combining abbreviation mark> and > precombined with the diacritic, combining abbreviation mark> which > > ARE canonically equivalent. 
And this explicitly states the semantic > > (something that is lost if we are forced to use presentational > > superscripts in a higher level protocol like HTML/CSS for rich text > > format, and one just extracts the plain text; using collation will > > not help at all, except if collators are built with preprocessing > > that will first infer the presence of a > > to insert after each combining sequence of the plain-text enclosed in > > a italic style). > > That exactly outlines my concern with calls for relegating superscript > as an abbreviation indicator to higher level protocols like HTML/CSS. > That's exactlky my concern that this relation to HTML/CSS should NOT occur at all ! It's really not the solution, HTML/CSS styles have NO semantic at all (I demonstrated it in the message you are quoting). > > There's little risk: if the is not > > mapped in fonts (or not recognized by text renderers to create > > synthetic superscript scripts from existing recognized clusters), it > > will render as a visible .notdef (tofu). But normally text renderers > > recognize the basic properties of characters in the UCD and can see > > that has a combining mark general > > property (it also knows that it has a 0 combinjing class, so > > canonical equivalences are not broken) to render a better symbols > > than the .notdef "tofu": it should better render a dotted circle. > > Even if this tofu or dotted circle is rendered, it still explicitly > > marks the presence of the abbreviation mark, so there's less > > confusion about what is preceding it (the combining sequence that was > > supposed to be superscripted). > > The problem with the you are proposing > is that it contradicts streamlined implementation as well as easy > input of current abbreviations like ordinal indicators in French and, > optionally, in English. Preformatted superscripts are already widely > implemented, and coding of "4?" only needs two characters, input > using only three fingers in two times (thumb on AltGr, press key > E04 then E12) with an appropriately programmed layout driver. I?m > afraid that the solution with would be > much less straightforward. > This is not a real concern: this is legacy old practives that should no longer be recommanded as it is ambiguous (nothing says that "4?" is an abbreviated ordinal, it can as well be 4 elevated to the power e, or various other things). Also the keys to press on a keyboard is absolutely not a concern: the same key presses you propose can as well generate the letter followed by the combining abbreviation mark. In fact what you propose is even less practical because it uses complex input for all characters and requires mapping keys on the whole alphabet (so it uses precious space on the key layout). It's just simpler for everyone to press "4", "e", followed by a combination (like AltGr+".") to produce the ! And these legacy superscript characters still are not warrantied to not have any underline (the variation may as well be significant), and there will never be enough superscript characters for the many superscript notations (not just abbreviations) that should still be encoded the normal letters (including in clusters, with diacritics, ligatures and so on): Unicode will never accept to reencode all existing letters (plus all the infinite set of clusters that can be formed with them) just to turn them into superscript/subscript variants. 
These encodings that found their way from the need of roundtrip compatibility of legacy charsets (before the UCS) should have never occured at all: these should have not even been tolerated for IPA symbols, for mathematical symbols (monospace, bold, italic...). The variation selector solution is also not suitable when the intent is only to add semantic to the encoded text and not drive the choice between glyph variants (when the default glyph without the variant selector can FREELY vary into forms that are UNACCEPTABLE in some contexts, then the variation does not really encode the semantic but encodes the visual rendering intent: it is too easily abuse to do something else). But a single *semantic* combining mark does not encode any visual rendering intent like what variation selectors do. They still allow glyphic variations as long as the the semantic is kept, and they have the correct fallbacks (there's no obscuring of the encoding of the clusters to which the semantic combining mark applies: they are still part of the same general encoding as normal letters, and rendering abbreviation mark does not necessarily means that the base cluster MUST be rendered differently than normal letters: it is permitted as well to render the combining mark for example as a dot, or as a true diacritic on top of the letters). And if needed the following can control the visual appearence: > > > > The can also have its own > selector> to select other styles when they are optional, such as > > adding underlines to the superscripted letter, or rendering the > > letter instead as underscript, or as a small baseline letter with a > > dot after it: this is still an explicit abbreviation mark, and the > > meaning of the plein text is still preserved: the variation selector > > is only suitable to alter the rendering of a cluster when it has > > effectively several variants and the default rendering is not > > universal, notably across font styles initially designed for specific > > markets with their own local preferences: the variation selector > > still allows the same fonts to map all known variants distinctly, > > independantly of the initial arbitrary choice of the default glyph > > used when the variation selector is missing). > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 15:02:23 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 21:02:23 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: As well the separate encoding of mathematical variants could have been completely avoided (we know that this encoding is not sufficient, so much that even LaTeX renderers simply don't need it or use it !). We could have just encoded a single to use after any base cluster, and the whole set was covered ! The additional distinction of visual variants (monospace, bold, italic...) 
would have been encoded using variation selectors after the : the semantic as a mathematical symbols was still preserved including the additional semantic for distinguishing some symbols in maths notations like "f(f)=f" where the 3 "f" must be distinguished (between the function in a set of functions, the source belonging to one set of values or being a variable, and the result in another set which may be a value or variable. Once again this covered all the needs without using this duplicate encoding (that was NEVER needed for roundtrip compatibility with legacy non-UCS charsets). All I ask is reasonnable: it's just a SINGLE code point to encode the combining mark itself, semantically, NOT visually. The visual appearance can be controlled by an additional variation selector to cancel the effect of glyph variations allowed for ALL characters in the UCS, where there's just a **non-mandatory** form generally used by default in fonts and matching more or less the "representative glyph" shown in the Unicode and ISO 10646 charts, which cannot show all allowed variations (if there's a need to detail them, Unicode offers the possibility to ask to register known "variation sequences" which can feed a supplementary chart showing more representative glyphs, one for each accepted "variation sequence", but without even needing to modify the "representative glyph" shown in the base chart. Note that even if Unicode requires registration of variation sequences prior to using them, the published charts still omit to add the additional charts (just below the existing base chart) showing representative glyphs for accepted sequences, with one small chart per base character, listing them simply ordered by "VSn" value. All what Unicode publishes is only a mere data list with some names (not enough for most users to be ware that variations can be encoded explicitly and compliantly) Le sam. 3 nov. 2018 ? 20:41, Philippe Verdy a ?crit : > > > Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < > unicode at unicode.org> a ?crit : > >> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: >> [quoted mail] >> > >> > Using variation selectors is only appropriate for these existing >> > (preencoded) superscript letters ? and ? so that they display the >> > appropriate (underlined or not underlined) glyph. >> >> And it is for forcing the display of DIGIT ZERO with a short stroke: >> 0030 FE00; short diagonal stroke form; # DIGIT ZERO >> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt >> >> From that it becomes unclear why that isn?t applied to 4, 7, z and Z >> mentioned in this thread, to be displayed open or with a short bar. >> >> > It is not a solution for creating superscripts on any letters and >> > mark that it should be rendered as superscript (notably, the base >> > letter to transform into superscript may also have its own combining >> > diacritics, that must be encoded explicitly, and if you use the >> > varaition selector, it should allow variation on the presence or >> > absence of the underline (which must then be encoded explicitly as a >> > combining character. >> >> I totally agree that abbreviation indicating superscript should not be >> encoded using variation selectors, as already stated I don?t prefer it. >> > >> > So finally what we get with variation selectors is: > > variation selector, combining diacritic> and > > precombined with the diacritic, variation selector> which is NOT >> > canonically equivalent. >> >> That seems to me like a flaw in canonical equivalence. 
Variations must >> be canonically equivalent, and the variation selector position should >> be handled or parsed accordingly. Personally I?m unaware of this rule. >> > >> > Using a combining character avoids this caveat: > > combining diacritic, combining abbreviation mark> and > > precombined with the diacritic, combining abbreviation mark> which >> > ARE canonically equivalent. And this explicitly states the semantic >> > (something that is lost if we are forced to use presentational >> > superscripts in a higher level protocol like HTML/CSS for rich text >> > format, and one just extracts the plain text; using collation will >> > not help at all, except if collators are built with preprocessing >> > that will first infer the presence of a >> > to insert after each combining sequence of the plain-text enclosed in >> > a italic style). >> >> That exactly outlines my concern with calls for relegating superscript >> as an abbreviation indicator to higher level protocols like HTML/CSS. >> > > That's exactlky my concern that this relation to HTML/CSS should NOT occur > at all ! It's really not the solution, HTML/CSS styles have NO semantic at > all (I demonstrated it in the message you are quoting). > > >> > There's little risk: if the is not >> > mapped in fonts (or not recognized by text renderers to create >> > synthetic superscript scripts from existing recognized clusters), it >> > will render as a visible .notdef (tofu). But normally text renderers >> > recognize the basic properties of characters in the UCD and can see >> > that has a combining mark general >> > property (it also knows that it has a 0 combinjing class, so >> > canonical equivalences are not broken) to render a better symbols >> > than the .notdef "tofu": it should better render a dotted circle. >> > Even if this tofu or dotted circle is rendered, it still explicitly >> > marks the presence of the abbreviation mark, so there's less >> > confusion about what is preceding it (the combining sequence that was >> > supposed to be superscripted). >> >> The problem with the you are proposing >> is that it contradicts streamlined implementation as well as easy >> input of current abbreviations like ordinal indicators in French and, >> optionally, in English. Preformatted superscripts are already widely >> implemented, and coding of "4?" only needs two characters, input >> using only three fingers in two times (thumb on AltGr, press key >> E04 then E12) with an appropriately programmed layout driver. I?m >> afraid that the solution with would be >> much less straightforward. >> > > This is not a real concern: this is legacy old practives that should no > longer be recommanded as it is ambiguous (nothing says that "4?" is an > abbreviated ordinal, it can as well be 4 elevated to the power e, or > various other things). > > Also the keys to press on a keyboard is absolutely not a concern: the same > key presses you propose can as well generate the letter followed by the > combining abbreviation mark. In fact what you propose is even less > practical because it uses complex input for all characters and requires > mapping keys on the whole alphabet (so it uses precious space on the key > layout). It's just simpler for everyone to press "4", "e", followed by a > combination (like AltGr+".") to produce the ! 
> > And these legacy superscript characters still are not warrantied to not > have any underline (the variation may as well be significant), and there > will never be enough superscript characters for the many superscript > notations (not just abbreviations) that should still be encoded the normal > letters (including in clusters, with diacritics, ligatures and so on): > Unicode will never accept to reencode all existing letters (plus all the > infinite set of clusters that can be formed with them) just to turn them > into superscript/subscript variants. These encodings that found their way > from the need of roundtrip compatibility of legacy charsets (before the > UCS) should have never occured at all: these should have not even been > tolerated for IPA symbols, for mathematical symbols (monospace, bold, > italic...). > > The variation selector solution is also not suitable when the intent is > only to add semantic to the encoded text and not drive the choice between > glyph variants (when the default glyph without the variant selector can > FREELY vary into forms that are UNACCEPTABLE in some contexts, then the > variation does not really encode the semantic but encodes the visual > rendering intent: it is too easily abuse to do something else). > But a single *semantic* combining mark does not encode any visual > rendering intent like what variation selectors do. They still allow glyphic > variations as long as the the semantic is kept, and they have the correct > fallbacks (there's no obscuring of the encoding of the clusters to which > the semantic combining mark applies: they are still part of the same > general encoding as normal letters, and rendering abbreviation mark does > not necessarily means that the base cluster MUST be rendered differently > than normal letters: it is permitted as well to render the combining mark > for example as a dot, or as a true diacritic on top of the letters). And if > needed the following can control the visual appearence: > >> > >> > The can also have its own > > selector> to select other styles when they are optional, such as >> > adding underlines to the superscripted letter, or rendering the >> > letter instead as underscript, or as a small baseline letter with a >> > dot after it: this is still an explicit abbreviation mark, and the >> > meaning of the plein text is still preserved: the variation selector >> > is only suitable to alter the rendering of a cluster when it has >> > effectively several variants and the default rendering is not >> > universal, notably across font styles initially designed for specific >> > markets with their own local preferences: the variation selector >> > still allows the same fonts to map all known variants distinctly, >> > independantly of the initial arbitrary choice of the default glyph >> > used when the variation selector is missing). >> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From unicode at unicode.org Sat Nov 3 15:45:40 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 3 Nov 2018 21:45:40 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
 <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl>
 <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
Message-ID: 

As an additional remark, I find that Unicode is slowly abandoning its initial goals of encoding texts logically and semantically. This contrasts with the initial ISO 10646, which wanted to produce a giant visual encoding, based only on code charts (without any character properties except glyph names, and an almost mandatory "representative glyph" which in fact allowed no variation at all).

The initial ISO 10646 goal failed to reach global adoption. What proved to be extremely successful (and allowed easier processing of text, without limiting the variation of glyph designs needed and wanted for the orthography of human languages) was the Unicode character encoding model, based on logical, semantic encoding. This drove the worldwide adoption (and now the rapid abandonment of legacy charsets, all based on visual appearance and bare code charts, like the early ISO 10646 and all past 7-bit and 8-bit ISO standards, or other national standards, including in China, Japan and Europe, or those made and promoted by private hardware manufacturers or software providers, frequently with legal restrictions as well, such as MacRoman with its well-known Apple logo).

It is disheartening to see that Unicode does not resist this, and even now refuses the idea of adding just a few simple combining characters (which fit perfectly in its character encoding model, still allow efficient text processing, and render with reasonable fallbacks) that would explicitly encode the semantics. A good example in Latin: look at why the lowercase eth letter seems to have three codes: this is because they have different semantics but also map to different uppercase letters. Being able to transform letter case, and being able to use collation for plain-text search, are extremely useful features possible only because of Unicode character properties, and impossible with just the visual encoding and charts of ISO 10646; the same is true of Latin A versus Cyrillic A and Greek ALPHA: the semantics is the first goal to respect, thanks to Unicode character properties and the Unicode character model, but the visual encoding is definitely not a goal.

So before encoding characters in Unicode, the glyph variation is not enough (this occurs everywhere in human languages): you need proof, with contrasting pairs, showing that the glyph difference makes a semantic difference and requires different processing (different character properties).

Unicode has succeeded everywhere ISO 10646 failed: efficient processing of human languages with their wide variation of orthographies and visual appearance.
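To make the Latin/Cyrillic/Greek example concrete, here is a minimal sketch using Python's standard unicodedata module (the three capital letters are just an illustration; nothing here is specific to the eth case):

    import unicodedata

    # Latin A, Cyrillic A and Greek ALPHA look alike, but each is a distinct
    # character with its own properties (script, case mapping, collation).
    for ch in ("\u0041", "\u0410", "\u0391"):
        low = ch.lower()
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}",
              "-> lowercase", f"U+{ord(low):04X} {unicodedata.name(low)}")

    # A purely visual encoding that unified the three on the strength of the
    # shared glyph would make case mapping and language-aware searching
    # impossible without out-of-band information.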
The other goals (supporting technical notations, like IPA, maths, music, and now emojis!), driven by glyph requirements everywhere (mandated in their own relevant standard) is where Unicode can and even should promote the use of variation sequences, and definitely not dual encoding as this was done (Unicode abandoning its most useful goal, not resisting to the pressure of some industries: this has just created more issues, with more difficulties to correctly and efficiently process texts written in humane languages). The more Unicode evolves, the more I see that it will turn the UCS in what the ISO 10646 attempted to do (and failed): turn the UCS into a visual encoding, refusing to encode **efficiently** any semantic differences. And this will become a severe problems later with the constant evolution of humane languages. I press Unicode to maintain its "character encoding model" as the path to follow, and that it should be driven by semantic goals. It has every features needed for that : combining sequences (including CGJ because of canonical equivalences that were needed due to roundtrip compatibility with legacy non-UCS charsets), variation selectors (ONLY to optionally add some *semantic* restrictions in the largely allowed variation of glyphs and still preserve distinction between contrasting pairs, but NOT as a way to encode non-semantic styles), and character properties to allow efficient processing. Le sam. 3 nov. 2018 ? 21:02, Philippe Verdy a ?crit : > As well the separate encoding of mathematical variants could have been > completely avoided (we know that this encoding is not sufficient, so much > that even LaTeX renderers simply don't need it or use it !). > > We could have just encoded a single to use > after any base cluster, and the whole set was covered ! > > The additional distinction of visual variants (monospace, bold, italic...) > would have been encoded using variation selectors after the mathematical symbol>: the semantic as a mathematical symbols was still > preserved including the additional semantic for distinguishing some symbols > in maths notations like "f(f)=f" where the 3 "f" must be distinguished > (between the function in a set of functions, the source belonging to one > set of values or being a variable, and the result in another set which may > be a value or variable. > > Once again this covered all the needs without using this duplicate > encoding (that was NEVER needed for roundtrip compatibility with legacy > non-UCS charsets). > > All I ask is reasonnable: it's just a SINGLE code point to encode the > combining mark itself, semantically, NOT visually. > > The visual appearance can be controlled by an additional variation > selector to cancel the effect of glyph variations allowed for ALL > characters in the UCS, where there's just a **non-mandatory** form > generally used by default in fonts and matching more or less the > "representative glyph" shown in the Unicode and ISO 10646 charts, which > cannot show all allowed variations (if there's a need to detail them, > Unicode offers the possibility to ask to register known "variation > sequences" which can feed a supplementary chart showing more representative > glyphs, one for each accepted "variation sequence", but without even > needing to modify the "representative glyph" shown in the base chart. 
> > Note that even if Unicode requires registration of variation sequences > prior to using them, the published charts still omit to add the additional > charts (just below the existing base chart) showing representative glyphs > for accepted sequences, with one small chart per base character, listing > them simply ordered by "VSn" value. All what Unicode publishes is only a > mere data list with some names (not enough for most users to be ware that > variations can be encoded explicitly and compliantly) > > > Le sam. 3 nov. 2018 ? 20:41, Philippe Verdy a ?crit : > >> >> >> Le ven. 2 nov. 2018 ? 20:01, Marcel Schneider via Unicode < >> unicode at unicode.org> a ?crit : >> >>> On 02/11/2018 17:45, Philippe Verdy via Unicode wrote: >>> [quoted mail] >>> > >>> > Using variation selectors is only appropriate for these existing >>> > (preencoded) superscript letters ? and ? so that they display the >>> > appropriate (underlined or not underlined) glyph. >>> >>> And it is for forcing the display of DIGIT ZERO with a short stroke: >>> 0030 FE00; short diagonal stroke form; # DIGIT ZERO >>> https://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt >>> >>> From that it becomes unclear why that isn?t applied to 4, 7, z and Z >>> mentioned in this thread, to be displayed open or with a short bar. >>> >>> > It is not a solution for creating superscripts on any letters and >>> > mark that it should be rendered as superscript (notably, the base >>> > letter to transform into superscript may also have its own combining >>> > diacritics, that must be encoded explicitly, and if you use the >>> > varaition selector, it should allow variation on the presence or >>> > absence of the underline (which must then be encoded explicitly as a >>> > combining character. >>> >>> I totally agree that abbreviation indicating superscript should not be >>> encoded using variation selectors, as already stated I don?t prefer it. >>> > >>> > So finally what we get with variation selectors is: >> > variation selector, combining diacritic> and >> > precombined with the diacritic, variation selector> which is NOT >>> > canonically equivalent. >>> >>> That seems to me like a flaw in canonical equivalence. Variations must >>> be canonically equivalent, and the variation selector position should >>> be handled or parsed accordingly. Personally I?m unaware of this rule. >>> > >>> > Using a combining character avoids this caveat: >> > combining diacritic, combining abbreviation mark> and >> > precombined with the diacritic, combining abbreviation mark> which >>> > ARE canonically equivalent. And this explicitly states the semantic >>> > (something that is lost if we are forced to use presentational >>> > superscripts in a higher level protocol like HTML/CSS for rich text >>> > format, and one just extracts the plain text; using collation will >>> > not help at all, except if collators are built with preprocessing >>> > that will first infer the presence of a >>> > to insert after each combining sequence of the plain-text enclosed in >>> > a italic style). >>> >>> That exactly outlines my concern with calls for relegating superscript >>> as an abbreviation indicator to higher level protocols like HTML/CSS. >>> >> >> That's exactlky my concern that this relation to HTML/CSS should NOT >> occur at all ! It's really not the solution, HTML/CSS styles have NO >> semantic at all (I demonstrated it in the message you are quoting). 
>> >> >>> > There's little risk: if the is not >>> > mapped in fonts (or not recognized by text renderers to create >>> > synthetic superscript scripts from existing recognized clusters), it >>> > will render as a visible .notdef (tofu). But normally text renderers >>> > recognize the basic properties of characters in the UCD and can see >>> > that has a combining mark general >>> > property (it also knows that it has a 0 combinjing class, so >>> > canonical equivalences are not broken) to render a better symbols >>> > than the .notdef "tofu": it should better render a dotted circle. >>> > Even if this tofu or dotted circle is rendered, it still explicitly >>> > marks the presence of the abbreviation mark, so there's less >>> > confusion about what is preceding it (the combining sequence that was >>> > supposed to be superscripted). >>> >>> The problem with the you are proposing >>> is that it contradicts streamlined implementation as well as easy >>> input of current abbreviations like ordinal indicators in French and, >>> optionally, in English. Preformatted superscripts are already widely >>> implemented, and coding of "4?" only needs two characters, input >>> using only three fingers in two times (thumb on AltGr, press key >>> E04 then E12) with an appropriately programmed layout driver. I?m >>> afraid that the solution with would be >>> much less straightforward. >>> >> >> This is not a real concern: this is legacy old practives that should no >> longer be recommanded as it is ambiguous (nothing says that "4?" is an >> abbreviated ordinal, it can as well be 4 elevated to the power e, or >> various other things). >> >> Also the keys to press on a keyboard is absolutely not a concern: the >> same key presses you propose can as well generate the letter followed by >> the combining abbreviation mark. In fact what you propose is even less >> practical because it uses complex input for all characters and requires >> mapping keys on the whole alphabet (so it uses precious space on the key >> layout). It's just simpler for everyone to press "4", "e", followed by a >> combination (like AltGr+".") to produce the ! >> >> And these legacy superscript characters still are not warrantied to not >> have any underline (the variation may as well be significant), and there >> will never be enough superscript characters for the many superscript >> notations (not just abbreviations) that should still be encoded the normal >> letters (including in clusters, with diacritics, ligatures and so on): >> Unicode will never accept to reencode all existing letters (plus all the >> infinite set of clusters that can be formed with them) just to turn them >> into superscript/subscript variants. These encodings that found their way >> from the need of roundtrip compatibility of legacy charsets (before the >> UCS) should have never occured at all: these should have not even been >> tolerated for IPA symbols, for mathematical symbols (monospace, bold, >> italic...). >> >> The variation selector solution is also not suitable when the intent is >> only to add semantic to the encoded text and not drive the choice between >> glyph variants (when the default glyph without the variant selector can >> FREELY vary into forms that are UNACCEPTABLE in some contexts, then the >> variation does not really encode the semantic but encodes the visual >> rendering intent: it is too easily abuse to do something else). 
>> But a single *semantic* combining mark does not encode any visual >> rendering intent like what variation selectors do. They still allow glyphic >> variations as long as the the semantic is kept, and they have the correct >> fallbacks (there's no obscuring of the encoding of the clusters to which >> the semantic combining mark applies: they are still part of the same >> general encoding as normal letters, and rendering abbreviation mark does >> not necessarily means that the base cluster MUST be rendered differently >> than normal letters: it is permitted as well to render the combining mark >> for example as a dot, or as a true diacritic on top of the letters). And if >> needed the following can control the visual appearence: >> >>> > >>> > The can also have its own >> > selector> to select other styles when they are optional, such as >>> > adding underlines to the superscripted letter, or rendering the >>> > letter instead as underscript, or as a small baseline letter with a >>> > dot after it: this is still an explicit abbreviation mark, and the >>> > meaning of the plein text is still preserved: the variation selector >>> > is only suitable to alter the rendering of a cluster when it has >>> > effectively several variants and the default rendering is not >>> > universal, notably across font styles initially designed for specific >>> > markets with their own local preferences: the variation selector >>> > still allows the same fonts to map all known variants distinctly, >>> > independantly of the initial arbitrary choice of the default glyph >>> > used when the variation selector is missing). >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 16:55:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 22:55:17 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: I can give other interesting examples about why the Unicode "character encoding model" is the best option Just consider how the Hangul alphabet is (now) encoded: its consonnant letters are encoded "twice" (leading and trailing jamos) because they carry semantic distinctions for efficient processing of Korean text where syllable boundaries are significant to disambiguate text ; this apparent "double encoding" also has a visual model (still currently employed) to *preferably* (not mandatorily) render syllables in a well defined square layout. But the square layout causes significant rendering issues (notably at small font sizes), so it is also possible to render the syllable by aligning letters horizontally. 
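As a minimal sketch of this leading/trailing (choseong/jongseong) distinction in practice, using Python's standard unicodedata module (the syllable 한 is an arbitrary example):

    import unicodedata

    syllable = "\uD55C"                       # 한 HANGUL SYLLABLE HAN
    jamos = unicodedata.normalize("NFD", syllable)

    # The precomposed syllable decomposes canonically into conjoining jamos:
    # a leading consonant (choseong), a vowel (jungseong) and a trailing
    # consonant (jongseong), so syllable boundaries remain unambiguous.
    for ch in jamos:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+1112  HANGUL CHOSEONG HIEUH
    # U+1161  HANGUL JUNGSEONG A
    # U+11AB  HANGUL JONGSEONG NIEUN

    # The recomposition is lossless, so the two spellings are equivalent:
    assert unicodedata.normalize("NFC", jamos) == syllable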
This horizontal layout was used in the "compatibility jamos" of old terminals and printers (but unfortunately without marking the syllable boundaries explicitly before groups of consonants, after them, or in the middle of a group); due to the need to preserve round-trip compatibility with the non-UCS encodings, the "compatibility jamos" had to be encoded separately, even if their use is no longer recommended for normal Korean text, which should explicitly encode syllable boundaries by distinguishing leading and trailing consonants (this is equivalent to the distinction of letter case in Latin: leading jamos in Hangul are exactly like our Latin capital consonants, trailing jamos in Hangul are exactly like our Latin small letters; the vowel jamos in Hangul, however, are unicameral... for now).

But Hangul is still a true alphabet (it is in fact much simpler than Greek or Cyrillic, and Latin is the most complex script in the world!). Thanks to this new (recommended) encoding of Hangul, which adopts a **semantic** and **logical** model, it is possible to process Korean text very efficiently (and in fact very simply). The earlier attempt at encoding Korean was made while the ISO 10646 goals were thought to be enough (so it was a **visual** encoding): it failed even though this earlier encoding entered the first versions of Unicode, and it created a severe precedent in which the stability of Unicode (and upward compatibility) was broken.

I can also cite the case of Egyptian hieroglyphs: there's still no way to render them correctly, because we lack the development of a stable orthography that would drive the encoding of the missing **semantic** characters (for this reason Egyptian hieroglyphs still require a higher-level protocol, as there's still no accepted orthographic norm that successfully represents all possible semantic variations, but also because research on old Egyptian hieroglyphs is still very incomplete). The same can be said about Mayan hieroglyphs. And because there's still no semantic encoding of real texts, it's almost impossible to process text in these scripts: the characters encoded are ONLY basic glyphs (we don't know what their allowed variations can be, so we cannot use them safely to compose combining sequences: they are merely a collection of symbols, not a human script).

In my opinion, there was absolutely no urgency to encode them in the UCS (except not resisting the pressure to allow fonts containing these glyphs to be interchanged; but it remains impossible to encode and compose complete text with only these fonts: you still need an orthographic convention, and there's still no consensus about it; as well, the standard higher-level protocols like HTML/CSS cannot compose them correctly and efficiently). This encoding was not necessary, as these fonts containing collections of glyphs could have remained encoded with a private-use convention, i.e. with PUAs required only by the attempted (but not agreed) protocols.

I think, on the contrary, that Visible Speech or Duployé shorthand will reach a point where they have developed a stable orthographic convention: there will be a standard, and this standard will request that Unicode encode the missing **semantic** characters. This path should also be followed now for encoding emojis (there's an early development of an orthography for them, done by Unicode itself, but I'm not sure this is part of its mission: emoji orthographic conventions should be made by a separate committee).
Unfortunately Unicode is starting to create this orthography without developing what should come with it: its integration in the Unicode "character encoding model" (which should then be reviewed to meet the goals wanted for the composition of emoji sequences). A clear set of character properties for emojis needs to be developed, and then the emoji subcommittee can work with it (like what the IRG does for ideographic scripts). But for now any revision of the emoji set adds new incompatibilities and inefficiencies in processing text correctly (for example, it's nearly impossible to define the boundaries between clusters of emojis). Just consider what is also still missing for Egyptian and Mayan hieroglyphs, or Visible Speech, or Duployé shorthand: please resist the pressure, and stop complicating the rules for emojis. We need rules, and these rules must be integrated in the character encoding model and in the first chapters of the Unicode Standard!

But please don't resist so much the legitimate goal of adding a few simple semantic characters that can greatly increase the usability and "universality" of the UCS: this can be done without continuously adding new duplicate encodings. The duplicate encodings can be kept, but should be considered only as legacy, i.e. like other "compatibility characters", no longer recommended but still usable. This should be just like the Hangul compatibility "half-width" jamos in the last block of the BMP, in which T and L consonants are not distinguished (only L consonants are encoded and are ambiguously reused for T consonants) and only TL clusters are unambiguous (but cannot be safely associated with surrounding T compatibility jamos, so it's impossible to compose them safely into syllabic squares, and impossible to determine some semantic differences if syllable boundaries can only be "guessed" with a heuristic and some dictionary lookup to find only the most probable meaning).

These legacy characters (introduced by Unicode itself, for bad reasons, or because the UTC did not resist some commercial pressure) have just polluted the UCS needlessly and complicated everything (and for a long time): they remain there as apparent duplicates but with no clear semantics, and cause various problems (including security problems): most of these "compatibility characters" are now strongly discouraged, or even forbidden in uses where security is an issue. And this is the case for almost all superscripts/subscripts (those not justified by round-trip compatibility with a legacy standard). But now Unicode must keep these characters in its own standard to preserve round-trip compatibility with its own initial versions!

But this does not mean that these characters cannot be deprecated and treated later as "compatibility characters", even if they are not part of the current standard normalizations NFKD and NFKC (which have limited legacy use). These NFKC and NFKD forms should now be replaced by two more convenient "Legacy Normalization Forms", which I would abbreviate as "NFLC" and "NFLD", very useful for example for the default collations in the DUCET or the CLDR "root" locale, except that they would not be frozen, like the existing NFKC and NFKD are, by the very limited "compatibility mappings" found in the historic main file of the UCD, which cannot follow the evolution of recommended best practices.
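As a rough idea of what such a layer could look like in code, here is a minimal sketch (Python, on top of the standard unicodedata module). Nothing here is an existing Unicode normalization form, and the single mapping entry is invented purely for illustration (it folds U+00AA FEMININE ORDINAL INDICATOR to a plain "a"); the real data would come from a versioned mapping file in the format described below:

    import unicodedata

    # (deprecated sequence, preferred sequence, Unicode version of deprecation)
    # Hypothetical entries only -- this is not UCD data.
    LEGACY_MAPPINGS = [
        ("\u00AA", "a", 10.0),
    ]

    def nfld(text, max_version=float("inf")):
        # Start from canonical decomposition, apply the legacy mappings already
        # in effect for the requested version, then re-apply NFD so that
        # NFD(NFLD(x)) == NFLD(x) holds and canonical equivalence is preserved.
        text = unicodedata.normalize("NFD", text)
        for old, new, since in LEGACY_MAPPINGS:
            if since <= max_version:
                text = text.replace(old, new)
        return unicodedata.normalize("NFD", text)

    def nflc(text, max_version=float("inf")):
        return unicodedata.normalize("NFC", nfld(text, max_version))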
Unlike NFKC and NFKD, the NFLC and NFLD would be an extensible superset based on MUTABLE character properties (this can also be "decompositions mappings" except that once a character is added to the new property file, they won't be removed, and can have some stability as well, where the decision to "deprecate" old encodings can only be done if there's a new recommandation, and that if ever this recommandation changes and is deprecated, the previous "legacy decomposition mappings" can still be decomposed again to the new decompositions recommanded): unlike NFKC, and NFKD, a "legacy decomposition" is not "final" in all future versions, and a future version may remap them by just adding new entries for the new characters considered to be "legacy" and no longer recommended. This new properties file would allow evolution and adaptation to humane languages, and will allow correcting past errors in the standard. This file should have this form: # deprecated codepoint(s) ; new preferred sequence ; Unicode version ins which it was deprecated 101234 ; 101230 0300... ; 10.0 This file can also be used to deprecate some old variation sequences, or some old clusters made of multiple characters that are isolately not deprecated. Thanks. Le sam. 3 nov. 2018 ? 21:45, Philippe Verdy a ?crit : > As an additional remark, I find that Unicode is slowly abandoning its > initial goals of encoding texts logically and semantically. This was > contrasting to the initial ISO 106464 which wanted to produce a giant > visual encoding, based only on code charts (without any character > properties except glyph names and an almost mandatory "representative > glyph" which allowed in fact no variation at all). > > The initial ISO 10646 goal failed to reach a global adoption. What proved > to be extremely successful (and allowed easier processing of text, without > limiting the variation of glyph designs needed and wanted for the > orthography of human languages) was the Unicode character encoding model, > based on logical semantic encoding. This drove the worldwide adoption (and > now the fast abandon of legacy charsets, all based on visual appearance and > basic code charts, like in ISO 10646 and all past 7-bit and 8-bit ISO > standards, or other national standards, including in China, Japan, Europe, > or made and promoted by private hardware manufacturers or software > providers, frequently as well with legal restrictions such as MacRoman with > its well known Apple logo) > > It is desesperating to see that Unicode does not resist to that, and even > now refuses the idea of adding just a few simple combining characters (that > fit perfectly in its character encoding model, and still allows efficient > text processing, and rendering with reasonnable fallbacks) that will > explicitly encode the semantics (a good example in Latin: look at why the > lower case eth letter seems to have three codes: this is because theiy have > different semantics but also map to different uppercase letters, and being > able to transform letter cases, and being able to use collation for > plain-text search is an extremely useful feature possible only because of > Unicode character properties, but impossible to do with just the visual > encoding and charts of ISO 10646; the same is true about Latin A versus > Cyrillic A and Greek ALPHA: the semantics is the first goal to respect, > thanks to Unicode character properties and the Unicode character model, but > the visual encoding is definitely not a goal). 
> > So before encoding characters in Unicode, the glyph variation is not > enough (this occurs everywhere in humane languages): you need a proof with > contrasting pairs, showing that the glyph difference makes a semantic > difference and requires different processing (different character > properties). > > Unicode has succeeded everywhere ISO 10646 has failed: efficient > processing of humane languages with their wide variation of orthographies > and visual appearance. The other goals (supporting technical notations, > like IPA, maths, music, and now emojis!), driven by glyph requirements > everywhere (mandated in their own relevant standard) is where Unicode can > and even should promote the use of variation sequences, and definitely not > dual encoding as this was done (Unicode abandoning its most useful goal, > not resisting to the pressure of some industries: this has just created > more issues, with more difficulties to correctly and efficiently process > texts written in humane languages). > > The more Unicode evolves, the more I see that it will turn the UCS in what > the ISO 10646 attempted to do (and failed): turn the UCS into a visual > encoding, refusing to encode **efficiently** any semantic differences. And > this will become a severe problems later with the constant evolution of > humane languages. > > I press Unicode to maintain its "character encoding model" as the path to > follow, and that it should be driven by semantic goals. It has every > features needed for that : combining sequences (including CGJ because of > canonical equivalences that were needed due to roundtrip compatibility with > legacy non-UCS charsets), variation selectors (ONLY to optionally add some > *semantic* restrictions in the largely allowed variation of glyphs and > still preserve distinction between contrasting pairs, but NOT as a way to > encode non-semantic styles), and character properties to allow efficient > processing. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Nov 3 17:36:36 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 23:36:36 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: > > Unlike NFKC and NFKD, the NFLC and NFLD would be an extensible superset > based on MUTABLE character properties (this can also be "decompositions > mappings" except that once a character is added to the new property file, > they won't be removed, and can have some stability as well, where the > decision to "deprecate" old encodings can only be done if there's a new > recommandation, and that if ever this recommandation changes and is > deprecated, the previous "legacy decomposition mappings" can still be > decomposed again to the new decompositions recommanded): unlike NFKC, and > NFKD, a "legacy decomposition" is not "final" in all future versions, and a > future version may remap them by just adding new entries for the new > characters considered to be "legacy" and no longer recommended. This new > properties file would allow evolution and adaptation to humane languages, > and will allow correcting past errors in the standard. This file should > have this form: > > # deprecated codepoint(s) ; new preferred sequence ; Unicode version in > which it was deprecated > 101234 ; 101230 0300... ; 10.0 > > This file can also be used to deprecate some old variation sequences, or > some old clusters made of multiple characters that are isolately not > deprecated. > Another note: - this new decomposition mapping file for NFLC and NFLD, where NFLC is defined to be NFC(NFLD), has some stability requirements and it must be warrantied that NFD(NFLD) = NFD: the "legacy mapping forms" must be a conforming process respecting the canonical equivalences: - Unlike in the main UCD file for canonical decompositions, the decompositions listed there are not limited to map one character to one or two characters. - The first column should be given in NFC form; the NFD form may also be used, this does not change the result. It is NOT required that the 1st column is in NFKC or NFKD forms (so the decompositions previously recommanded by a "compatibility mapping" in the main UCD can be ignored: it was just a suggestion and a requirement only for NFKC and NFKD). This allows NFLC and NFLD to correct past errors in the frozen permanently NFKC and NFKD decompositions. - the mapping done here is permanent but versioned (by the first version of Unicode deprecating a character or sequence). Being permanent means that the deprecation cannot be removed, but it can still be changed if the target string (preferably listed in NFC form) contains some newly deprecated characters (that will be added separately. - if the target of the mapping contains other deprecated characters or sequences (added to the same file), the decompositions listed there becomes recursive: a derived datafile can be produced listing only the new recommended mappings. 
- if a source string "SATB" is canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will not just decompose "SATB" into NFD("XTB"), but will also decompose "SBTA" into NBT("XBT"). - if a source string "SATB" is NOT canonically equivalent to "SBTA", and "SA" is listed as a legacy sequence mapped to be replaced by "X" in this file, then the NFLD process will not decompose "SATB" into NFD("XTB"), but will not automatically decompose "SBTA" into NBT("XBT") Then the CLDR project can use NFL(C/D) as a better source for deriving collation elements (in the DUCET or root locale) instead of NFK(C/D) which will follow the new recommandations and will correctly adapt the collation orders for legacy encodings. Tailored collations (per-locale) are not required to use compatibility mappings in the main UCD file, or in this file, they'll use it only if they are based on the DUCET or the collation order of the "root" locale. For that purpose, tailored collations may specify an alternate set of "compatibility or legacy mappings" (to apply after NFC or NFD normalization which is still required). May be the CLDR projects would like to have these derived collation elements to be orderable (so that it can infer and order the new relative weights needed for ordering strings containing "legacy characters") but it may require another column in the legacy mappings datafile (in my opinion the "Unicode version" field already offers by default a suitable relative ordering) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 17:38:24 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 3 Nov 2018 23:38:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: Le sam. 3 nov. 2018 ? 23:36, Philippe Verdy a ?crit : > - this new decomposition mapping file for NFLC and NFLD, where NFLC is >> defined to be NFC(NFLD), has some stability requirements and it must be >> warrantied that NFD(NFLD) = NFD >> > Oops! fix my typo: it must be warrantied that NFD(NFLD) = NFLD -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 17:50:52 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 3 Nov 2018 22:50:52 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> When the topic being discussed no longer matches the thread title, somebody should start a new thread with an appropriate thread title. 
From unicode at unicode.org Sat Nov 3 18:05:39 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 00:05:39 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> <86lg6djlpz.fsf_-_@mimuw.edu.pl> <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: It should be noted that the algorithmic complexity for this NFLD normalization ("legacy") is exactly the same as for NFKD ("compatibility"). However NFLD is versioned (like also NFLC), so NFLD can take a second parameter: the maximum Unicode version which can be used to filter which decomposition mappings are usable (they indicate the first minimal version where the mapping applies). It is even possible to allow a "legacy" normalization to be changed in a later version for the same source string: # deprecated codepoint(s) ; new preferred sequence ; Unicode version in which it was deprecated 101234 ; 101230 0300... ; 10.0 101234 ; 101240 0301... ; 11.0 It is also possible to add other filters to these recommanded new encodings, for example a language (or a BCP 47 locale identifier): 101234 ; 101230 0300 ; 10.0 ; fr 101234 ; 101240 0301... ; 10.0 (here starting in the same version 10.0, the new recommandation is to replace <101234> by <101240 0301> in all languages except French (BCP47 rules) where <101230 0300> should be used instead). In that case, the NFKD normalization can be viewed as if it was an historic version of NFLD, or a specialisation of NFLD for a "compatibility locale" (using "u-nfk" as a BCP 47 locale identifier???), independant of the unicode version (you can specify any version in the parameters of the NFLD or NFLC functions, and the locale identifier can be set to "u-nkf"). The complete parameters for NFLD (or NFLC) are : NFLD(text, version, locale) -> returns a text in NFD form NFLC(text, version, locale) -> returns a text in NFC form The default version is the latest supported version of Unicode, the default locale is "root" (in CLDR) or the same as the DUCET in Unicode, but should not be "u-nfk". And so: NFKD(text) = NFLD(text, 8.0, "u-nfk") = NFLD(text, 12.0, "u-nfk") = NFLD(text, "u-nfk") = NFD(NFLD(text, "u-nfk")) NFKC(text) = NFLC(text, 8.0, "u-nfk") = NFLC(text, 12.0, "u-nfk") = NFLC(text, "u-nfk") = NFC(NFLC(text, "u-nfk")) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 19:03:30 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 01:03:30 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > I was replying not about the notational repreentation of the DUCET data > table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. 
> Which remains highly confusive, and contains completely unnecesary steps, > and just complicates things with absoiluytely no benefit at all by > introducing confusion about these "0000". > > Sorry, Philippe, but the confusion that I am seeing introduced is what you > are introducing to the unicode list in the course of this discussion. > > > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elements depends on that > definition. And see the tables in Section 3.2, which also depend on those > definitions. > Ok is is a valid "weight" when taken *isolately*, but it is invalid as a weight at any level. This does not change the fact because weights are always relative to a specific level for which they are defined, and 0000 does not belong to any one. This weight is completely artificial and introduced completely needlessly: all levels are completely defined by a closed range of weights, all of them being non-0000, and all ranges being numerically separated (with the primary level using the largest range). I can reread again and again (even the sections you cite), but there's absolutely NO need of this articificial "0000" anywhere (any clause introducing it or using it to define something can be safely removed) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 3 19:46:37 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 4 Nov 2018 00:46:37 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> Message-ID: <14422e40-f80a-92ee-1ae8-441c98988393@gmail.com> Possible new thread titles include: Re: NFKD vs. NFLD (was Re: ...) Re: Man's inhumanity to humane scripts (was Re: ...) Re: Mayan and Egyptian hieroglyphs prove emoji pollute the character encoding model (was Re: ...) Re: Polynomials and the decline of western civilization (was Re: ...) From unicode at unicode.org Sat Nov 3 20:33:32 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 02:33:32 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > > On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: > > I was replying not about the notational repreentation of the DUCET data > table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. > Which remains highly confusive, and contains completely unnecesary steps, > and just complicates things with absoiluytely no benefit at all by > introducing confusion about these "0000". > > Sorry, Philippe, but the confusion that I am seeing introduced is what you > are introducing to the unicode list in the course of this discussion. 
> > > UTR#10 still does not explicitly state that its use of "0000" does not > mean it is a valid "weight", it's a notation only > > No, it is explicitly a valid weight. And it is explicitly and normatively > referred to in the specification of the algorithm. See UTS10-D8 (and > subsequent definitions), which explicitly depend on a definition of "A > collation weight whose value is zero." The entire statement of what are > primary, secondary, tertiary, etc. collation elements depends on that > definition. And see the tables in Section 3.2, which also depend on those > definitions. > > (but the notation is used for TWO distinct purposes: one is for presenting > the notation format used in the DUCET > > It is *not* just a notation format used in the DUCET -- it is part of the > normative definitional structure of the algorithm, which then percolates > down into further definitions and rules and the steps of the algorithm. > I insist that this is NOT NEEDED at all for the definition, it is absolutely NOT structural. The algorithm still guarantees the SAME result. It is ONLY used to explain the format of the DUCET and the fact the this format does NOT use 0000 as a valid weight, ans os can use it as a notation (in fact only a presentational feature). > itself to present how collation elements are structured, the other one is > for marking the presence of a possible, but not always required, encoding > of an explicit level separator for encoding sort keys). > > That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It > is not part of the *notation* for collation elements, but instead is a > magic value chosen for the level separator precisely because zero values > from the collation elements are removed during sort key construction, so > that zero is then guaranteed to be a lower value than any remaining weight > added to the sort key under construction. This part of the algorithm is not > rocket science, by the way! > Here again you make a confusion: a sort key MAY use them as separators if it wants to compress keys by reencoding weights per level: that's the only case where you may want to introduce an encoding pattern starting with 0, while the rest of the encoding for weights in that level must using patterns not starting by this 0 (the number of bits to encode this 0 does not matter: it is only part of the encoding used on this level which does not necessarily have to use 16-bit code units per weight. > > Even the example tables can be made without using these "0000" (for > example in tables showing how to build sort keys, it can present the list > of weights splitted in separate columns, one column per level, without any > "0000". The implementation does not necessarily have to create a buffer > containing all weight values in a row, when separate buffers for each level > is far superior (and even more efficient as it can save space in memory). > > The UCA doesn't *require* you to do anything particular in your own > implementation, other than come up with the same results for string > comparisons. > Yes I know, but the algorithm also does not require me to use these invalid 0000 pseudo-weights, that the algorithm itself will always discard (in a completely needless step)! > That is clearly stated in the conformance clause of UTS #10. 
> > https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance > > The step "S3.2" in the UCA algorithm should not even be there (it is made > in favor an specific implementation which is not even efficient or optimal), > > That is a false statement. Step S3.2 is there to provide a clear statement > of the algorithm, to guarantee correct results for string comparison. > You're wrong, this statement is completely useless in all cases. There is still the correct results for string comparison without them: a string comparison can only compare valid weights for each level, it will not compare any weight past the end of the text in any one of the two compared strings, nowhere it will compare weights with one of them being 0, unless this 0 is used as a "guard value" for the end of text and your compare loop still continues scanning the longer string when the other string has already ended (this case should be detected much earlier before determineing the next collection boundary in the string and then computing its weights for each level. > Section 9 of UTS #10 provides a whole lunch buffet of techniques that > implementations can choose from to increase the efficiency of their > implementations, as they deem appropriate. You are free to implement as you > choose -- including techniques that do not require any level separators. > You are, however, duly warned in: > > > https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators > > that "While this technique is relatively easy to implement, it can > interfere with other compression methods." > > it complicates the algorithm with absoluytely no benefit at all); you can > ALWAYS remove it completely and this still generates equivalent results. > > No you cannot ALWAYS remove it completely. Whether or not your > implementation can do so, depends on what other techniques you may be using > to increase performance, store shorter keys, or whatever else may be at > stake in your optimization > I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However you MAY ADD them ONLY when generating and encoding the sort keys, if the encoding used really does compress the weights into smaller values: this is the only case where you want to ADD a separator, internally only in the binary key encoder, but but as part of the algorithm itself. If your key generation does not use any compression (in the simplest implementations), then it can simply an directly concatenate all weights with the same code units size (16-bit in the DUCET), without inserting any additional 0000 code unit to separate them: your resulting sort key will still not contain any 0000 code unit in any part for any level because the algorithm already has excluded them. Finally this means that sort keys can be stored in C-strings (terminated by null code units, instead of being delimited by a separately encoded length property, but for C-strings where code units are 8-bit, i.e. "char" in C, you still need an encoder to convert the 16-bit binary weights into sequences of bytes not containing any 00 byte: if this encoder is used, still you don't need any 00 separator between encoded levels!). As all these 0000 weigths are unnecessary, then the current UCA algorithm trying to introduce them needlessly is REALLY introducing unnecessary confusion: values of weights NEVER need to be restricted. 
The only conditions that matter are that:

- all weights are *comparable* (sign does not even matter; they are not even restricted to be numbers or even just integers);
- they are **fully ordered**, and the fully ordered set of weights is not necessarily an enumerable set or a discrete set (it can be the continuous set of real numbers);
- the full set of weights is **fully partitioned** into distinct intervals (with no intersection between intervals, so intervals are also comparable);
- the highest interval is used by the weights of the primary level: each partition is numbered by the level (a positive integer between 1 and L), so you can compare the level numbers assigned to the partitions in which two weights are members: if level(weight1) > level(weight2) (a comparison of positive integers), then necessarily weight1 < weight2 (this is only comparing weights encoded arbitrarily, which can still use a 0 value if you wish to use it to encode a valid weight for a valid collation element at any level 1 to N; this is also the only condition needed to respect rule WF2 in UCA).

---

Notes about encodings for weights in sort keys:

If weights are chosen to be rational numbers, e.g. any rational numbers in the open interval (0.0, 1.0), then, because your collation algorithm will only recognize a finite set of distinct collation elements with necessarily a finite number N of distinct weights w(i), for i in 0..(N-1), the collation weights can be represented by choosing them **arbitrarily** within this open interval:

- this can be done simply by partitioning (0.0, 1.0) into N half-open intervals [w(i), w(i+1));
- and then encoding a weight w(i) by any **arbitrarily chosen rational** inside one of these intervals (for example, this is how compression with arithmetic coding can be applied).

A weight encoding using a finite discrete set (of binary integers between 0 and M-1) is what you need to use classic Huffman coding: this is equivalent to multiplying the previous rationals by M and truncating them to the floor integer, but as this limits the choice of the rational numbers above so that distinct weights remain distinct in the binary encoding, you need to keep more significant bits with Huffman coding than with arithmetic coding (i.e. you need a higher value of M, where M is typically a power of 2 when using 1-bit code units, or a power of 256 for the simpler encodings using 8-bit code units, or a power of 65536 for an uncompressed encoding of 16-bit weight values).

Arithmetic coding is in fact equivalent to Huffman coding, except that M is not necessarily a positive integer but can be any positive rational, and it can then represent each weight value with a rational number of bits on average, instead of a static integer number of bits. You can say as well that Huffman coding is a restriction of arithmetic coding where M must be an integer, or that arithmetic coding is a generalization of Huffman coding. Both Huffman and arithmetic coding are well-known examples of "prefix coding" (the latter offering a bit more compression for the same statistical distribution of encoded values).

The open interval (0.0, w(0)) is still not used at all to encode weights, but it can still have a statistical distribution, usable with the prefix encoding to represent the end of the string. But here again this does not represent the artificial 0000 weight, which is NEVER encoded anywhere.
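To make the point of contention easier to follow, here is a toy version of the sort-key step as UTS #10 section 7.3 describes it (non-zero weights appended level by level, zero weights skipped, and an optional level separator between levels). The three-entry weight table is invented for the example and is not DUCET data:

    # Toy collation elements: (primary, secondary, tertiary) per character.
    # The values are made up; real weights come from the DUCET / CLDR root.
    WEIGHTS = {
        "a": (0x1C47, 0x0020, 0x0002),
        "b": (0x1C60, 0x0020, 0x0002),
        "\u0301": (0x0000, 0x0024, 0x0002),   # combining acute: no primary weight
    }

    def sort_key(text, level_separator=True):
        elements = [WEIGHTS[ch] for ch in text]
        key = []
        for level in range(3):
            if level > 0 and level_separator:
                key.append(0x0000)        # the level separator being debated
            # Zero weights are simply skipped when the key is formed.
            key.extend(e[level] for e in elements if e[level] != 0)
        return key

    def fmt(key):
        return " ".join(f"{w:04X}" for w in key)

    print(fmt(sort_key("ab")))        # 1C47 1C60 0000 0020 0020 0000 0002 0002
    print(fmt(sort_key("a\u0301")))   # 1C47 0000 0020 0024 0000 0002 0002
    print(fmt(sort_key("ab", level_separator=False)))  # 1C47 1C60 0020 0020 0002 0002

Whether the 0000 separators can be omitted is exactly the technique discussed under "Eliminating level separators" in section 9 of UTS #10, with the caveat quoted above.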
--- Ask to a mathematician you trust, he will confirm that these rules speaking about the pseudo-weight 0000 in UCA are completely unnecessary (i.e. removing them from the algorithm does not change the result for comparing strings, or for generating sort keys) And as a conclusion, attempting to introduce them in the standard creates more confusion than it helps (in fact it is most probably a relict of a former bogous *implementation*, that still relied on them because other well-formness conditions were not satistified, or not well defined in the earlier attempts to define the UCA...). That this is not even needed for computing "composite weights" (which is not defining new weights, but an attempt to encode them in a larger space: this can be done completely outside the standard algorithm itself: just allow weights to be rational numbers, it is then easy to extend the number of encodable weights as a single number without increasing the numeric range in which they are defined; then leave the encoder of the sort key generator store them with a convenient "prefix coding", using one or more code units of arbitrary length). Philippe. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 02:24:57 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 09:24:57 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: On 03/11/2018 23:50, James Kass via Unicode wrote: > > When the topic being discussed no longer matches the thread title, > somebody should start a new thread with an appropriate thread title. > Yes, that is what also the OP called for, but my last reply though taking me some time to write was sent without checking the new mail, so unfortunately it didn?t acknowledge. So let?s start this new thread to account for Philippe Verdy?s proposal to encode a new format control. But all what I can add so far prior to probably stepping out of this discussion is that the industry does not seem to be interested in this initiative. Why do I think so? As already discussed on this List, even the long-existing FRACTION SLASH U+2044 has not been implemented by major vendors, except that HarfBuzz does implement it and makes its specified behavior available in environments using HarfBuzz, among which some major vendors? products are actually available with HarfBuzz support. As a result, the Polish abbreviation of Magister as found on the postcard, and all other abbreviations using superscript that have been put into parallel in the parent thread, cannot be reliably encoded without using preformatted superscript, so far as the goal is a plain text backbone being in the benefit of reliable rendering support, rather than a semantic-centered coding that may be easier to parse by special applications but lacks wider industrial support. 
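Concretely, the preformatted-superscript route is already expressible in plain text today; a minimal sketch (Python; "Mʳ" is only an illustrative spelling, whatever the exact letters on the postcard were):

    import unicodedata

    # One possible plain-text spelling using a preformatted superscript:
    # "M" followed by U+02B3 MODIFIER LETTER SMALL R.
    abbrev = "M\u02B3"

    for ch in abbrev:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+02B3 carries a <super> compatibility decomposition, so a compatibility
    # fold (as used in searching or loose matching) recovers the base letter:
    print(unicodedata.normalize("NFKD", abbrev))   # -> Mr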
If nevertheless, is encoded and will gain traction, or rather reversely: if it gains traction and will be encoded (I don?t know which way around to put it, given U+2044 has been encoded but one still cannot seem to be able to call it widely implemented), I would surely add it on keyboard layouts if I will still be maintaining any in that era. Best regards, Marcel From unicode at unicode.org Sun Nov 4 02:27:05 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 4 Nov 2018 09:27:05 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: Philippe, I agree that we could have structured the UCA differently. It does make sense, for example, to have the weights be simply decimal values instead of integers. But nobody is going to go through the substantial work of restructuring the UCA spec and data file unless there is a very strong reason to do so. It takes far more time and effort than people realize to change in the algorithm/data while making sure that everything lines up without inadvertent changes being introduced. It is just not worth the effort. There are so, so, many things we can do in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher benefit. You can continue flogging this horse all you want, but I'm muting this thread (and I suspect I'm not the only one). Mark On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : > >> >> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >> >> I was replying not about the notational repreentation of the DUCET data >> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. >> Which remains highly confusive, and contains completely unnecesary steps, >> and just complicates things with absoiluytely no benefit at all by >> introducing confusion about these "0000". >> >> Sorry, Philippe, but the confusion that I am seeing introduced is what >> you are introducing to the unicode list in the course of this discussion. >> >> >> UTR#10 still does not explicitly state that its use of "0000" does not >> mean it is a valid "weight", it's a notation only >> >> No, it is explicitly a valid weight. And it is explicitly and normatively >> referred to in the specification of the algorithm. See UTS10-D8 (and >> subsequent definitions), which explicitly depend on a definition of "A >> collation weight whose value is zero." The entire statement of what are >> primary, secondary, tertiary, etc. collation elements depends on that >> definition. And see the tables in Section 3.2, which also depend on those >> definitions. >> >> (but the notation is used for TWO distinct purposes: one is for >> presenting the notation format used in the DUCET >> >> It is *not* just a notation format used in the DUCET -- it is part of the >> normative definitional structure of the algorithm, which then percolates >> down into further definitions and rules and the steps of the algorithm. >> > > I insist that this is NOT NEEDED at all for the definition, it is > absolutely NOT structural. The algorithm still guarantees the SAME result. > > It is ONLY used to explain the format of the DUCET and the fact the this > format does NOT use 0000 as a valid weight, ans os can use it as a notation > (in fact only a presentational feature). 
> > >> itself to present how collation elements are structured, the other one is >> for marking the presence of a possible, but not always required, encoding >> of an explicit level separator for encoding sort keys). >> >> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >> is not part of the *notation* for collation elements, but instead is a >> magic value chosen for the level separator precisely because zero values >> from the collation elements are removed during sort key construction, so >> that zero is then guaranteed to be a lower value than any remaining weight >> added to the sort key under construction. This part of the algorithm is not >> rocket science, by the way! >> > > Here again you make a confusion: a sort key MAY use them as separators if > it wants to compress keys by reencoding weights per level: that's the only > case where you may want to introduce an encoding pattern starting with 0, > while the rest of the encoding for weights in that level must using > patterns not starting by this 0 (the number of bits to encode this 0 does > not matter: it is only part of the encoding used on this level which does > not necessarily have to use 16-bit code units per weight. > >> >> Even the example tables can be made without using these "0000" (for >> example in tables showing how to build sort keys, it can present the list >> of weights splitted in separate columns, one column per level, without any >> "0000". The implementation does not necessarily have to create a buffer >> containing all weight values in a row, when separate buffers for each level >> is far superior (and even more efficient as it can save space in memory). >> >> The UCA doesn't *require* you to do anything particular in your own >> implementation, other than come up with the same results for string >> comparisons. >> > Yes I know, but the algorithm also does not require me to use these > invalid 0000 pseudo-weights, that the algorithm itself will always discard > (in a completely needless step)! > > >> That is clearly stated in the conformance clause of UTS #10. >> >> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance >> >> The step "S3.2" in the UCA algorithm should not even be there (it is made >> in favor an specific implementation which is not even efficient or optimal), >> >> That is a false statement. Step S3.2 is there to provide a clear >> statement of the algorithm, to guarantee correct results for string >> comparison. >> > > You're wrong, this statement is completely useless in all cases. There is > still the correct results for string comparison without them: a string > comparison can only compare valid weights for each level, it will not > compare any weight past the end of the text in any one of the two compared > strings, nowhere it will compare weights with one of them being 0, unless > this 0 is used as a "guard value" for the end of text and your compare loop > still continues scanning the longer string when the other string has > already ended (this case should be detected much earlier before > determineing the next collection boundary in the string and then computing > its weights for each level. > >> Section 9 of UTS #10 provides a whole lunch buffet of techniques that >> implementations can choose from to increase the efficiency of their >> implementations, as they deem appropriate. You are free to implement as you >> choose -- including techniques that do not require any level separators. 
>> You are, however, duly warned in: >> >> >> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators >> >> that "While this technique is relatively easy to implement, it can >> interfere with other compression methods." >> >> it complicates the algorithm with absoluytely no benefit at all); you can >> ALWAYS remove it completely and this still generates equivalent results. >> >> No you cannot ALWAYS remove it completely. Whether or not your >> implementation can do so, depends on what other techniques you may be using >> to increase performance, store shorter keys, or whatever else may be at >> stake in your optimization >> > I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However > you MAY ADD them ONLY when generating and encoding the sort keys, if the > encoding used really does compress the weights into smaller values: this is > the only case where you want to ADD a separator, internally only in the > binary key encoder, but but as part of the algorithm itself. > > If your key generation does not use any compression (in the simplest > implementations), then it can simply an directly concatenate all weights > with the same code units size (16-bit in the DUCET), without inserting any > additional 0000 code unit to separate them: your resulting sort key will > still not contain any 0000 code unit in any part for any level because the > algorithm already has excluded them. Finally this means that sort keys can > be stored in C-strings (terminated by null code units, instead of being > delimited by a separately encoded length property, but for C-strings where > code units are 8-bit, i.e. "char" in C, you still need an encoder to > convert the 16-bit binary weights into sequences of bytes not containing > any 00 byte: if this encoder is used, still you don't need any 00 separator > between encoded levels!). > > As all these 0000 weigths are unnecessary, then the current UCA algorithm > trying to introduce them needlessly is REALLY introducing unnecessary > confusion: values of weights NEVER need to be restricted. > > The only conditions that matter is that: > - all weights are *comparable* (sign does not even matter, they are not > even restricted to be numbers or even just integers) and that > - they are **fully ordered**, and that the fully ordered set of weights > (not necessarily an enumerable set or a discrete set, as this can the > continuous set of real numbers) > - and that the full set of weights is **fully partitioned** into distinct, > intervals (with no intersection between intervals, so intervals are also > comparable) > - that the highest interval will be used by weights in the primary level: > each partition is numbered (by the level: a positive integer between 1 and > L): you can compare the level numbers assigned to the partition in which > the weight is a member: if level(weight1) > level(weight2) (this is an > comparison of positive integers), then necessarily you may have weight1 < > weight2 (this is only comparing weights encoded arbitrarily and which can > still use a 0 value if you wish to use it to encode a valid weight for a > valid collation element at any level 1 to N; this is also the only > condition needed to respect rule WF2 in UCA). 
> > --- > Notes about encodings for weights in sort keys: > > If weights are chosen to be rational numbers, e.g any rational numbers in > the (0.0, 1.0) open interval, and because your collation algorithm will > only recognize a finite set of distinct collation elements with necessarily > a finite number N of distinct weights w(i), for i in 0..(N-1), allows the > collation weights to be represented by choosing them **arbitrarily** within > this open interval: > - this can be done simply by partitionning the (0.0 1.0) into N half-open > intervals [w(i), w(i+1)); > - and then encoding a weight w(i) by any **arbitrarily chosen rational** > inside one of these intervals (for example this can be done for using > compression with arithmetic coding). > > A weight encoding using a finite discrete set (of binary integers between > 0 and M-1) is what you need to use classic Huffman coding: this is > equivalent to multiplying the previous rationals and truncating them to the > nearest floor integer, but as this limits the choice of rational numbers > above so that distinct weights remain distinct with the binary encoding, > you need to keep more significant bits with Huffman coding than with > Arithmetic coding (i.e. you need a higher value of M; where M is typically > a power of 2 using 1-bit code units, or power of 256 for the simpler > encodings using 8-bit code units, or a power of 65536 for an uncompressed > encoding of 16-bit weight values). > > Arithmetic coding is in fact equivalent to Huffman coding, except that M > is not necessarily a positive integer but can be any positive rational and > can then represent each weigh value with a rational number of bits on > average, instead of a static integer number of bits. You can say as well > that Huffman coding is a restriction of Arithmetic coding where M must be > an integer, or that Arithmetic coding is a generalization of Huffman coding. > > Both the Huffman and Arithmetic codings are wellknown examples of "prefix > coding" (the latter offering a bit more compression, for the same > statistical distribution of encoded values). The open interval (0.0, w(0)) > is still not used at all to encode weights, but can still have a statistic > distribution, usable with the prefix encoding to represent the end of > string. But here again this does not represent the artificial 0000 weight > which is NEVER encoded anywhere. > > --- > > Ask to a mathematician you trust, he will confirm that these rules > speaking about the pseudo-weight 0000 in UCA are completely unnecessary > (i.e. removing them from the algorithm does not change the result for > comparing strings, or for generating sort keys) > And as a conclusion, attempting to introduce them in the standard creates > more confusion than it helps (in fact it is most probably a relict of a > former bogous *implementation*, that still relied on them because other > well-formness conditions were not satistified, or not well defined in the > earlier attempts to define the UCA...). 
That this is not even needed for > computing "composite weights" (which is not defining new weights, but an > attempt to encode them in a larger space: this can be done completely > outside the standard algorithm itself: just allow weights to be rational > numbers, it is then easy to extend the number of encodable weights as a > single number without increasing the numeric range in which they are > defined; then leave the encoder of the sort key generator store them with a > convenient "prefix coding", using one or more code units of arbitrary > length). > > Philippe. > >
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Nov 4 10:45:08 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 4 Nov 2018 17:45:08 +0100
Subject: Encoding (was: Re: A sign/abbreviation for "magister")
In-Reply-To:
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com>
Message-ID:

Note that I actually propose not just one rendering for the <combining abbreviation mark> but two possible variants (that would be equally valid, without preference). Use it after any base cluster (including with diacritics if needed, like combining underlines):

- the first one can be to render the previous cluster as superscript (very easy to implement synthetically by any text renderer);
- the second one can be to render it as an abbreviation dot (also very easy to do).

Fonts can provide their own mapping (e.g. to offer alternate glyph forms or kerning for the superscript; they can also reuse the letter forms used for other existing and encoded superscript letters, or position the abbreviation dot with negative kerning, for example after a T), in which case the renderer does not have to synthesize the rendering for the combining sequence not mapped in the font.

Allowing this variation from the start will:

- allow renderers to support it fast (so a rapid adoption for encoding texts in human languages, instead of the few legacy superscript letters);
- allow font designers to develop and provide reasonable mappings if needed (to adjust the position or size of the superscript) in updated fonts (no requirement for them to add new glyphs if it's just to map the same glyphs used by existing superscript letters);
- also prohibit the abuse of this mark for every text that one would want to write in superscript (these cases can still use the few existing superscript letters/digits/signs that are already encoded), so it is not suitable for example for marking mathematical exponents (e.g. "x²": if it were encoded as <2 + combining abbreviation mark>, it could validly be rendered as "x2."): exponents must use superscript (either the already encoded superscript characters, or external styles like in HTML/CSS, or LaTeX, which uses the notation "x^2"), both as a style but also with the intended semantic of an exponent, and certainly not the intended semantic of an abbreviation.

Le dim. 4 nov. 2018 à 09:34, Marcel Schneider via Unicode < unicode at unicode.org> a écrit :

> On 03/11/2018 23:50, James Kass via Unicode wrote:
> >
> > When the topic being discussed no longer matches the thread title,
> > somebody should start a new thread with an appropriate thread title.
> > > > Yes, that is what also the OP called for, but my last reply though > taking me some time to write was sent without checking the new mail, > so unfortunately it didn?t acknowledge. So let?s start this new thread > to account for Philippe Verdy?s proposal to encode a new format control. > > But all what I can add so far prior to probably stepping out of this > discussion is that the industry does not seem to be interested in this > initiative. Why do I think so? As already discussed on this List, even > the long-existing FRACTION SLASH U+2044 has not been implemented by > major vendors, except that HarfBuzz does implement it and makes its > specified behavior available in environments using HarfBuzz, among > which some major vendors? products are actually available with > HarfBuzz support. > > As a result, the Polish abbreviation of Magister as found on the > postcard, and all other abbreviations using superscript that have > been put into parallel in the parent thread, cannot be reliably > encoded without using preformatted superscript, so far as the goal > is a plain text backbone being in the benefit of reliable rendering > support, rather than a semantic-centered coding that may be easier > to parse by special applications but lacks wider industrial support. > > If nevertheless, is encoded and will > gain traction, or rather reversely: if it gains traction and will be > encoded (I don?t know which way around to put it, given U+2044 has > been encoded but one still cannot seem to be able to call it widely > implemented), I would surely add it on keyboard layouts if I will > still be maintaining any in that era. > > Best regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 11:34:29 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 18:34:29 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> On 04/11/2018 17:45, Philippe Verdy wrote: > > Note that I actually propose not just one rendering for the > but two possible variants (that would > be equally valid withou preference). Use it after any base cluster > (including with diacritics if needed, like combining underlines). > > - the first one can be to render the previous cluster as superscript > (very easy to do implement synthetically by any text renderer) > > - the second one can be to render it as an abbreviation dot (also > very easy to) > > Fonts can provide their own mapping (e.g. to offer alternate glyph > forms or kerning for the superscript, they can also reuse the leter > forms used for other existing and encoded superscript letters, or to > position the abbreviation dot with negative kerning, for example > after a T), in which case the renderer does not have to synthetize > the rendering for the sequence combining sequence not mapped in the > font. > > Allowing this variation from the start will: > > - allow renderers to support it fast (so a rapid adoption for > encoding texts in humane languages, instead of the few legacy > superscript letters). 
> > - allow font designers to develop and provide reasonnable mappings if > needed (to adjust the position or size of the superscript) in updated > fonts (no requirement for them to add new glyphs if it's just to map > the same glyphs used by existing superscript letters) > > - also prohibit the abuse of this mark for every text that one would > would to write in superscript (these cases can still uses the few > existing superscript letters/digits/signs that are already encoded), > so this is not suitable for example for marking mathematical > exponents (e.g. "x?", if it's encoded as mark> could validly be rendered as "x2."): exponents must use the > superscript (either the already encoded ones, or using external > styles like in HTML/CSS, or in LaTeX which uses the notation "x^2", > both as a style, but also some intended semantic of an exponent and > certainly not the intended semantic of an abbreviation) Unicode always (or in principle) aims at polyvalence, making characters reusable and repurposable, while the combining abbreviation mark does not solve the problems around making chemicals better represented in plain text as seen in the parent thread, for example. I don?t advocate this use case, as I?m only lobbying for natural languages? support as specified in the Standard,* but it shouldn?t be forgotten given there is some point in not disfavoring chemistry compared to mathematics, that is already widely favored over chemistry when looking at the symbol blocks, while chemistry is denied three characters because they are subscript forms of already encoded letters. Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it needs OpenType support to work, while direct encoding of preformatted superscripts and use as abbreviation indicators for an interoperable digital representation of natural languages does not. Best regards, Marcel * As already repeatedly stated, I?m taking the one bit where TUS states that all natural languages shall be given a semantically unambiguous (ie not introducing new ambiguity) and interoperable digital representation. From unicode at unicode.org Sun Nov 4 11:42:22 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 4 Nov 2018 18:42:22 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> Message-ID: Sorry, I didn?t truncate the subject line, it was my mail client. On 04/11/2018 17:45, Philippe Verdy wrote: > > Note that I actually propose not just one rendering for the > but two possible variants (that would > be equally valid withou preference). Use it after any base cluster > (including with diacritics if needed, like combining underlines). > > - the first one can be to render the previous cluster as superscript > (very easy to do implement synthetically by any text renderer) > > - the second one can be to render it as an abbreviation dot (also > very easy to) > > Fonts can provide their own mapping (e.g. 
to offer alternate glyph > forms or kerning for the superscript, they can also reuse the leter > forms used for other existing and encoded superscript letters, or to > position the abbreviation dot with negative kerning, for example > after a T), in which case the renderer does not have to synthetize > the rendering for the sequence combining sequence not mapped in the > font. > > Allowing this variation from the start will: > > - allow renderers to support it fast (so a rapid adoption for > encoding texts in humane languages, instead of the few legacy > superscript letters). > > - allow font designers to develop and provide reasonnable mappings if > needed (to adjust the position or size of the superscript) in updated > fonts (no requirement for them to add new glyphs if it's just to map > the same glyphs used by existing superscript letters) > > - also prohibit the abuse of this mark for every text that one would > would to write in superscript (these cases can still uses the few > existing superscript letters/digits/signs that are already encoded), > so this is not suitable for example for marking mathematical > exponents (e.g. "x?", if it's encoded as mark> could validly be rendered as "x2."): exponents must use the > superscript (either the already encoded ones, or using external > styles like in HTML/CSS, or in LaTeX which uses the notation "x^2", > both as a style, but also some intended semantic of an exponent and > certainly not the intended semantic of an abbreviation) Unicode always (or in principle) aims at polyvalence, making characters reusable and repurposable, while the combining abbreviation mark does not solve the problems around making chemicals better represented in plain text as seen in the parent thread, for example. I don?t advocate this use case, as I?m only lobbying for natural languages? support as specified in the Standard,* but it shouldn?t be forgotten given there is some point in not disfavoring chemistry compared to mathematics, that is already widely favored over chemistry when looking at the symbol blocks, while chemistry is denied three characters because they are subscript forms of already encoded letters. Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it needs OpenType support to work, while direct encoding of preformatted superscripts and use as abbreviation indicators for an interoperable digital representation of natural languages does not. Best regards, Marcel * As already repeatedly stated, I?m taking the one bit where TUS states that all natural languages shall be given a semantically unambiguous (ie not introducing new ambiguity) and interoperable digital representation. From unicode at unicode.org Sun Nov 4 12:54:37 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 19:54:37 +0100 Subject: Encoding In-Reply-To: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Le dim. 4 nov. 2018 ? 
18:34, Marcel Schneider a ?crit : > On 04/11/2018 17:45, Philippe Verdy wrote: > Beyond that, the problem with *COMBINING ABBREVIATION MARK is that it > needs OpenType support to work, while direct encoding of preformatted > superscripts and use as abbreviation indicators for an interoperable > digital representation of natural languages does not. > No OpenScript is required. I already propose that a correct rendering of this mark is a simple dot added to the right of the cluster (if this cluster is LTR) or to the left (if the cluster is RTL). It just has to convey the fact that it occurs to mean an abbreviation. The mark to render (when not rendering the superscript) is left to each font design (a font made for another script than Latin, Greek, Cyrillic can use another convenient abbreviation mark suitable for that script and that avoids the confusion with other dot-like combining marks used in that script, and it may be placed elsewhere than to the right or left of the cluster that it modifies) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 13:19:55 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 20:19:55 +0100 Subject: Encoding In-Reply-To: <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a ?crit : > On 04/11/2018 17:45, Philippe Verdy wrote: > Marcel > * As already repeatedly stated, I?m taking the one bit where TUS states > that all natural languages shall be given a semantically unambiguous (ie > not introducing new ambiguity) and interoperable digital representation. > I also support the sermantically unambiguous digital representation of all natural languages. Interoperability is always limited, even for existing script (including Latin), that's why text renderers (and fonts) constantly need new developments (but that does not need that these developments will be deployed). That's why we have to document reasonnable fallbacks for rendering on limited platforms, each time this is possible (and in this case this is clearly possible with extremely low efforts). Even the mere fallback to render the as a dotted circle (total absence of support) will not block completely reading the abbreviation: * you'll see "2e?" (which is still better than only "2e", with minimal impact) instead of * "2?" (which is worse ! this is still what already happens when you use the legacy encoded which is also semantically ambiguous for text processing), or * "2e." (which is acceptable for rendering but ambiguous semantically for text processing) So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE than using (which is also impossible for noting all abbrevations as it is limited to just a few letters, and most of the time limited to only the few lowercase IPA symbols). It puts an end to the pressure to encode superscript letters. If you want to support other notations (e.g. 
in chemical or mathematics notations, where both superscript and subscript must be present and stack together, and where the allowed varaition using a dot or similar) you need another encoding and the existing legacy are not suitable as well. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 13:51:33 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 20:51:33 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: Note also that some other scripts have their own dedicated "abbreviation mark" encoded, but as distinctive punctuations or modifier letters: they are NOT combining. I do not advocate changing these scripts at all. As well I don't propose to instruct authors to use an after Latin/Greek/Letters/Arabic/Hebrew letters used in abbreviations. This would be non-sense, including visually, even if you can infer some semantics, as meaning this is effectively an abbreviation for text processing (this is still non-senses because this breaks existing segregations of scripts, delimitation of clusters, line breaking opportunities, and so on; and this approach would break because these can legally occur in isolation, without being necessarily attached to the previous cluster to modify it: the previous cluster, before the could be for example a whitespace, or a quotation mark) I don't propose the as being suitable for mathematics exponents and Chemical notations (they still need something else to allow their superscript and subscripts to stack below each other, and the variation of explicitly permitting it to be rendered as a dot or another suitable mark, depending on the base character of the combining sequence, is NOT suitable for these mathematics or chemical notations). Once again you need something else for these technical notations, but NOT the proposed , and NOT EVEN the existing "modifier letters" , which were in fact first introduced only for IPA lowercase symbols, with some of them being then turned as "plain lowercase letters" in alphabets of some natural languages that have been recently romanized by borrowing IPA symbols (notably in Africa, where the initial letters borrowed from IPA, or some new specific letter variants with additional hooks, opening or strokes, were then followed by the addition of separate capital letters: these letters are NOT conveying any semantic of an abbreviation, and this is also NOT the case for their usage as IPA symbols). There's NO interoperability at all when taking **abusively** the existing "modifier letters" or for use in abbreviations (or even in technical notations in maths or chemical formulas, where they DON'T work the way they should when used with subscripts, and cannot represent multiple layers of subscripts, e.g. for expressions like "2^2^2" in LaTeX for maths). Keep these "modifier letters" or or for use as plain letters or plain digits or plain punctuation or plain symbols (including IPA) in natural languages. Anything else is abusive ans hould be considered only as "legacy" encoding, not recommended at all in natural languages. Le dim. 4 nov. 2018 ? 
20:19, Philippe Verdy a ?crit : > > > Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a > ?crit : > >> On 04/11/2018 17:45, Philippe Verdy wrote: >> Marcel >> * As already repeatedly stated, I?m taking the one bit where TUS states >> that all natural languages shall be given a semantically unambiguous (ie >> not introducing new ambiguity) and interoperable digital representation. >> > > I also support the sermantically unambiguous digital representation of all > natural languages. > Interoperability is always limited, even for existing script (including > Latin), that's why text renderers (and fonts) constantly need new > developments (but that does not need that these developments will be > deployed). > That's why we have to document reasonnable fallbacks for rendering on > limited platforms, each time this is possible (and in this case this is > clearly possible with extremely low efforts). > > Even the mere fallback to render the as a > dotted circle (total absence of support) will not block completely reading > the abbreviation: > * you'll see "2e?" (which is still better than only "2e", with minimal > impact) instead of > * "2?" (which is worse ! this is still what already happens when you use > the legacy encoded which is also semantically ambiguous for > text processing), or > * "2e." (which is acceptable for rendering but ambiguous semantically for > text processing) > > So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE > than using (which is also impossible for > noting all abbrevations as it is limited to just a few letters, and most of > the time limited to only the few lowercase IPA symbols). It puts an end to > the pressure to encode superscript letters. > > If you want to support other notations (e.g. in chemical or > mathematics notations, where both superscript and subscript must be present > and stack together, and where the allowed varaition using a dot or similar) > you need another encoding and the existing legacy letters> are not suitable as well. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 14:59:08 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 21:59:08 +0100 Subject: Encoding In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com> <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com> <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr> <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr> <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com> <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr> Message-ID: I can take another example about what I call "legacy encoding" (which really means that such encoding is just an "approximation" from which no semantic can be clearly infered, except by using a non-determinist heuristic, which can frequently make "false guesses"). 
Consider the case of the legacy Hangul "half-width" jamos: they were kept in Unicode (as compatibility characters) but are not recommended for encoding natural Korean text, because their semantics are not clear when they are used in sequences: it is impossible to know clearly where semantically significant syllable breaks occur, because they do not distinguish the "leading" and "trailing" consonants, and so it is not even possible to clearly infer that a Hangul "half-width" vowel jamo is logically attached to the same syllable as the "half-width" consonant (or consonant+vowel) jamo encoded just before it. As a consequence, you cannot safely convert Korean texts using these "half-width" jamos into normal jamos: only a heuristic can attempt to determine the syllable breaks and then infer the "leading" or "trailing" semantic of consonants. This last semantic ("leading" or "trailing") is exactly like a letter-case distinction in Latin, so it can be said that the Korean alphabet is bicameral for consonants but monocameral for vowels, where each Hangul syllable normally starts with an "uppercase-like" consonant, or with a consonant filler which is also "uppercase-like", and all other consonants and all vowels are "lowercase-like". The heuristic that transforms the legacy "half-width" jamos into normal jamos does just the same thing as the heuristic used in Latin that attempts to capitalize some leading letters in words: it works frequently, but it also fails, and that heuristic is lossy in Latin just as it is lossy in Korean!

The same can be said about the heuristics that attempt to infer an abbreviation semantic from existing superscript letters (either encoded in Unicode, or encoded as plain letters modified by superscripting style in CSS or HTML, or in word processors for example): they fail to give the correct guess much of the time if there is no user to confirm the actual intended meaning.
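A minimal sketch of the half-width case (my own toy example using Python's unicodedata, nothing normative): compatibility normalization maps every half-width consonant to a leading (choseong) conjoining jamo, so it always guesses "leading", and recovering a trailing consonant is exactly the lossy heuristic step described above.

    import unicodedata as ud

    # The syllable GAN spelled with half-width jamo KIYEOK, A, NIEUN; nothing
    # in this sequence says whether NIEUN closes this syllable or opens the
    # next one.
    halfwidth = "\uFFA1\uFFC2\uFFA4"

    # NFKC maps each half-width consonant to a *leading* conjoining jamo, so
    # it yields GA plus a dangling leading NIEUN instead of the syllable GAN:
    print([f"U+{ord(c):04X}" for c in ud.normalize("NFKC", halfwidth)])
    # ['U+AC00', 'U+1102']

    # Only conjoining jamos distinguish leading from trailing; choosing
    # between them is the heuristic (and sometimes wrong) guess:
    print(ud.normalize("NFC", "\u1100\u1161\u11AB"))  # one syllable, NIEUN read as trailing
    print(ud.normalize("NFC", "\u1100\u1161\u1102"))  # GA plus a leading NIEUN of a next syllable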
In all cases, the user/author has full control of the intended meaning of his text and an informed decision is made where all cases are now distinguished. "Legacy" encoding can be kept as is (in Unicode), even if it's no longer recommended, just like Unicode has documented that half-width Hangul is deprecated (it just offers a "compatibility decomposition" for NFKD or NFKC, but this is lossy and cannot be done automatically without a human decision). And the user/author can now freely and easily compose any abbreviation he wishes in natural languages, without being limited by the reduced "legacy" set of encoded in Unicode (which should no longer be extended, except for use as distinct plain letters needed in alphabets of actual natural languages, or as possibly new IPA symbols), and without using the styling tricks (of HTML/CSS, or of word processor documents, spreadsheets, presentation documents allowing "'rich text" formats on top of "plain text") which are best suitable for "free styling" of any human text, without any additional semantics, (or as a legacy but insufficient trick for maths and chemical notations). Le dim. 4 nov. 2018 ? 20:51, Philippe Verdy a ?crit : > Note also that some other scripts have their own dedicated "abbreviation > mark" encoded, but as distinctive punctuations or modifier letters: they > are NOT combining. I do not advocate changing these scripts at all. > > As well I don't propose to instruct authors to use an mark> after Latin/Greek/Letters/Arabic/Hebrew letters used in > abbreviations. This would be non-sense, including visually, even if you can > infer some semantics, as meaning this is effectively an abbreviation for > text processing (this is still non-senses because this breaks existing > segregations of scripts, delimitation of clusters, line breaking > opportunities, and so on; and this approach would break because these > can legally occur in isolation, without being > necessarily attached to the previous cluster to modify it: the previous > cluster, before the could be for example a > whitespace, or a quotation mark) > > I don't propose the as being suitable for > mathematics exponents and Chemical notations (they still need something > else to allow their superscript and subscripts to stack below each other, > and the variation of explicitly permitting it > to be rendered as a dot or another suitable mark, depending on the base > character of the combining sequence, is NOT suitable for these mathematics > or chemical notations). > > Once again you need something else for these technical notations, but NOT > the proposed , and NOT EVEN the existing > "modifier letters" , which were in fact first > introduced only for IPA lowercase symbols, with some of them being then > turned as "plain lowercase letters" in alphabets of some natural languages > that have been recently romanized by borrowing IPA symbols (notably in > Africa, where the initial letters borrowed from IPA, or some new specific > letter variants with additional hooks, opening or strokes, were then > followed by the addition of separate capital letters: these letters are NOT > conveying any semantic of an abbreviation, and this is also NOT the case > for their usage as IPA symbols). 
> > There's NO interoperability at all when taking **abusively** the existing > "modifier letters" or for use in > abbreviations (or even in technical notations in maths or chemical > formulas, where they DON'T work the way they should when used with > subscripts, and cannot represent multiple layers of subscripts, e.g. for > expressions like "2^2^2" in LaTeX for maths). Keep these "modifier letters" > or or for use as plain > letters or plain digits or plain punctuation or plain symbols (including > IPA) in natural languages. Anything else is abusive ans hould be considered > only as "legacy" encoding, not recommended at all in natural languages. > > > > Le dim. 4 nov. 2018 ? 20:19, Philippe Verdy a ?crit : > >> >> >> Le dim. 4 nov. 2018 ? 18:34, Marcel Schneider a >> ?crit : >> >>> On 04/11/2018 17:45, Philippe Verdy wrote: >>> Marcel >>> * As already repeatedly stated, I?m taking the one bit where TUS states >>> that all natural languages shall be given a semantically unambiguous (ie >>> not introducing new ambiguity) and interoperable digital representation. >>> >> >> I also support the sermantically unambiguous digital representation of >> all natural languages. >> Interoperability is always limited, even for existing script (including >> Latin), that's why text renderers (and fonts) constantly need new >> developments (but that does not need that these developments will be >> deployed). >> That's why we have to document reasonnable fallbacks for rendering on >> limited platforms, each time this is possible (and in this case this is >> clearly possible with extremely low efforts). >> >> Even the mere fallback to render the as a >> dotted circle (total absence of support) will not block completely reading >> the abbreviation: >> * you'll see "2e?" (which is still better than only "2e", with minimal >> impact) instead of >> * "2?" (which is worse ! this is still what already happens when you use >> the legacy encoded which is also semantically ambiguous for >> text processing), or >> * "2e." (which is acceptable for rendering but ambiguous semantically for >> text processing) >> >> So compare things faily: the solution I propose is EVEN MOREINTEROPERABLE >> than using (which is also impossible for >> noting all abbrevations as it is limited to just a few letters, and most of >> the time limited to only the few lowercase IPA symbols). It puts an end to >> the pressure to encode superscript letters. >> >> If you want to support other notations (e.g. in chemical or >> mathematics notations, where both superscript and subscript must be present >> and stack together, and where the allowed varaition using a dot or similar) >> you need another encoding and the existing legacy > letters> are not suitable as well. >> >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 4 15:51:24 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 4 Nov 2018 22:51:24 +0100 Subject: UCA unnecessary collation weight 0000 In-Reply-To: References: <268fedd9-25b6-2957-ffa8-ede11495451c@att.net> Message-ID: So you finally admit that I was right... And that the specs include requirements that are not even needed to make UCA work, and that not even used by wellknown implementations. These are old artefacts which are now really confusive (instructing programmers to adopt the old deprecated behavior, before realizing that this was a bad advice which jut complicated their task). 
UCA can be implemented **conformingly** without these, even for the simplest implementations (where using complex packages like ICU is not an option and rewriting it is not one as well for much simpler goals) where these incorrect requirements are in fact suggesting to be more inefficient than really needed. There's not a lot of work to edit and to fix the specs without these polluting 0000 "pseudo-weights". Le dim. 4 nov. 2018 ? 09:27, Mark Davis ?? a ?crit : > Philippe, I agree that we could have structured the UCA differently. It > does make sense, for example, to have the weights be simply decimal values > instead of integers. But nobody is going to go through the substantial > work of restructuring the UCA spec and data file unless there is a very > strong reason to do so. It takes far more time and effort than people > realize to change in the algorithm/data while making sure that everything > lines up without inadvertent changes being introduced. > > It is just not worth the effort. There are so, so, many things we can do > in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher > benefit. > > You can continue flogging this horse all you want, but I'm muting this > thread (and I suspect I'm not the only one). > > Mark > > > On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> Le ven. 2 nov. 2018 ? 22:27, Ken Whistler a ?crit : >> >>> >>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote: >>> >>> I was replying not about the notational repreentation of the DUCET data >>> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself. >>> Which remains highly confusive, and contains completely unnecesary steps, >>> and just complicates things with absoiluytely no benefit at all by >>> introducing confusion about these "0000". >>> >>> Sorry, Philippe, but the confusion that I am seeing introduced is what >>> you are introducing to the unicode list in the course of this discussion. >>> >>> >>> UTR#10 still does not explicitly state that its use of "0000" does not >>> mean it is a valid "weight", it's a notation only >>> >>> No, it is explicitly a valid weight. And it is explicitly and >>> normatively referred to in the specification of the algorithm. See UTS10-D8 >>> (and subsequent definitions), which explicitly depend on a definition of "A >>> collation weight whose value is zero." The entire statement of what are >>> primary, secondary, tertiary, etc. collation elements depends on that >>> definition. And see the tables in Section 3.2, which also depend on those >>> definitions. >>> >>> (but the notation is used for TWO distinct purposes: one is for >>> presenting the notation format used in the DUCET >>> >>> It is *not* just a notation format used in the DUCET -- it is part of >>> the normative definitional structure of the algorithm, which then >>> percolates down into further definitions and rules and the steps of the >>> algorithm. >>> >> >> I insist that this is NOT NEEDED at all for the definition, it is >> absolutely NOT structural. The algorithm still guarantees the SAME result. >> >> It is ONLY used to explain the format of the DUCET and the fact the this >> format does NOT use 0000 as a valid weight, ans os can use it as a notation >> (in fact only a presentational feature). >> >> >>> itself to present how collation elements are structured, the other one >>> is for marking the presence of a possible, but not always required, >>> encoding of an explicit level separator for encoding sort keys). 
>>> >>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It >>> is not part of the *notation* for collation elements, but instead is a >>> magic value chosen for the level separator precisely because zero values >>> from the collation elements are removed during sort key construction, so >>> that zero is then guaranteed to be a lower value than any remaining weight >>> added to the sort key under construction. This part of the algorithm is not >>> rocket science, by the way! >>> >> >> Here again you make a confusion: a sort key MAY use them as separators if >> it wants to compress keys by reencoding weights per level: that's the only >> case where you may want to introduce an encoding pattern starting with 0, >> while the rest of the encoding for weights in that level must using >> patterns not starting by this 0 (the number of bits to encode this 0 does >> not matter: it is only part of the encoding used on this level which does >> not necessarily have to use 16-bit code units per weight. >> >>> >>> Even the example tables can be made without using these "0000" (for >>> example in tables showing how to build sort keys, it can present the list >>> of weights splitted in separate columns, one column per level, without any >>> "0000". The implementation does not necessarily have to create a buffer >>> containing all weight values in a row, when separate buffers for each level >>> is far superior (and even more efficient as it can save space in memory). >>> >>> The UCA doesn't *require* you to do anything particular in your own >>> implementation, other than come up with the same results for string >>> comparisons. >>> >> Yes I know, but the algorithm also does not require me to use these >> invalid 0000 pseudo-weights, that the algorithm itself will always discard >> (in a completely needless step)! >> >> >>> That is clearly stated in the conformance clause of UTS #10. >>> >>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance >>> >>> The step "S3.2" in the UCA algorithm should not even be there (it is >>> made in favor an specific implementation which is not even efficient or >>> optimal), >>> >>> That is a false statement. Step S3.2 is there to provide a clear >>> statement of the algorithm, to guarantee correct results for string >>> comparison. >>> >> >> You're wrong, this statement is completely useless in all cases. There is >> still the correct results for string comparison without them: a string >> comparison can only compare valid weights for each level, it will not >> compare any weight past the end of the text in any one of the two compared >> strings, nowhere it will compare weights with one of them being 0, unless >> this 0 is used as a "guard value" for the end of text and your compare loop >> still continues scanning the longer string when the other string has >> already ended (this case should be detected much earlier before >> determineing the next collection boundary in the string and then computing >> its weights for each level. >> >>> Section 9 of UTS #10 provides a whole lunch buffet of techniques that >>> implementations can choose from to increase the efficiency of their >>> implementations, as they deem appropriate. You are free to implement as you >>> choose -- including techniques that do not require any level separators. 
>>> You are, however, duly warned in: >>> >>> >>> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators >>> >>> that "While this technique is relatively easy to implement, it can >>> interfere with other compression methods." >>> >>> it complicates the algorithm with absoluytely no benefit at all); you >>> can ALWAYS remove it completely and this still generates equivalent results. >>> >>> No you cannot ALWAYS remove it completely. Whether or not your >>> implementation can do so, depends on what other techniques you may be using >>> to increase performance, store shorter keys, or whatever else may be at >>> stake in your optimization >>> >> I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However >> you MAY ADD them ONLY when generating and encoding the sort keys, if the >> encoding used really does compress the weights into smaller values: this is >> the only case where you want to ADD a separator, internally only in the >> binary key encoder, but but as part of the algorithm itself. >> >> If your key generation does not use any compression (in the simplest >> implementations), then it can simply an directly concatenate all weights >> with the same code units size (16-bit in the DUCET), without inserting any >> additional 0000 code unit to separate them: your resulting sort key will >> still not contain any 0000 code unit in any part for any level because the >> algorithm already has excluded them. Finally this means that sort keys can >> be stored in C-strings (terminated by null code units, instead of being >> delimited by a separately encoded length property, but for C-strings where >> code units are 8-bit, i.e. "char" in C, you still need an encoder to >> convert the 16-bit binary weights into sequences of bytes not containing >> any 00 byte: if this encoder is used, still you don't need any 00 separator >> between encoded levels!). >> >> As all these 0000 weigths are unnecessary, then the current UCA algorithm >> trying to introduce them needlessly is REALLY introducing unnecessary >> confusion: values of weights NEVER need to be restricted. >> >> The only conditions that matter is that: >> - all weights are *comparable* (sign does not even matter, they are not >> even restricted to be numbers or even just integers) and that >> - they are **fully ordered**, and that the fully ordered set of weights >> (not necessarily an enumerable set or a discrete set, as this can the >> continuous set of real numbers) >> - and that the full set of weights is **fully partitioned** into >> distinct, intervals (with no intersection between intervals, so intervals >> are also comparable) >> - that the highest interval will be used by weights in the primary level: >> each partition is numbered (by the level: a positive integer between 1 and >> L): you can compare the level numbers assigned to the partition in which >> the weight is a member: if level(weight1) > level(weight2) (this is an >> comparison of positive integers), then necessarily you may have weight1 < >> weight2 (this is only comparing weights encoded arbitrarily and which can >> still use a 0 value if you wish to use it to encode a valid weight for a >> valid collation element at any level 1 to N; this is also the only >> condition needed to respect rule WF2 in UCA). 
>> ---
>> Notes about encodings for weights in sort keys:
>>
>> If weights are chosen to be rational numbers, e.g. any rationals in the
>> open interval (0.0, 1.0), then -- because a collation algorithm only
>> recognizes a finite set of distinct collation elements, hence a finite
>> number N of distinct weights w(i), for i in 0..(N-1) -- the collation
>> weights can be represented by choosing them **arbitrarily** within this
>> open interval:
>> - this can be done simply by partitioning (0.0, 1.0) into N half-open
>>   intervals [w(i), w(i+1));
>> - and then encoding a weight w(i) by any **arbitrarily chosen rational**
>>   inside one of these intervals (for example, to compress with arithmetic
>>   coding).
>>
>> A weight encoding using a finite discrete set (of binary integers between
>> 0 and M-1) is what you need for classic Huffman coding: this is equivalent
>> to multiplying the previous rationals by M and truncating to the floor
>> integer, but since this limits the choice of rationals above so that
>> distinct weights remain distinct in the binary encoding, you need to keep
>> more significant bits with Huffman coding than with arithmetic coding
>> (i.e. you need a higher value of M, where M is typically a power of 2 for
>> 1-bit code units, a power of 256 for the simpler encodings using 8-bit
>> code units, or a power of 65536 for an uncompressed encoding of 16-bit
>> weight values).
>>
>> Arithmetic coding is in fact equivalent to Huffman coding, except that M
>> is not necessarily a positive integer but can be any positive rational,
>> and it can then represent each weight value with a rational number of bits
>> on average instead of a fixed integer number of bits. Equally, you can say
>> that Huffman coding is a restriction of arithmetic coding where M must be
>> an integer, or that arithmetic coding is a generalization of Huffman
>> coding.
>>
>> Both Huffman and arithmetic coding are well-known examples of "prefix
>> coding" (the latter offering a bit more compression for the same
>> statistical distribution of encoded values). The open interval (0.0, w(0))
>> is still not used at all to encode weights, but it can still be given a
>> statistical distribution, usable with the prefix coding to represent the
>> end of string. Here again, this does not represent the artificial 0000
>> weight, which is NEVER encoded anywhere.
>>
>> ---
>>
>> Ask a mathematician you trust: he will confirm that these rules about the
>> pseudo-weight 0000 in the UCA are completely unnecessary (i.e. removing
>> them from the algorithm does not change the result of comparing strings,
>> or of generating sort keys). As a conclusion, introducing them in the
>> standard creates more confusion than it helps (in fact it is most probably
>> a relic of a former bogus *implementation* that still relied on them
>> because other well-formedness conditions were not satisfied, or not well
>> defined, in earlier attempts to define the UCA).
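The C-string remark above can be illustrated with a much simpler encoder
than Huffman or arithmetic coding: a fixed-width re-encoding of 16-bit
weights into bytes that are never 0x00, so that the key can be stored and
compared as an ordinary NUL-terminated string. A minimal sketch in Python
(the three-bytes-per-weight scheme is deliberately naive and wasteful; it is
only meant to show that no 00 separator is needed):

    def nullfree_bytes(key):
        """Encode a tuple of 16-bit weights as bytes with no 0x00 byte, while
        preserving order under plain byte-wise comparison (memcmp/strcmp).
        Each weight becomes three base-255 digits, each offset by 1."""
        out = bytearray()
        for w in key:
            out += bytes(((w // (255 * 255)) % 255 + 1,
                          (w // 255) % 255 + 1,
                          w % 255 + 1))
        return bytes(out)

    k1 = nullfree_bytes((0x3001, 0x3002, 0x0201, 0x0201, 0x0101))
    k2 = nullfree_bytes((0x3001, 0x3002, 0x0202, 0x0201, 0x0101))
    assert 0 not in k1 and 0 not in k2   # safe to store as C strings
    assert k1 < k2                       # same order as the weight tuples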
>> This is not even needed for computing "composite weights" (which does not
>> define new weights, but attempts to encode them in a larger space): that
>> can be done completely outside the standard algorithm itself. Just allow
>> weights to be rational numbers; it is then easy to extend the number of
>> encodable weights, as a single number, without increasing the numeric
>> range in which they are defined; then let the encoder of the sort key
>> generator store them with a convenient "prefix coding", using one or more
>> code units of arbitrary length.
>>
>> Philippe.
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Sun Nov  4 16:30:31 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 4 Nov 2018 22:30:31 +0000
Subject: Arranging Hieroglyphics (was: A sign/abbreviation for "magister")
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86d0roiufa.fsf@mimuw.edu.pl> <20181101215606.30dd6ced@JRWUBU2>
 <6ebfcdb9-88d6-c0c4-6eef-fbb21c2862a6@gmail.com>
 <21450a2e-5c99-fb22-edd2-9bfd12eeb1d2@gmail.com>
 <8dd95653-3d85-e619-6dd2-7099330a4d00@ix.netcom.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
Message-ID: <20181104223031.13d9cc31@JRWUBU2>

On Sat, 3 Nov 2018 22:55:17 +0100
Philippe Verdy via Unicode wrote:

> I can also cite the case of Egyptian hieroglyphs: there's still no
> way to render them correctly, because we lack the development of a
> stable orthography that would drive the encoding of the missing
> **semantic** characters (for this reason Egyptian hieroglyphs still
> require an upper-layer protocol, as there's still no accepted
> orthographic norm that successfully represents all possible semantic
> variations, but also because research on old Egyptian hieroglyphs is
> still very incomplete).

If you study the document register, you'll find that layout control
characters are being added. I think semantic characters would have depended
on the font to select the rendering consequences; this will now not happen.
What we're getting is a more rigorous version of the Manuel de Codage.

Richard.

From unicode at unicode.org  Mon Nov  5 10:46:38 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 05 Nov 2018 09:46:38 -0700
Subject: Encoding (was: Re: A sign/abbreviation for "magister")
Message-ID: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com>

Philippe Verdy wrote:

> Note that I actually propose not just one rendering for the <abbreviation
> mark> but two possible variants (that would be equally valid without
> preference).

Actually you're not proposing them. You're talking about them (at length) on
the public mailing list. If you want to propose something, you should
consider writing a proposal.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org  Mon Nov  5 17:12:33 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 6 Nov 2018 00:12:33 +0100
Subject: Encoding
In-Reply-To: 
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <2dc4370c-32ca-dc21-af71-b9787f1dcdb7@orange.fr>
 <508a901f-5497-3a64-56f7-9cf2222b0a1d@orange.fr>
 <3a0eb311-a970-c1d2-c84a-1d879d03b22c@gmail.com>
 <79c00929-7665-dfa9-b2f4-1e74fd684c66@orange.fr>
Message-ID: 

On 04/11/2018 20:19, Philippe Verdy via Unicode wrote:
[...]
> Even the mere fallback of rendering the <abbreviation mark> as a dotted
> circle (total absence of support) will not completely block reading the
> abbreviation:
>
> * you'll see "2e?" (which is still better than only "2e", with minimal
> impact) instead of
>
> * "2?" (which is worse! this is still what already happens when you use
> the legacy encoded modifier letter, which is also semantically ambiguous
> for text processing), or
>
> * "2e." (which is acceptable for rendering but semantically ambiguous for
> text processing)

I'm afraid the dotted circle instead of the .notdef box would be confusing.

> So compare things fairly: the solution I propose is EVEN MORE
> INTEROPERABLE than using <modifier letters> (which are also impossible to
> use for noting all abbreviations, as they are limited to just a few
> letters, and most of the time to only the few lowercase IPA symbols). It
> puts an end to the pressure to encode superscript letters.

Actually the existing set encompasses all Latin lowercase base letters
except q. As for putting an end to that pressure, that is also possible by
encoding the missing ones once and for all. As already stated, until the
opposite is posted authoritatively to this List, the Latin script is deemed
the only one making extensive use of superscript to denote abbreviations,
owing to a strong and long-lasting medieval practice acting as a template
on a few natural languages, namely those enumerated so far, among which
Polish.

> If you want to support other notations (e.g. chemical or mathematical
> notations, where superscripts and subscripts must both be present and
> stack together, and where variation with a dot or similar is allowed), you
> need another encoding, and the existing legacy <modifier letters> are not
> suitable either.

I don't lobby for supporting mathematics with more superscripts, but for
sure UnicodeMath would be able to use them when the set is complete. What I
did for chemical notations is to point out that chemistry seems to be
disfavored compared to mathematics, because instead of peculiar subscripts
it uses subscript Greek small letters. Three of them, as has been reported
on this List. They are being refused because they are letters of a script.
If they were fancy symbols, they would be encoded, as alchemical symbols
and mathematical symbols are.

Further, on 04/11/2018 20:51, Philippe Verdy via Unicode wrote:
[...]
> Once again you need something else for these technical notations, but
> NOT the proposed <abbreviation mark>, and NOT EVEN the existing "modifier
> letters", which were in fact first introduced only for IPA [...]
> [...] these letters are NOT conveying any semantic of an abbreviation,
> and this is also NOT the case for their usage as IPA symbols).

They do convey that semantic if used in a natural language that gives
superscript the semantics of an abbreviation. Unicode does not encode
semantics, as TUS specifies.

> There's NO interoperability at all when taking **abusively** the
> existing "modifier letters" (superscript letter or digit) for use in
> abbreviations [...].

The interoperability I mean is between formats and environments.
Interoperable in that sense is what is in the plain text backbone.

> Keep these "modifier letters" (superscript letters, digits or
> punctuation) for use as plain letters or plain digits or plain
> punctuation or plain symbols (including IPA) in natural languages.

That is what I'm suggesting to do: superscript letters are plain
abbreviation indicators, notably ordinal indicators and indicators in other
abbreviations, used in natural languages.

> Anything else is abusive and should be considered only as "legacy"
> encoding, not recommended at all in natural languages.
Put "traditional" in the place of "legacy", and you will come close to what is actually going on when coding palaeographic texts is achieved using purposely encoded Latin superscripts. The same applies to living languages, because it is interoperable and fits therefore Unicode quality standards about digitally representing the world?s languages. Finally, on 04/11/2018 21:59, Philippe Verdy via Unicode wrote: > > I can take another example about what I call "legacy encoding" (which > really means that such encoding is just an "approximation" from which > no semantic can be clearly infered, except by using a non-determinist > heuristic, which can frequently make "false guesses"). > > Consider the case of the legacy Hangul "half-width" jamos: [?] > > The same can be said about the heuristics that attempt to infer an > abbreviation semantic from existing superscript letters (either > encoded in Unicode, or encoded as plain letters modified by > superscripting style in CSS or HTML, or in word processors for > example): it fails to give the correct guess most of the time if > there's no user to confirm the actual intended meaning I don?t agree: As opposed to baseline fallbacks, Unicode superscripts allow the reader to parse the string as an abbreviation, and machines can be programmed to act likewise. > > Such confirmation is the job of spell correctors in word processors: > [?] the user may type "Mr." then the wavy line will appear under > these 3 characters, the spell checker will propose to encode it as an > abbreviation "Mr" or leave "Mr." > unchanged (and no longer signaled) in which case the dot remains a > regular punctuation, and the "r" is not modified. Then the user may > choose to style the "r" with superscripting or underlining, and a new > wavy red underline will appear below the three characters "M r>.", proposing to only transform the as > or and even when the user accepts one of > these suggestions it will remain "M." or > "M." where it is still possible to infer the > semantics of an abbreviation (propose to replace or keep the dot > after it), or doing nothing else and cancel these suggestions (to > hide the wavy red underline hint, added by the spell checker), or > instruct the spell checker that the meaning of the superscript r is > that of a mathematical exponent, or a chemical a notation. That mainly illustrates why is not interoperable. The input process seems to be too complicated. And if a base letter is to be transformed to formatted superscript, you do need OpenType, much like with U+2044 FRACTION SLASH behaving as intended, ie transforming the preceding digit string to formatted numerator digits, and the following to denominator digit glyphs. In that, U+2044 acts as a format control, and so does that you are suggesting to encode. > > In all cases, the user/author has full control of the intended > meaning of his text and an informed decision is made where all cases > are now distinguished. "Legacy" encoding can be kept as is (in > Unicode), even if it's no longer recommended, just like Unicode has > documented that half-width Hangul is deprecated (it just offers a > "compatibility decomposition" for NFKD or NFKC, but this is lossy and > cannot be done automatically without a human decision). 
> > And the user/author can now freely and easily compose any > abbreviation he wishes in natural languages, without being limited by > the reduced "legacy" set of encoded in Unicode So far as the full Latin lowercase alphabet, and for use in all-caps only, eventually the full Latin uppercase alphabet are encoded, I can see nothing of a limitation, given these letters have the grapheme cluster base property and therefore work with all combining diacritics. That is already working with good font support, as demonstrated in the parent thread. > (which should no longer be extended, except for use as distinct plain > letters needed in alphabets of actual natural languages, or as > possibly new IPA symbols), One should be able to overcome the pattern tagging superscripts as not being ?plain letters?, because that is irrelevant when they are used as abbreviation indicators in natural languages, and as such are plain characters, like eg the Romance ordinal indicators U+00AA and U+00BA; see also the DEGREE SIGN hijacked as a substitute of because not superscripting the o in "n?" is considered inacceptable. > and without using the styling tricks (of > HTML/CSS, or of word processor documents, spreadsheets, presentation > documents allowing "'rich text" formats on top of "plain text") which > are best suitable for "free styling" of any human text, without any > additional semantics, [?] Yes I fully agree, if ?semantics? is that required for readability in accordance with standard orthographies in use. Best regards, Marcel From unicode at unicode.org Mon Nov 5 17:32:32 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 6 Nov 2018 00:32:32 +0100 Subject: Encoding (was: Re: A sign/abbreviation for "magister") In-Reply-To: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com> References: <20181105094638.665a7a7059d7ee80bb4d670165c8327d.9d86d4e255.wbe@email03.godaddy.com> Message-ID: On 05/11/2018 17:46, Doug Ewell via Unicode wrote: > > Philippe Verdy wrote: > >> Note that I actually propose not just one rendering for the > abbrevaition mark> but two possible variants (that would be equally >> valid withou preference). > > Actually you're not proposing them. You're talking about them (at > length) on the public mailing list. If you want to propose something, > you should consider writing a proposal. The accepted meaning of "to propose" is not limited to the technical sense it is used with respect to Unicode. Also, Philippe and I are both influenced by our French locale, where "je propose" has pretty wide semantics. To conform with Unicode terminology, simply think "suggest", as in: ?Note that I actually suggest not just one rendering [?].? Thanks anyway for encouraging Philippe Verdy to submit the related encodingproposal. Best regards, Marcel From unicode at unicode.org Tue Nov 6 04:56:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 11:56:35 +0100 Subject: A sign/abbreviation for "magister" - first question summary Message-ID: <86zhumwk2k.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". > > First question is: how do you interpret the symbol? 
For me it is > definitely the capital M followed by the superscript "r" (written in an > old style no longer used in Poland), but there is something below the > superscript. It looks like a small "z", but such an interpretation > doesn't make sense for me. I've got almost immediately two complementary answers: On Sat, Oct 27 2018 at 9:11 -0400, Robert Wheelock wrote: > It is constructed much like the symbol for numero?only with a capital > accompanied by a superscript small > having an underbar (or > double underbar). On Sat, Oct 27 2018 at 6:58 -0700, Asmus Freytag via Unicode wrote: [...] > My suspicion would be that the small "z" is rather a "=" that > acquired a connecting stroke as part of quick handwriting. A./ and on the same day this interpretation was supported by Philippe Verdy: On Sat, Oct 27 2018 at 20:35 +0200, Philippe Verdy via Unicode wrote: [...] > I have the same kind of reading, the zigzagging stroek is an > hnadwritten emphasis of the uperscript r above it (explicitly noting > it is terminating the abbreviation), jut like the small underline that > happens sometimes below the superscript o in the abbreviation of > "numero" (as well sometimes there was not just one but two small > underlines, including in some prints). > > This sample is a perfect example of fast cursive handwritting (due to > high variability of all other letter shapes, sizes and joinings, where > even the capital M is written as two unconnected strokes), and it's > not abnormal to see in such condition this cursive joining between the > two underlining strokes so that it looks like a single zigzag. Later it was summarized by James Kass: On Fri, Nov 02 2018 at 2:59 GMT, James Kass via Unicode wrote: > Alphabetic script users write things the way they are spelled and > spell things the way they are written.? The abbreviation in question > as written consists of three recognizable symbols.? An "M", a > superscript "r", and an equal sign (= two lines).? It can be printed, > handwritten, or in fraktur; it will still consist of those same three > recognizable symbols. > > We're supposed to be preserving the past, not editing it or revising > it. It was commented by Julian Bradfield: On Fri, Nov 02 2018 at 8:54 GMT, Julian Bradfield via Unicode wrote: [...] > That's not true. The squiggle under the r is a squiggle - it is a > matter of interpretation (on which there was some discussion a hundred > messages up-thread or so :) whether it was intended to be = . > Just as it is a matter of interpretation whether the superscript and > squiggle were deeply meaningful to the writer, or whether they were > just a stylistic flourish for Mr. The abbreviation in question definitely consists of three symbols: an "M", a superscript "r" and the third one, which I think was best described by Robert Wheelock as double (under)bar, with the connecting stroke mentioned first by Asmus Freytag. This third element was referred to, also by myself, as a squiggle, but after looking up the definition of the word in a dictionary a short line that has been written or drawn and that curves and twists in a way that is not regular I think this is a misnomer. Unfortunately I have no better proposal. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Nov 6 04:59:23 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 11:59:23 +0100 Subject: A sign/abbreviation for "magister" - second question summary Message-ID: <86lg66wjxw.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". [...] > The second question is: are you familiar with such or a similar symbol? > Have you ever seen it in print? Later I provided some additional information: On Sat, Oct 27 2018 at 16:09 +0200, Janusz S. Bie? via Unicode wrote: > > The postcard is from the front of the first WW written by an > Austro-Hungarian soldier. He explaines the meaning of the abbreviation > to his wife, so looks like the abbreviation was used but not very > popular. On Sat, Oct 27 2018 at 20:25 +0200, Janusz S. Bie? via Unicode wrote: [...] > In the meantime I looked up some other postcards written by the same > person i found several other abbreviation including ? 'NUMERO SIGN' > (U+2116) written in the same way, i.e. with a double instead of a single > line. The similarity to ? 'NUMERO SIGN' was mentioned quite often in the thread, there seem to be no need to quote all this mentions here. A more general observation was formulated by Richard Wordingham: On Sun, Oct 28 2018 at 8:13 GMT, Richard Wordingham via Unicode wrote: [...] > The notation is a quite widespread format for abbreviations. the > first letter is normal sized, and the subsequent letter is written in > some variety of superscript with a squiggle underneath so that it > doesn't get overlooked. Various examples of such abbreviations were also mentioned several times in the thread, but again there seem to be no need to quote all this mentions here. Nobody however reported any other occurence of the symbol in question. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Nov 6 05:04:02 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 06 Nov 2018 12:04:02 +0100 Subject: A sign/abbreviation for "magister" - third question summary Message-ID: <86h8guwjq5.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". > [...] > The third and the last question is: how to encode this symbol in > Unicode? A constructive answer to my question was provided quickly by James Kass: On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: > Mr? / M=? I answered: On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bie? via Unicode wrote: [...] > For me only the latter seems acceptable. Using COMBINING LATIN SMALL > LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as > the base character. However in the lack of a better solution I can live > with it :-) > > An alternative would be to use SMALL EQUALS SIGN, but looks like fonts > supporting it are rather rare. and Philippe Verdy commented: On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote: [...] > > There's a third alternative, that uses the superscript letter r, > followed by the combining double underline, instead of the normal > letter r followed by the same combining double underline. 
Some comments were made also by Michael Everson: On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: [...] > I would encode this as M? if you wanted to make sure your data > contained the abbreviation mark. It would not make sense to encode it > as M=? or anything else like that, because the ?r? is not modifying a > dot or a squiggle or an equals sign. The dot or squiggle or equals > sign has no meaning at all. And I would not encode it as Mr?, firstly > because it would never render properly and you might as well encode it > as Mr. or M:r, and second because in the IPA at least that character > indicates an alveolar realization in disordered speech. (Of course it > could be used for anything.) FYI, I decided to use the encoding proposed by Philippe Verdy (if I understand him correctly): M?? i.e. 'LATIN CAPITAL LETTER M' (U+004D) 'MODIFIER LETTER SMALL R' (U+02B3) 'COMBINING DOUBLE LOW LINE' (U+0333) for purely pragmatic reasons: it is rendered quite well in my Emacs. According to the 'fc-search-codepoint" script, the sequence is supported on my computer by almost 150 fonts, so I hope to find in due time a way to render it correctly also in XeTeX. I'm also going to add it to my private named sequences list (https://bitbucket.org/jsbien/unicode4polish). The same post contained a statement which I don't accept: On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: [...] > The squiggle in your sample, Janusz, does not indicate anything; it is > only a decoration, and the abbreviation is the same without it. One of the reasons I disagree was described by me in the separate thread "use vs mention": https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html There were also some other statements which I find unacceptable: On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: [...] > The abbreviation in the postcard, rendered in plain text, is "Mr". He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at 9:38 GMT (and earlier in a private mail). I understand that both of them by "plane text" mean Unicode. On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > You could use the various hacks you've discussed, with modifier > letters; but that is not "encoding", that is "abusing Unicode to do > markup". At least, that's the view I take! and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 -0700. The latter elaborated his view later and I answered: On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bie? via Unicode wrote: > On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: [...] >> All else is just applying visual hacks > > I don't mind hacks if they are useful and serve the intended purpose, > even if they are visual :-) [...] >> at the possible cost of obscuring the contents. > > It's for the users of the transcription to decide what is obscuring the > text and what, to the contrary, makes the transcription more readable > and useful. Please note that it's me who makes the transcription, it's me who has a vision of the future use and users, and in consequence it's me who makes the decision which aspects of text to encode. Accusing me of "abusing Unicode" will not stop me from doing it my way. I hope that at least James Kass understands my attitude: On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote: [...] > If I were entering plain text data from an old post card, I'd try to > keep the data as close to the source as possible. Because that would > be my purpose. Others might have different purposes. 
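For reference, a small Python sketch (using only the standard unicodedata
module) of the sequence settled on above, U+004D U+02B3 U+0333, together
with the superscript-letter inventory the thread keeps coming back to; the
"every lowercase letter except q" result reflects the Unicode database
current at the time of this thread and may differ in later versions:

    import sys
    import unicodedata as ud

    # The sequence chosen for the "magister" abbreviation:
    # LATIN CAPITAL LETTER M + MODIFIER LETTER SMALL R + COMBINING DOUBLE LOW LINE
    mgr = "\u004D\u02B3\u0333"
    for ch in mgr:
        print(f"U+{ord(ch):04X} {ud.name(ch)} (ccc={ud.combining(ch)})")

    # U+02B3 carries a <super> compatibility mapping to 'r', so a
    # compatibility fold such as NFKC turns the abbreviation back into
    # plain "Mr" with the double low line left on the 'r':
    print(ud.normalize("NFKC", mgr))          # 'M' 'r' U+0333

    # Which ASCII lowercase letters have a preformatted superscript form,
    # i.e. some character whose compatibility decomposition is "<super> x"?
    supers = set()
    for cp in range(sys.maxunicode + 1):
        d = ud.decomposition(chr(cp)).split()
        if d[:1] == ["<super>"] and len(d) == 2:
            base = chr(int(d[1], 16))
            if "a" <= base <= "z":
                supers.add(base)
    print(sorted(set("abcdefghijklmnopqrstuvwxyz") - supers))   # ['q'] here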
There were presented also some ideas which I would call "futuristic": introducing a new combining character and using variations sequences. This ideas should be discussed in separate threads, which seems to happen now. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Nov 7 13:49:38 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 7 Nov 2018 20:49:38 +0100 Subject: Preformatted superscript in ordinary text, paleography and phonetics using Latin script (was: Re: A sign/abbreviation for "magister" - third question summary) In-Reply-To: <86h8guwjq5.fsf@mimuw.edu.pl> References: <86h8guwjq5.fsf@mimuw.edu.pl> Message-ID: On 06/11/2018 12:04, Janusz S. Bie? via Unicode wrote: > > On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bie? via Unicode wrote: >> Hi! >> >> On the over 100 years old postcard >> >> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 >> >> you can see 2 occurences of a symbol which is explicitely explained (in >> Polish) as meaning "Magister". >> > > [...] > >> The third and the last question is: how to encode this symbol in >> Unicode? > > > A constructive answer to my question was provided quickly by James Kass: > > On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: >> Mr? / M=? > > I answered: > > On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bie? via Unicode wrote: > > [...] > >> For me only the latter seems acceptable. Using COMBINING LATIN SMALL >> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as >> the base character. However in the lack of a better solution I can live >> with it :-) >> >> An alternative would be to use SMALL EQUALS SIGN, but looks like fonts >> supporting it are rather rare. > > and Philippe Verdy commented: > > On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote: > > [...] > >> >> There's a third alternative, that uses the superscript letter r, >> followed by the combining double underline, instead of the normal >> letter r followed by the same combining double underline. > > Some comments were made also by Michael Everson: > > On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: > > [...] > >> I would encode this as M? if you wanted to make sure your data >> contained the abbreviation mark. It would not make sense to encode it >> as M=? or anything else like that, because the ?r? is not modifying a >> dot or a squiggle or an equals sign. The dot or squiggle or equals >> sign has no meaning at all. And I would not encode it as Mr?, firstly >> because it would never render properly and you might as well encode it >> as Mr. or M:r, and second because in the IPA at least that character >> indicates an alveolar realization in disordered speech. (Of course it >> could be used for anything.) > > FYI, I decided to use the encoding proposed by Philippe Verdy (if I > understand him correctly): > > M?? > > i.e. > > 'LATIN CAPITAL LETTER M' (U+004D) > 'MODIFIER LETTER SMALL R' (U+02B3) > 'COMBINING DOUBLE LOW LINE' (U+0333) > > for purely pragmatic reasons: it is rendered quite well in my > Emacs. According to the 'fc-search-codepoint" script, the sequence is > supported on my computer by almost 150 fonts, so I hope to find in due > time a way to render it correctly also in XeTeX. I'm also going to add > it to my private named sequences list > (https://bitbucket.org/jsbien/unicode4polish). 
> > The same post contained a statement which I don't accept: > > On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote: > > [...] > >> The squiggle in your sample, Janusz, does not indicate anything; it is >> only a decoration, and the abbreviation is the same without it. > > One of the reasons I disagree was described by me in the separate thread > "use vs mention": > > https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html > > There were also some other statements which I find unacceptable: > > On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: > > [...] > >> The abbreviation in the postcard, rendered in plain text, is "Mr". > > He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at > 9:38 GMT (and earlier in a private mail). > > I understand that both of them by "plane text" mean Unicode. > > > On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > >> You could use the various hacks you've discussed, with modifier >> letters; but that is not "encoding", that is "abusing Unicode to do >> markup". At least, that's the view I take! > > and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 > -0700. > > The latter elaborated his view later and I answered: > > On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bie? via Unicode wrote: >> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: > > [...] > >>> All else is just applying visual hacks >> >> I don't mind hacks if they are useful and serve the intended purpose, >> even if they are visual :-) > > [...] > >>> at the possible cost of obscuring the contents. >> >> It's for the users of the transcription to decide what is obscuring the >> text and what, to the contrary, makes the transcription more readable >> and useful. > > Please note that it's me who makes the transcription, it's me who has a > vision of the future use and users, and in consequence it's me who makes > the decision which aspects of text to encode. Accusing me of "abusing > Unicode" will not stop me from doing it my way. > > I hope that at least James Kass understands my attitude: > > On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote: > > [...] > >> If I were entering plain text data from an old post card, I'd try to >> keep the data as close to the source as possible. Because that would >> be my purpose. Others might have different purposes. > > There were presented also some ideas which I would call "futuristic": > introducing a new combining character and using variations sequences. > This ideas should be discussed in separate threads, which seems to > happen now. Thank you for debriefing. So far I?m pleased to infer that the outlined outcome encounters general agreement. It?s probably safe to conjecture that the case of the Polish abbreviation for "magister" is becoming a textbook example of the reception of the discussed Unicode policy with respect to superscript. Best regards, Marcel From unicode at unicode.org Fri Nov 9 06:42:54 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 9 Nov 2018 07:42:54 -0500 Subject: Aleph-umlaut Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: fmbdjcbgbjcgdjbe.png Type: image/png Size: 24681 bytes Desc: not available URL: From unicode at unicode.org Fri Nov 9 17:25:54 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Sat, 10 Nov 2018 00:25:54 +0100 Subject: Aleph-umlaut In-Reply-To: References: Message-ID: <20181110002554.0334d757@spixxi> Dear Mark, I found another sample here: https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf On page 86 it says that the aleph with diaresis is a number with the value 1000. See also the attached clipping. A second source is the Brown-Driver-Briggs Hebrew-English Lexicon of the Old Testament which also mentions that ??? ?means 1000, but there were no evidence of this usage in Old Testament times. See here (the very first lemma): www.biblab.com/students/dizionari/Brown-Driver-Briggs%20Hb-En%20Dic.docx Yet another usage in a mathematical context of an aleph with umlaut can be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut is used to mark the second derivative. https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter (page 28-29 or slide 41-42) However, seems that there is no real font support for these characters, though. The only font on my computer, which could render aleph + umlaut correctly on my system was Unifont and roughly Linux Libertine. Other fonts, in particular Arial, DejaVu Sans, Liberation Sans and Linux Biolinum rendered the diaeresis to much far to the left. I even found a user has a similar issue with U+0308, here: http://smontagu.org/writings/HebrewNumbers.html Maybe adding an annotation to U+0308 could sensitize font designers to be aware that this combining character is also used in the Hebrew alphabet. My suggestion is to add the annotation ?= hewbrew thousands multiplier? to U+0308 COMBINING DIAERESIS and a reference from 05B5 ?? HEBREW POINT TSERE to U+0308. Best regards, Marius On Fri, 9 Nov 2018 07:42:54 -0500 "Mark E. Shoulson via Unicode" wrote: > Noticed something really fascinating in an old pamphlet I was > reading.? It's from 1922, in Hebrew mostly but with some Yiddish at > the end.? The Yiddish spelling is not according to more modern > standardization, but seems to be significantly more faithful to the > German spellings of the same words, replacing Latin letters with > Hebrew ones more than respelling phonetically.? And there are even > places where it appears they represented a German ? with a Hebrew > aleph?with an umlaut!? Actually it looks a little more like a double > acute accent but that's surely a style choice, since it obviously is > mapping to an umlaut. > > > > (Note also the spelling ???, a calque for German "die", where modern > Yiddish would spell it phonetically as ??.) > > > I do NOT think this needs any special encoding, btw.? I would > probably encode this as simply U+05D0 U+0308 (??).? Combining symbols > do not (necessarily) belong to a specific alphabet, and the fact that > most fonts would render this badly is a different issue.? I just > thought the people here might find it interesting. > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > > ~mark > -------------- next part -------------- A non-text attachment was scrubbed... 
Name: aleph_umlaut.png Type: image/png Size: 40318 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 488 bytes Desc: Digitale Signatur von OpenPGP URL: From unicode at unicode.org Fri Nov 9 18:02:33 2018 From: unicode at unicode.org (Tex via Unicode) Date: Fri, 9 Nov 2018 16:02:33 -0800 Subject: Aleph-umlaut In-Reply-To: <20181110002554.0334d757@spixxi> References: <20181110002554.0334d757@spixxi> Message-ID: <000c01d47888$aedea820$0c9bf860$@xencraft.com> My notes on Hebrew numbers on http://www.i18nguy.com/unicode/hebrew-numbers.html include: "Using letters for numbers, there is the possibility of confusion as to whether a string of letters is a word or a numerical value. Therefore, when numbers are used with text, punctuation marks are added to distinguish their numerical meaning. Single character numbers (numbers less than 10) add the punctuation character geresh after the numeric character. Larger numbers insert the punctuation character gershayim before the last character in the number." So perhaps Alef with diaeresis is a collapsed form of Alef followed by Gershayim when it is used as a numeric value. I wonder if that may also occur for other values. (I am just speculating.) Tex -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marius Spix via Unicode Sent: Friday, November 9, 2018 3:26 PM To: unicode at unicode.org Cc: Mark E. Shoulson Subject: Re: Aleph-umlaut Dear Mark, I found another sample here: https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf On page 86 it says that the aleph with diaresis is a number with the value 1000. See also the attached clipping. A second source is the Brown-Driver-Briggs Hebrew-English Lexicon of the Old Testament which also mentions that ??? ?means 1000, but there were no evidence of this usage in Old Testament times. See here (the very first lemma): www.biblab.com/students/dizionari/Brown-Driver-Briggs%20Hb-En%20Dic.docx Yet another usage in a mathematical context of an aleph with umlaut can be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut is used to mark the second derivative. https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter (page 28-29 or slide 41-42) However, seems that there is no real font support for these characters, though. The only font on my computer, which could render aleph + umlaut correctly on my system was Unifont and roughly Linux Libertine. Other fonts, in particular Arial, DejaVu Sans, Liberation Sans and Linux Biolinum rendered the diaeresis to much far to the left. I even found a user has a similar issue with U+0308, here: http://smontagu.org/writings/HebrewNumbers.html Maybe adding an annotation to U+0308 could sensitize font designers to be aware that this combining character is also used in the Hebrew alphabet. My suggestion is to add the annotation ?= hewbrew thousands multiplier? to U+0308 COMBINING DIAERESIS and a reference from 05B5 ?? HEBREW POINT TSERE to U+0308. Best regards, Marius On Fri, 9 Nov 2018 07:42:54 -0500 "Mark E. Shoulson via Unicode" wrote: > Noticed something really fascinating in an old pamphlet I was reading. > It's from 1922, in Hebrew mostly but with some Yiddish at the end. 
> The Yiddish spelling is not according to more modern standardization, > but seems to be significantly more faithful to the German spellings of > the same words, replacing Latin letters with Hebrew ones more than > respelling phonetically. And there are even places where it appears > they represented a German ? with a Hebrew aleph?with an umlaut! > Actually it looks a little more like a double acute accent but that's > surely a style choice, since it obviously is mapping to an umlaut. > > > > (Note also the spelling ???, a calque for German "die", where modern > Yiddish would spell it phonetically as ??.) > > > I do NOT think this needs any special encoding, btw. I would probably > encode this as simply U+05D0 U+0308 (??). Combining symbols do not > (necessarily) belong to a specific alphabet, and the fact that most > fonts would render this badly is a different issue. I just thought > the people here might find it interesting. > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36 > 609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > > ~mark > From unicode at unicode.org Sat Nov 10 00:25:36 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 10 Nov 2018 06:25:36 +0000 Subject: Aleph-umlaut In-Reply-To: <000c01d47888$aedea820$0c9bf860$@xencraft.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> Message-ID: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> In the last pages of the text linked by Mark E. Shoulson, both the gershayim and the aleph-umlaut are shown.? A quick look didn't find any other base letter with the combining umlaut. -------------- next part -------------- A non-text attachment was scrubbed... Name: YiddishUmlaut~2.PNG Type: image/png Size: 80947 bytes Desc: not available URL: From unicode at unicode.org Sat Nov 10 09:28:08 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sat, 10 Nov 2018 10:28:08 -0500 Subject: Aleph-umlaut In-Reply-To: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: Hi Everyone, Are we sure this is actually Yiddish? To me it looks like it could be German transliterated into the Yiddish/Hebrew alphabet. I can spend a little more time with it and put together some examples. Beth On Sat, Nov 10, 2018 at 1:28 AM James Kass via Unicode wrote: > > In the last pages of the text linked by Mark E. Shoulson, both the > gershayim and the aleph-umlaut are shown. A quick look didn't find any > other base letter with the combining umlaut. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Nov 10 18:54:15 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 19:54:15 -0500 Subject: Aleph-umlaut In-Reply-To: <20181110002554.0334d757@spixxi> References: <20181110002554.0334d757@spixxi> Message-ID: On 11/9/18 6:25 PM, Marius Spix via Unicode wrote: > Dear Mark, > > I found another sample here: > https://www.marketscreener.com/BRILL-5240571/pdf/61308/Brill_Report.pdf > > On page 86 it says that the aleph with diaresis is a number with > the value 1000. That's true, I've heard of that, and even occasionally seen it. 
And sometimes in old printings things like a diaeresis or a dot above were used where later Hebrew uses a U+05F3 HEBREW PUNCTUATION GERESH or U+05F4 HEBREW PUNCTUATION GERSHAYIM.? I think what struck me about this one was that this was not just something that looked like a diaeresis/umlaut, it really WAS an umlaut, a direct transcoding of the a-umlaut in Latin letters into aleph-umlaut in Hebrew letters. > Yet another usage in a mathematical context of an aleph with umlaut can > be found here, however they used U+2135 ALEF SYMBOL instead of U+05D0 > HEBREW LETTER ALEF. This is not related to the value 1000, as the umlaut > is used to mark the second derivative. > https://de.slideshare.net/StephenAhiante/dynamics-modelling-design-of-a-quadcopter > (page 28-29 or slide 41-42) Kind of an odd usage, since ALEF SYMBOL is usually used for transfinite cardinals, as in ??, and you don't normally take time-derivatives of those.? But mathematicians love using weird symbols for whatever they like.? This is the mathematical usage of two-dots-above, as you note. ~mark From unicode at unicode.org Sat Nov 10 19:03:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:03:11 -0500 Subject: Aleph-umlaut In-Reply-To: <000c01d47888$aedea820$0c9bf860$@xencraft.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> Message-ID: <6d22f2a1-7daa-afc1-d537-c20e6816bbe2@kli.org> On 11/9/18 7:02 PM, Tex via Unicode wrote: > My notes on Hebrew numbers on http://www.i18nguy.com/unicode/hebrew-numbers.html include: > > "Using letters for numbers, there is the possibility of confusion as to whether a string of letters is a word or a numerical value. Therefore, when numbers are used with text, punctuation marks are added to distinguish their numerical meaning. Single character numbers (numbers less than 10) add the punctuation character geresh after the numeric character. Larger numbers insert the punctuation character gershayim before the last character in the number." > > So perhaps Alef with diaeresis is a collapsed form of Alef followed by Gershayim when it is used as a numeric value. I wonder if that may also occur for other values. I don't know that it's a "collapsed" form.? I think the double-dotted form is just an alternate one, and one that was more popular in older times.? Standardized Hebrew numerical usage would be to use a GERESH (not a GERSHAYIM) after an ALEF to indicate a thousand; GERSHAYIM is used before the last letter in a number that is "large" generally in the sense of the number of letters (i.e. more than one or two).? Since GERESH is also used for single-letter numbers, this means that ?? could mean "one" (much more common) or "one thousand".? The GERESH-after becomes useful in something like the full number of the year, ??????? where it sets off the initial 5, making it 5000 (this notation is not place-value, but there is a usual ordering, so technically it would (usually) be understandable even without the punctuation marks, due to the out-of-order placement of the initial HE). Again, what interested me about this usage was that it really *was* an umlaut.? But yes, there are other situations where such a thing could happen. ~mark From unicode at unicode.org Sat Nov 10 19:17:35 2018 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:17:35 -0500 Subject: Aleph-umlaut In-Reply-To: <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: On 11/10/18 1:25 AM, James Kass via Unicode wrote: > > In the last pages of the text linked by Mark E. Shoulson, both the > gershayim and the aleph-umlaut are shown.? A quick look didn't find > any other base letter with the combining umlaut. > Indeed; there is no shortage of use of the GERSHAYIM, used as it normally is, to indicate abbreviations.? The umlaut on the alef is used specifically in the Yiddish parts, to be an umlaut (the word with the GERSHAYIM on the top line is an abbreviation for the phrase for a legal court or authority; the word on the second like transliterates apparently to "best?tigt"; someone with better German than me can make more sense of it.? The example I sent at first used the word "legalit?t", which even I can understand as "legality" or something like that.)? I think the Yiddish at the time may already not have had ? or ? sounds, so had no need to transliterate those (or maybe there just happened not to be a need for them in this text); certainly I see Yiddish spellings like ?????? ("oyf-") where German would have "auf". ~mark From unicode at unicode.org Sat Nov 10 19:49:46 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 20:49:46 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: On 11/10/18 10:28 AM, Beth Myre via Unicode wrote: > Hi Everyone, > > Are we sure this is actually Yiddish?? To me it looks like it could be > German transliterated into the Yiddish/Hebrew alphabet. > > I can spend a little more time with it and put together some examples. > > Beth Is there really a difference?? In a very real sense, Yiddish *IS* a form of German (I'm told it's Middle High German, but I don't actually have knowledge about that), with a strong admixture of Hebrew and Russian and a few other languages, and which is usually written using Hebrew letters.? There's probably something like a continuum with "Yiddish" and "German" as ranges or points. Is the text *standard* German written with Hebrew letters?? I don't think so.? Let's see, on the next-to-last page, end of first paragraph, I see the phrase ?????????????? ????????????, which would transliterate to "oytorit?ten bekr?fting"?with umlauted "a", but "oy-" instead of "au-" at the beginning.? OK, I know in German "au" can be pronounced "oy-" sometimes (I think), but at least https://en.wiktionary.org/wiki/Autorit%C3%A4t implies that this isn't the usual/standard pronunciation (I make no claims as to expertise in German).? The text is littered with terms like ????, abbreviation for Hebrew ??? ???, "house of judgment" or legal court, pronounced in Yiddish "beisdin", or ??? (can't be German as it has no vowels!) meaning "legal decision," from Hebrew?Hebrew-derived words in Yiddish do not change their spelling, as a rule.? 
There are definitely German spelling features that are not found in later spellings, for example, double letters in German are written double in the Yiddish spelling too, which is quite unusual (you're used to letters in Hebrew never being silent or even geminate, but always having at least a semi-syllable sound between like letters, except in special cases, so it seems striking to see ???? for a simple two syllables). So I'm not sure if there's a *real* answer to your question, but it does look to me like this isn't "normal" German, at least.? And would it matter, anyway? ~mark From unicode at unicode.org Sat Nov 10 20:24:34 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sat, 10 Nov 2018 21:24:34 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> Message-ID: <22c4c73c-a6ef-7d0c-3b8c-2e6e3e21e57e@kli.org> Oh yeah, fun fact about this document that I linked at the outset: I found it like 10 years ago when researching something unrelated... it apparently is a ruling opposing an earlier announcement by another group of Rabbis, declaring it void.? And looking at the rabbis they say are supporting them in this decision, I see they mention Rabbi Joseph Rosen, chief Rabbi of "Wisloch".? And I think to myself, "How interesting.? I have a great-grandfather who was named Rabbi Joseph Rosen, chief Rabbi of a town called Swisloch" (with an S; presumably an error in the pamphlet.)? I checked with my father; the timing is about right, would have been shortly before he came to America.? The Internet moves in mysterious ways. ~mark From unicode at unicode.org Sun Nov 11 05:33:20 2018 From: unicode at unicode.org (Otto Stolz via Unicode) Date: Sun, 11 Nov 2018 12:33:20 +0100 Subject: Aleph-umlaut In-Reply-To: References: Message-ID: <632ae659-cf2e-65a5-64fd-cc94651c6f9f@uni-konstanz.de> Am 2018-11-09 um 13:42 schrieb Mark E. Shoulson via Unicode: > > Noticed something really fascinating in an old pamphlet I was reading > really interesting, thanks! > > > (Link is > http://rosetta.nli.org.il/delivery/DeliveryManagerServlet?dps_pid=IE36609604&_ga=2.182410660.2074158760.1541729874-1781407841.1541729874 > look at the last few pages.) > > To me, this link delivers an empty document. Please check the spelling of the URL. Best wishes, ?? Otto From unicode at unicode.org Sun Nov 11 00:03:15 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sun, 11 Nov 2018 01:03:15 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: Hi Mark, This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include *beit din* or *p'sak* in an English text. Here's a paragraph from page 22: [image: Paragraph.jpg] I (re-)transliterated it, and it reads: Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: a)? That's just German. 
Something like - We know that we can't expect repentance or insight from the other party, and that they will disregard the consequences of these rabbinical reports because: a)? I only know a little Yiddish (one semester a long time ago), but I think Yiddish word order would be very different. Also, 'we are' would be 'mir zaynen' instead of 'wir sind,' 'and' would be 'un' instead of 'und,' etc. Beth On Sat, Nov 10, 2018 at 8:51 PM Mark E. Shoulson via Unicode < unicode at unicode.org> wrote: > On 11/10/18 10:28 AM, Beth Myre via Unicode wrote: > > Hi Everyone, > > > > Are we sure this is actually Yiddish? To me it looks like it could be > > German transliterated into the Yiddish/Hebrew alphabet. > > > > I can spend a little more time with it and put together some examples. > > > > Beth > > Is there really a difference? In a very real sense, Yiddish *IS* a form > of German (I'm told it's Middle High German, but I don't actually have > knowledge about that), with a strong admixture of Hebrew and Russian and > a few other languages, and which is usually written using Hebrew > letters. There's probably something like a continuum with "Yiddish" and > "German" as ranges or points. > > Is the text *standard* German written with Hebrew letters? I don't > think so. Let's see, on the next-to-last page, end of first paragraph, > I see the phrase ?????????????? ????????????, which would transliterate > to "oytorit?ten bekr?fting"?with umlauted "a", but "oy-" instead of > "au-" at the beginning. OK, I know in German "au" can be pronounced > "oy-" sometimes (I think), but at least > https://en.wiktionary.org/wiki/Autorit%C3%A4t implies that this isn't > the usual/standard pronunciation (I make no claims as to expertise in > German). The text is littered with terms like ????, abbreviation for > Hebrew ??? ???, "house of judgment" or legal court, pronounced in > Yiddish "beisdin", or ??? (can't be German as it has no vowels!) meaning > "legal decision," from Hebrew?Hebrew-derived words in Yiddish do not > change their spelling, as a rule. There are definitely German spelling > features that are not found in later spellings, for example, double > letters in German are written double in the Yiddish spelling too, which > is quite unusual (you're used to letters in Hebrew never being silent or > even geminate, but always having at least a semi-syllable sound between > like letters, except in special cases, so it seems striking to see ???? > for a simple two syllables). > > So I'm not sure if there's a *real* answer to your question, but it does > look to me like this isn't "normal" German, at least. And would it > matter, anyway? > > ~mark > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Paragraph.jpg Type: image/jpeg Size: 181354 bytes Desc: not available URL: From unicode at unicode.org Sun Nov 11 13:42:53 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 11:42:53 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <1805fb55-9e34-c7ac-f5f0-ced99cd3b163@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Nov 11 14:32:29 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 11 Nov 2018 21:32:29 +0100 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: > On 11 Nov 2018, at 07:03, Beth Myre via Unicode wrote: > > Hi Mark, > > This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. > > Here's a paragraph from page 22: Actually page 21. > > > > I (re-)transliterated it, and it reads: Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : From unicode at unicode.org Sun Nov 11 15:16:01 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 13:16:01 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 11 15:37:10 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Sun, 11 Nov 2018 22:37:10 +0100 Subject: Aleph-umlaut In-Reply-To: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: <0532C015-564D-4451-9101-44F75DA535E8@telia.com> > On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: > > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >> >>> On 11 Nov 2018, at 07:03, Beth Myre via Unicode >>> wrote: >>> >>> Hi Mark, >>> >>> This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. >>> >>> Here's a paragraph from page 22: >>> >> Actually page 21. 
>> >> >>> >>> >>> I (re-)transliterated it, and it reads: >>> >> Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: >> >> >>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >>> >> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : > > This agrees rather well with Beth's retranslation. > Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these pronunciations are spelled in German (with a few outliers like "izt" for "ist", where the "s" isn't voiced in German). There's also a clear convention of using "kh" for "ch" (as in English "loch" but also for other pronunciation of the German "ch"). The one apparent mismatch is "ge- gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can stand for phonetic "p". "Parthey" might be how Germans could have written "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" alternated with "ei", as in my last name). So, perhaps it's closer than it looks, superficially. > From context, "Reue" is by far the best match for "Reye" and seems to match a tendency elsewhere in the sample where the transliteration, if pronounced as German, would result in a shifted quality for the vowels (making them sound more Yiddish, for lack of a better description). > > "absch?ttelen" - here the second "e" would not be part of Standard German orthography. It's either an artifact of the transcription system or possibly reflects that the writer is familiar with a different spelling convention (to my eyes the spelling "abshittelen" looks somehow more Yiddish, but I'm really not familiar enough with that language). > > But still, the text is unquestionably intended to be in German. One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. The German translation it gives: Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: And in English: Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: >From the original Hebrew script, in case someone wants to try out more possibilities: ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? 
: From unicode at unicode.org Sun Nov 11 17:00:06 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Sun, 11 Nov 2018 15:00:06 -0800 Subject: Aleph-umlaut In-Reply-To: <0532C015-564D-4451-9101-44F75DA535E8@telia.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: On 11/11/2018 1:37 PM, Hans ?berg wrote: >> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: >> >> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>>> On 11 Nov 2018, at 07:03, Beth Myre via Unicode >>>> wrote: >>>> >>>> Hi Mark, >>>> >>>> This is a really cool find, and it's interesting that you might have a relative mentioned in it. After looking at it more, I'm more convinced that it's German written in Hebrew letters, not Yiddish. I think that explains the umlauts. Since the text is about Jewish subjects, it also includes Hebrew words like you mentioned, just like we would include beit din or p'sak in an English text. >>>> >>>> Here's a paragraph from page 22: >>>> >>> Actually page 21. >>> >>> >>>> >>>> >>>> I (re-)transliterated it, and it reads: >>>> >>> Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: >>> >>> >>>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >>>> >>> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : >> This agrees rather well with Beth's retranslation. >> Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these pronunciations are spelled in German (with a few outliers like "izt" for "ist", where the "s" isn't voiced in German). There's also a clear convention of using "kh" for "ch" (as in English "loch" but also for other pronunciation of the German "ch"). The one apparent mismatch is "ge- gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can stand for phonetic "p". "Parthey" might be how Germans could have written "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" alternated with "ei", as in my last name). So, perhaps it's closer than it looks, superficially. >> From context, "Reue" is by far the best match for "Reye" and seems to match a tendency elsewhere in the sample where the transliteration, if pronounced as German, would result in a shifted quality for the vowels (making them sound more Yiddish, for lack of a better description). >> >> "absch?ttelen" - here the second "e" would not be part of Standard German orthography. It's either an artifact of the transcription system or possibly reflects that the writer is familiar with a different spelling convention (to my eyes the spelling "abshittelen" looks somehow more Yiddish, but I'm really not familiar enough with that language). >> >> But still, the text is unquestionably intended to be in German. > One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. 
The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. Well, the OCR does a much better job than the "translation". > The German translation it gives: > Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: This is simply utter nonsense and does not even begin to correlate with the transliteration. > And in English: > Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: In fact, the English translation makes somewhat more sense. For example, "Gegenpartei" in many legal contexts (which this sample isn't, by the way) can in fact be translated as "injured party", which in turn might correlate with an "injured side" as rendered. However "Seite der Verletzten" makes no sense in this context, unless there's a Hebrew word that accidentally matches and got picked up. (I'm suspicious that some of the auto translation does in fact work like many real translations which often are not direct, but involve an intermediate language - simply because it's not possible to find sufficient translators between random pairs of languages.). > > From the original Hebrew script, in case someone wants to try out more possibilities: > ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : > > I don't know what that will tell you. You have a rendering that produces coherent text which closely matches a phonetic transliteration. What else do you hope to learn? A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Nov 11 17:55:10 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 12 Nov 2018 00:55:10 +0100 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: <93724E33-20B4-4725-938F-EF6494CFF901@telia.com> > On 12 Nov 2018, at 00:00, Asmus Freytag (c) wrote: > > On 11/11/2018 1:37 PM, Hans ?berg wrote: >>> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode >>> wrote: >>> >>> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>> >> One should not rely too much these autotranslation tools, but it may be quicker using some OCR program and then correct by hand, than entering it all by hand. The setup did not admit transliterating Hebrew script directly into German. It seems that the translator program recognizes it as Yiddish, though it might be as a result of an assumption it makes. > > Well, the OCR does a much better job than the "translation". Not so surprising, but it did not have a literal OCR. An OCR can improve transliteration by guessing the language to fill in partial recognition, so there is a fallacy already there. 
>> The German translation it gives: >> Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: > > This is simply utter nonsense and does not even begin to correlate with the transliteration. > >> And in English: >> Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: > > In fact, the English translation makes somewhat more sense. For example, "Gegenpartei" in many legal contexts (which this sample isn't, by the way) can in fact be translated as "injured party", which in turn might correlate with an "injured side" as rendered. However "Seite der Verletzten" makes no sense in this context, unless there's a Hebrew word that accidentally matches and got picked up. > (I'm suspicious that some of the auto translation does in fact work like many real translations which often are not direct, but involve an intermediate language - simply because it's not possible to find sufficient translators between random pairs of languages.). Google translation uses AI by comparing texts in both languages, the Rosetta stone method. Therefore, there is a poor result for languages where there are less available texts to compare with. Sometimes it can be better than dictionaries if it concerns more modern terms. But in other cases, it may just be gibberish. >> From the original Hebrew script, in case someone wants to try out more possibilities: >> ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : >> > I don't know what that will tell you. You have a rendering that produces coherent text which closely matches a phonetic transliteration. What else do you hope to learn? It is up to whoever likes to try (FYI). From unicode at unicode.org Sun Nov 11 18:12:27 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:12:27 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <44cf1a26-e53b-564f-1ad8-8aaa50bc8f03@kli.org> On 11/11/18 3:32 PM, Hans ?berg via Unicode wrote: > Taking a picture in the Google Translate app, and then pasting the Hebrew character string it identifies into translate.google.com for Yiddish gives the text: > >> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: > vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : Yeah, you have to be careful of auto-transliterating, if that's what you're using for this transliteration.? The third word is definitely not "auns"; the alef at the beginning is a "shtumer-alef", a *silent* letter used in Yiddish a little like a mater lectionis, now that I think about it: it's a nominal (but void) consonant used as a place-holder to hold the vowel? 
(Hebrew allows words to start with a vocalic vav, only when it's used as a conjunction, but Yiddish does not, generally.? Nor a vocalic yod or double-yod or vav-yod diphthong.)? Interesting that you have "*zya dya" there (those are silent as well; the words are just "zi di"); it looks like elsewhere in the document they spell it with a more precise transliteration, strictly using AYIN for "e", not ALEF as here. ~mark From unicode at unicode.org Sun Nov 11 18:20:08 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:20:08 -0500 Subject: Aleph-umlaut In-Reply-To: <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: On 11/11/18 4:16 PM, Asmus Freytag via Unicode wrote: > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >> >>> Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass: >> vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , dass : > > > This agrees rather well with Beth's retranslation. > > Mapping "z" to "s",? "f" to "v" and "v" to "w" would match the way > these pronunciations are spelled in German (with a few outliers like > "izt" for "ist", where the "s" isn't voiced in German). There's also a > clear convention of using "kh" for "ch" (as in English "loch" but also > for other pronunciation of the German "ch"). The one apparent mismatch > is "ge- gefarthey" for "Gegenpartei". Presumably what is > transliterated as "f" can stand for phonetic "p". "Parthey" might be > how Germans could have written "Partei" in earlier centuries (when > "th" was commonly used for "t" and "ey" alternated with "ei", as in my > last name).? So, perhaps it's closer than it looks, superficially. > I think that really IS a "p"; elsewhere in the document they seem to be quite careful to put a RAFE on top of the PEH when it means "f", and not using a DAGESH to mark "p".? There definitely does seem to be usage of TET-HEH for "th"; in the Hebrew text at the beginning it talks about the ?????? community?took me a bit to work out that was an abbreviation for "Orthodox". > From context, "Reue" is by far the best match for "Reye" and seems to > match a tendency elsewhere in the sample where the transliteration, if > pronounced as German, would result in a shifted quality for the vowels > (making them sound more Yiddish, for lack of a better description). > That word is hard to read in the original, hence the "?" in the transliteration.? It isn't clear if it's YOD YOD or YOD VAV and the VAV is missing its body (the head looks different than it should if it were a YOD).? Which would match your "Reue" fairly well?except that they generally use AYIN for "e", not "YOD". > > "absch?ttelen" - here the second "e" would not be part of Standard > German orthography. It's either an artifact of the transcription > system or possibly reflects that the writer is familiar with a > different spelling convention (to my eyes the spelling "abshittelen" > looks somehow more Yiddish, but I'm really not familiar enough with > that language). > The ? 
is, of course, not in the text in the original; it's just "i".? German ? wound up as "i" in Yiddish, in most cases. ~mark From unicode at unicode.org Sun Nov 11 18:24:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:24:11 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <0532C015-564D-4451-9101-44F75DA535E8@telia.com> Message-ID: <7ab3c46d-dd95-2a75-9687-f29145c59b8f@kli.org> On 11/11/18 6:00 PM, Asmus Freytag (c) via Unicode wrote: > On 11/11/2018 1:37 PM, Hans ?berg wrote: >>> On 11 Nov 2018, at 22:16, Asmus Freytag via Unicode wrote: >>> >>> On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: >>> One should not rely too much these autotranslation tools, but it may >>> be quicker using some OCR program and then correct by hand, than >>> entering it all by hand. The setup did not admit transliterating >>> Hebrew script directly into German. It seems that the translator >>> program recognizes it as Yiddish, though it might be as a result of >>> an assumption it makes. > > > Well, the OCR does a much better job than the "translation". > Agreed: > > >> The German translation it gives: >> Unsere S?nde kommt von der Seite der Verletzten, nachdem sie darauf gewartet hat, erwartet zu werden, und nachdem sie die Vorstellungen dieser rabbinischen Andachten kennengelernt haben, haben sie begonnen, mit der Motivation zu schlie?en: > > > This is simply utter nonsense and does not even begin to correlate > with the transliteration. > Yeah, that looks like word salad even to me and my tiny knowledge of German.? The first words are definitely "Wir sind," for example. > >> And in English: >> Our sin is coming out of the side of the injured side, after waiting to be expected, and having the concepts of these rabbinical devotiones, they have begun to agree with the motivation: > > > In fact, the English translation makes somewhat more sense. For > example, "Gegenpartei" in many legal contexts (which this sample > isn't, by the way) can in fact be translated as "injured party", which > in turn might correlate with an "injured side" as rendered. However > "Seite der Verletzten" makes no sense in this context, unless there's > a Hebrew word that accidentally matches and got picked up. > The pamphlet seems to be referring to forming some sort of sub-community or group as a "gegenpartei," I think. The actual content of the work is not a deep mystery, really. ~mark > (I'm suspicious that some of the auto translation does in fact work > like many real translations which often are not direct, but involve an > intermediate language - simply because it's not possible to find > sufficient translators between random pairs of languages.). > >> >From the original Hebrew script, in case someone wants to try out more possibilities: >> ???? ???? ???? ?????? ???????? ???? ????? ????? ??? ??? ????????? ?????? ???? , ??? ???????? ?? ????????? ???? ???? ???? ??? ??? ?????????????? ?????? ?????????? ???????? ????? ??? ?????????? ??????? ??? ??? ???????????? , ???? : >> >> > I don't know what that will tell you. You have a rendering that > produces coherent text which closely matches a phonetic > transliteration. What else do you hope to learn? > > A./ > From unicode at unicode.org Sun Nov 11 18:05:52 2018 From: unicode at unicode.org (Mark E. 
Shoulson via Unicode) Date: Sun, 11 Nov 2018 19:05:52 -0500 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> Message-ID: <8471fd97-447a-74c8-0bb0-5d04f206b90d@kli.org> On 11/11/18 1:03 AM, Beth Myre via Unicode wrote: > Hi Mark, > > This is a really cool find, and it's interesting that you might have a > relative mentioned in it.? After looking at it more, I'm more > convinced that it's German written in Hebrew letters, not Yiddish.? I > think that explains the umlauts. Since the text is about Jewish > subjects, it also includes Hebrew words like you mentioned, just like > we would include /beit din/ or /p'sak/ in an English text. Again, I'm not so sure there's really a difference.? Yiddish *IS* Judeo-German.? That's what it's called.? Do you prefer to think of it as German?? OK with me, but it's more a matter of taste than fact. > > Here's a paragraph from page 22: > > Paragraph.jpg > > I (re-)transliterated it, and it reads: > > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei?weder > Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen > dieser rabbinischen Gutachten von sich absch?ttelen werden mit der > Motivierung, dass: > Are you sure you're not embellishing a bit?? I note you have ?, and yet the text clearly says "abshitellen".? The ? sound did not survive into later Yiddish, usually becoming "i", and the ? sound apparently didn't either... but is still there at this particular time and place. > I only know a little Yiddish (one semester a long time ago), but I > think Yiddish word order would be very different.? Also, 'we are' > would be 'mir zaynen' instead of 'wir sind,' 'and' would be 'un' > instead of 'und,' etc. > Yiddish "and" is now spelled "un" (alef-vav-finalnun), but I have seen it spelled alef-vav-nun-geresh, indicating the elision of the final -d in older texts.? It would not surprise me at all if some dialects preserved the -d, in spelling anyway, longer than others.? "Mir zaynen" is definitely "normal" Yiddish so far as I know... but how far do I know? What is this argument over anyway?? "You claim that this animal is a mutt, but I tell you it is clearly a dog of mixed breed!" ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Paragraph.jpg Type: image/jpeg Size: 181354 bytes Desc: not available URL: From unicode at unicode.org Sun Nov 11 19:28:38 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sun, 11 Nov 2018 17:28:38 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> Message-ID: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Nov 11 22:28:01 2018 From: unicode at unicode.org (Beth Myre via Unicode) Date: Sun, 11 Nov 2018 23:28:01 -0500 Subject: Aleph-umlaut In-Reply-To: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: Hi All, I wanted to clarify how I got this: *Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, dass:* As a (non-native) German speaker who knows the Hebrew alphabet, I looked at the text, and then wrote the text contained in it using conventional German spelling. I spelled absch?tteln wrong. I didn't change the word order or vocabulary. The translation into English was also my own. The spelling of the word 'Reue' surprised me and one of the letters looked odd, so I put a question mark after it. I wasn't transliterating letter-for-letter, which wouldn't be possible because certain letters written next to each other produce specific sounds. For example, the Hebrew letters yud-yud make the German sound 'ei,' and the letters vav-vav make the German sound 'w.' The Hebrew alphabet just provides different material to work with than the Latin alphabet. Speaking of, it will soon be Chanukah/Hanukkah/Hanukah! :) The transliteration created by a computer program in one of the previous emails makes the text look more Yiddish-y than it is, probably because it was expecting Yiddish. It also made several clear errors. A few examples, some of which Mark mentioned: - The Hebrew letter aleph was always transliterated as 'a.' However, whenever it had the small vowel symbol that looks like a 'T' underneath it, it should have been an 'o.' And in several locations it's a 'shtumer aleph' (a.k.a. silent aleph) that's basically just carrying the letter used for 'u,' so it shouldn't be included at all. - It was inconsistent in how it transliterated the Hebrew letter 'yud,' sometimes making it an 'i' but more often a 'y.' The 'y' makes it look like Yiddish, but they're both valid. It's also used in other parts of the text for the German 'j.' - It skipped the 'n' in 'Gegenpartei,' although it's definitely present in the text. There's also an 'h' after the 't,' so the word is basically spelled "Gegenparthei." - It missed the difference between the 'f' sound and the 'p' sound, which is represented in the text by the presence or absence of a small line over the same Hebrew letter. Mark, you asked why I brought up the question of whether this is Yiddish or German. They're two separate but related languages, and I thought this text was really interesting because it turned out not to be what I was expecting. I'm not a scholar, and I didn't realize that anyone ever wrote in German using Hebrew letters. It's a struggle for me to understand Yiddish and my Hebrew is limited. Being able to understand entire paragraphs written in Hebrew letters is a rare treat for me. Beth On Sun, Nov 11, 2018 at 8:31 PM Asmus Freytag via Unicode < unicode at unicode.org> wrote: > On 11/11/2018 4:20 PM, Mark E. 
Shoulson via Unicode wrote: > > On 11/11/18 4:16 PM, Asmus Freytag via Unicode wrote: > > On 11/11/2018 12:32 PM, Hans ?berg via Unicode wrote: > > > Wir sind uns dessen bewusst, dass von Seite der Gegenpartei weder Reue(?), > noch Einsicht zu erwarten ist und dass sie die Konsequenzen dieser > rabbinischen Gutachten von sich absch?ttelen werden mit der Motivierung, > dass: > > vir zind auns dessen bevaust dass fon zeyte der ge- gefarthey veder reye , > nakh eynzikht tsu ervarten izt aund dast zya dya kansekventsen dyezer > rabbinishen gutakhten fon zikh abshittelen verden mit der motivirung , > dass : > > > > This agrees rather well with Beth's retranslation. > > Mapping "z" to "s", "f" to "v" and "v" to "w" would match the way these > pronunciations are spelled in German (with a few outliers like "izt" for > "ist", where the "s" isn't voiced in German). There's also a clear > convention of using "kh" for "ch" (as in English "loch" but also for other > pronunciation of the German "ch"). The one apparent mismatch is "ge- > gefarthey" for "Gegenpartei". Presumably what is transliterated as "f" can > stand for phonetic "p". "Parthey" might be how Germans could have written > "Partei" in earlier centuries (when "th" was commonly used for "t" and "ey" > alternated with "ei", as in my last name). So, perhaps it's closer than it > looks, superficially. > > I think that really IS a "p"; elsewhere in the document they seem to be > quite careful to put a RAFE on top of the PEH when it means "f", and not > using a DAGESH to mark "p". There definitely does seem to be usage of > TET-HEH for "th"; in the Hebrew text at the beginning it talks about the > ?????? community?took me a bit to work out that was an abbreviation for > "Orthodox". > > From context, "Reue" is by far the best match for "Reye" and seems to > match a tendency elsewhere in the sample where the transliteration, if > pronounced as German, would result in a shifted quality for the vowels > (making them sound more Yiddish, for lack of a better description). > > That word is hard to read in the original, hence the "?" in the > transliteration. It isn't clear if it's YOD YOD or YOD VAV and the VAV is > missing its body (the head looks different than it should if it were a > YOD). Which would match your "Reue" fairly well?except that they generally > use AYIN for "e", not "YOD". > > > "absch?ttelen" - here the second "e" would not be part of Standard German > orthography. It's either an artifact of the transcription system or > possibly reflects that the writer is familiar with a different spelling > convention (to my eyes the spelling "abshittelen" looks somehow more > Yiddish, but I'm really not familiar enough with that language). > > The ? is, of course, not in the text in the original; it's just "i". > German ? wound up as "i" in Yiddish, in most cases. > > > I agree with Beth that the text reads like a transcription of a standard > German text, not like a transcription of Yiddish, small infidelities in > vowel/consonant renderings notwithstanding. These are either because the > transcription conventions deliberately make some substitutions (presumably > there's no Hebrew letter that would directly match an "?", so they picked > "i") or because the writer, while trying to capture standard German in this > instance, is aware of a different orthography. The result, before Beth > tweaked it, would resemble a bit a phonetic transcription of someone > speaking standard German with a Yiddish accent. 
The fact that there are no > differences in grammar and the phrasing is absolutely natural for written > German is what confirms the identification as German, rather than Yiddish > text. > > Just because Yiddish is closely related to German doesn't mean that you > can simply write the former with standard German phonetics and have it > match a text in standard German to the point where there's no distinction. > I think the sample is long enough and involved enough to give quite decent > confidence in discriminating between these two Germanic languages. Grammar, > phrasing and word choice are in that sense much better indicators than pure > spelling; just as people trying to assume some foreign accent will give > themselves away by faithfully maintaining the underlying structure of the > language - that even works if the "accent" includes a few selected bits of > "foreign" word order or grammar. In those artificial examples, there's > rarely the kind of subtle mistake that a true non-native will make. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Nov 12 19:48:39 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 12 Nov 2018 20:48:39 -0500 Subject: Aleph-umlaut In-Reply-To: <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: You know, you're right (as is Beth), and I don't know why I'm arguing the point.? It's something I've been working on: I shouldn't defend a position JUST because it's _my_ position, and yet that's just what I did. So, yes, it certainly does seem essentially German.? I couldn't say why they chose to write this part in German, or why they chose to transcribe it in Hebrew letters, really.? I assumed Yiddish probably because of the context and the alphabet used, but there's no reason for it not to be German.? Now, the pamphlet originated from Kloizenberg, i.e. https://en.wikipedia.org/wiki/Cluj-Napoca which is in Romania, but German was probably enough of a lingua franca (after all, Yiddish developed from it for that reason).? And the text being basically German would explain the aleph-umlaut which was the start of all this, though it doesn't so much need an "explanation": it's interesting enough that it's _there_.? Also interesting that no other umlauted letters were considered distinct enough to be transcribed so (or else they just happened not to show up).? There are probably mildly interesting things (depending on your interests) to be gleaned from studying how the transliterations, how they seemed to use ? for word-final "e" in "die" in some places but ? in others, etc. Anyway, still interesting, I thought. ~mark On 11/11/18 8:28 PM, Asmus Freytag via Unicode wrote: > > I agree with Beth that the text reads like a transcription of a > standard German text, not like a transcription of Yiddish, small > infidelities in vowel/consonant renderings notwithstanding. These are > either because the transcription conventions deliberately make some > substitutions (presumably there's no Hebrew letter that would directly > match an "?", so they picked "i") or because the writer, while trying > to capture standard German in this instance, is aware of a different > orthography. 
The result, before Beth tweaked it, would resemble a bit > a phonetic transcription of someone speaking standard German with a > Yiddish accent. The fact that there are no differences in grammar and > the phrasing is absolutely natural for written German is what confirms > the identification as German, rather than Yiddish text. > > Just because Yiddish is closely related to German doesn't mean that > you can simply write the former with standard German phonetics and > have it match a text in standard German to the point where there's no > distinction. I think the sample is long enough and involved enough to > give quite decent confidence in discriminating between these two > Germanic languages. Grammar, phrasing and word choice are in that > sense much better indicators than pure spelling; just as people trying > to assume some foreign accent will give themselves away by faithfully > maintaining the underlying structure of the language - that even works > if the "accent" includes a few selected bits of "foreign" word order > or grammar. In those artificial examples, there's rarely the kind of > subtle mistake that a true non-native will make. > > A./ > From unicode at unicode.org Mon Nov 12 20:59:53 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 12 Nov 2018 18:59:53 -0800 Subject: Aleph-umlaut In-Reply-To: References: <20181110002554.0334d757@spixxi> <000c01d47888$aedea820$0c9bf860$@xencraft.com> <328f222d-aa42-7c1d-618b-9440a8d6a0e0@gmail.com> <3ed45d57-d4ef-cb95-5511-4e757e11e734@ix.netcom.com> <7f48207a-f47d-a005-3b8e-7b46f5afbe78@ix.netcom.com> Message-ID: <93370887-36fd-db7e-fa94-aced566a8fa2@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Nov 20 14:57:57 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 20 Nov 2018 20:57:57 +0000 (GMT) Subject: The encoding of the Welsh flag Message-ID: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> In Unicode? Technical Standard #51 Unicode Emoji there is the encoding for the Welsh flag. This is in the section http://www.unicode.org/reports/tr51/#Sample_Valid_Emoji_Tag_Sequences In the Status section near the start of the document is the following. quote A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. end quote My questions are as follows please. Is that encoding for the Welsh flag included in both The Unicode Standard and ISO/IEC 10646 or is it only encoded in The Unicode Standard or is it in neither The Unicode Standard nor ISO/IEC 10646? Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? William Overington Tuesday 20 November 2018 From unicode at unicode.org Tue Nov 20 15:50:25 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 20 Nov 2018 13:50:25 -0800 Subject: The encoding of the Welsh flag In-Reply-To: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> Message-ID: <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> On 11/20/2018 12:57 PM, William_J_G Overington via Unicode wrote: > quote > > A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. > > end quote > > My questions are as follows please. 
> > Is that encoding for the Welsh flag included > > in both The Unicode Standard and ISO/IEC 10646 > > or is it only encoded in The Unicode Standard > > or is it in neither The Unicode Standard nor ISO/IEC 10646? Neither. A flag emoji is represented via a character sequence -- in this particular case by an emoji tag sequence, as specified in UTS #51. The representation of flag emoji via emoji tag sequences is *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. If you find that hard to understand, consider another example. The spelling of the word "emoji" as the sequence of Unicode characters <0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. Neither standard specifies English spelling rules; nor does either standard specify flag emoji "spelling rules". > > Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? One declares conformance to UTS #51 and declares the version of emoji that one's application supports -- including the RGI (recommended for general interchange) list of emoji one has input and display support for. If the declaration states support for the flags of England, Scotland, and Wales, then one must do so via the specified emoji tag sequences. Your interoperability derives from that. --Ken From unicode at unicode.org Wed Nov 21 10:00:36 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 21 Nov 2018 16:00:36 +0000 (GMT) Subject: The encoding of the Welsh flag In-Reply-To: <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> Message-ID: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Ken Whistler wrote as follows. > A flag emoji is represented via a character sequence -- in this particular case by an emoji tag sequence, as specified in UTS #51. > The representation of flag emoji via emoji tag sequences is *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. > If you find that hard to understand, consider another example. The spelling of the word "emoji" as the sequence of Unicode characters <0065, 006D, 006F, 006A, 0069> is also *OUT OF SCOPE* for both the Unicode Standard and for ISO/IEC 10646. Neither standard specifies English spelling rules; nor does either standard specify flag emoji "spelling rules". It seems to me that the two examples are fundamentally different each from the other. The word emoji can be looked up in a dictionary and there one can find the sequence of glyphs that one needs to express that particular word. https://en.oxforddictionaries.com/definition/emoji If one then wishes to find the encoding of those glyphs, such that that particular word can become encoded as text characters in a message in an electronic system in an interoperable format, one can look in either The Unicode Standard or The ISO/IEC 10646 Standard and find code numbers. As the two standards are in synchronization one may, as I understand it, look in either. The Welsh flag can be looked up in a list of flags and the desired glyph can be found. 
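The sequence in question is defined by UTS #51 rather than by either of the two standards: the sample valid emoji tag sequence for the flag of Wales is U+1F3F4 WAVING BLACK FLAG followed by tag characters spelling the region code gbwls and a terminating CANCEL TAG. A minimal sketch in Python 3, assuming only that one wants to see the code points and the UTF-8 bytes (the variable name is invented for illustration):

    # Emoji tag sequence for the flag of Wales, as listed among the sample
    # valid emoji tag sequences in UTS #51.
    WALES_FLAG = (
        "\U0001F3F4"   # U+1F3F4 WAVING BLACK FLAG (the tag base)
        "\U000E0067"   # U+E0067 TAG LATIN SMALL LETTER G
        "\U000E0062"   # U+E0062 TAG LATIN SMALL LETTER B
        "\U000E0077"   # U+E0077 TAG LATIN SMALL LETTER W
        "\U000E006C"   # U+E006C TAG LATIN SMALL LETTER L
        "\U000E0073"   # U+E0073 TAG LATIN SMALL LETTER S
        "\U000E007F"   # U+E007F CANCEL TAG (terminates the tag sequence)
    )
    # The sequence is ordinary Unicode text: every code point is an encoded
    # character, so any conformant UTF-8 implementation will carry it whether
    # or not the receiving software renders it as a single flag glyph.
    print([hex(ord(c)) for c in WALES_FLAG])
    print(WALES_FLAG.encode("utf-8"))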
If one then wishes to find the encoding of that glyph, such that the glyph for that particular flag can become encoded as text characters in a message in an electronic system in an interoperable manner, then, as far as I am aware, that encoding cannot at this time be found in an International Standard. Also, whereas there are many languages there is only one collection of flags, as flags are intended to be mutually distinguishable from any other flag. WJGO >> Unless the answer is the first listed possibility, how does that work as regards interoperability of sending and receiving a Welsh flag on an electronic communication system? > One declares conformance to UTS #51 and declares the version of emoji that one's application supports -- including the RGI (recommended for general interchange) list of emoji one has input and display support for. If the declaration states support for the flags of England, Scotland, and Wales, then one must do so via the specified emoji tag sequences. Your interoperability derives from that. Yet the interoperability does not derive from an International Standard. Widening the discussion somewhat, are the encodings that are formed for glyphs, such as for Astronaut, that are not using tag characters yet are using a sequence of characters including one or more ZWJ characters, listed in both The Unicode Standard and The ISO/IEC 10646 Standard? It seems to me that tag sequences offer great possibilities for encoding, in effect a vast additional encoding space, yet for those encodings to be able to be used interoperably I opine they need to be listed in an International Standard, the International Standard in which they are listed may, but need not, be The ISO/IEC 10646 Standard. William Overington Wednesday 21 November 2018 From unicode at unicode.org Wed Nov 21 10:31:32 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 21 Nov 2018 08:31:32 -0800 Subject: The encoding of the Welsh flag In-Reply-To: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: On 11/21/2018 8:00 AM, William_J_G Overington via Unicode wrote: > Yet the interoperability does not derive from an International Standard. The interoperability that enabled your mail to be delivered to me derives in part from the MIME standard (RFC 2045 et seq.) which is not an International Standard, but is instead maintained by the Networking Working Group of IETF. The interoperability that enabled me to read the content of your mail derives from the HTML standard, which is not an International Standard, but is instead maintained by the W3C (a consortium). The interoperability of any flag emoji embedded in that content derives from Unicode Technical Standard #51, which is not an International Standard, but is instead maintained by the Unicode Consortium. These standards are all widely used *internationally*, but they are not an International Standard, which is effectively a moniker claimed by ISO for itself and its standards. But in this day and age, expecting all technology, including technology related to computational processing, distribution, interchange, and rendering of text, to wait around for any related standard to be canonized as an International Standard is just silly. The world of technology does not work that way, and frankly, folks should be damn glad that it doesn't.
--Ken From unicode at unicode.org Wed Nov 21 11:38:42 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Wed, 21 Nov 2018 17:38:42 +0000 Subject: The encoding of the Welsh flag In-Reply-To: <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: What really annoys me about this is that there is no flag for Northern Ireland. The folks at CLDR did not think to ask either the UK or the Irish representatives to SC2 about this. Yes, there is no "official flag" for Northern Ireland. But there is one _universally_ used in sport, and that should have been made into an emoji at the same time when flags for Scotland, Wales, and England were made. And it still should. Michael Everson From unicode at unicode.org Wed Nov 21 12:00:56 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 21 Nov 2018 10:00:56 -0800 Subject: The encoding of the Welsh flag In-Reply-To: References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> Message-ID: <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Michael, On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote: > What really annoys me about this is that there is no flag for Northern Ireland. The folks at CLDR did not think to ask either the UK or the Irish representatives to SC2 about this. Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non sequitur. If you or Andrew West or anyone else is interested in pursuing an emoji tag sequence for an emoji flag for Northern Ireland, then that should be done by submitting a proposal, with justification, to the Emoji Subcommittee, which *does* have jurisdiction. https://unicode.org/emoji/proposals.html See in particular, Section M of the selection criteria. --Ken From unicode at unicode.org Wed Nov 21 12:50:59 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 21 Nov 2018 19:50:59 +0100 Subject: The encoding of the Welsh flag In-Reply-To: <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Message-ID: We have gotten requests for this, but the stumbling block is the lack of an official N. Ireland document describing what the official flag is and should look like. "However, whilst England (St George's Cross) Scotland (St Andrew's Cross) and Wales (The Dragon) have individual regional flags, the Flags Institute in London confirms that Northern Ireland has no official regional flag." https://www.newsletter.co.uk/news/new-northern-ireland-flag-should-be-created-says-lord-kilclooney-1-5753950 Should the N. Irish decide on a flag, I don't foresee any problem adding it. Mark On Wed, Nov 21, 2018 at 7:04 PM Ken Whistler via Unicode < unicode at unicode.org> wrote: > Michael, > > On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote: > > What really annoys me about this is that there is no flag for Northern > Ireland. The folks at CLDR did not think to ask either the UK or the Irish > representatives to SC2 about this.
> > Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non > sequitur. > > If you or Andrew West or anyone else is interested in pursuing an emoji > tag sequence for an emoji flag for Northern Ireland, then that should be > done by submitting a proposal, with justification, to the Emoji > Subcommittee, which *does* have jurisdiction. > > https://unicode.org/emoji/proposals.html > > See in particular, Section M of the selection criteria. > > --Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 04:12:16 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 12:12:16 +0200 Subject: Can NFKC turn valid UAX 31 identifiers into non-identifiers? In-Reply-To: References: Message-ID: On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ?? wrote: > > > That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use? > > One could parse for certain sequences, putting characters into a number of broad categories. Very approximately: > > junk ~= [[:cn:][:cs:][:co:]]+ > whitespace ~= [[:z:][:c:]-junk]+ > syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators > identifiers ~= [all-else]+ > > UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the focus there is on immutability. > > So an implementation could choose to follow that course, rather than the more narrowly defined identifiers in http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, one can conform to the Default Identifiers but declare a profile that expands the allowable characters. One could take a Swiftian approach, for example... Thank you and sorry about my slow reply. Why is excluding junk important? > On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode wrote: >> >> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote: >> > Considering that ruling out too much can be a problem later, but just >> > treating anything above ASCII as opaque hasn't caused trouble (that I >> > know of) for HTML other than compatibility issues with XML's stricter >> > stance, why should a programming language, if it opts to support >> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the >> > complexity of UAX #31 instead of allowing everything above ASCII in >> > identifiers? In other words, what problem does making a programming >> > language conform to UAX #31 solve? >> >> After refreshing my memory of XML history, I realize that mentioning >> XML does not helpfully illustrate my question despite the mention of >> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please >> ignore the XML part. >> >> Trying to rephrase my question more clearly: >> >> Let's assume that we are designing a computer-parseable syntax where >> tokens consisting of user-chosen characters can't occur next to each >> other and, instead, always have some syntax-reserved characters >> between them. That is, I'm talking about syntaxes that look like this >> (could be e.g. Java): >> >> ab.cd(); >> >> Here, ab and cd are tokens with user-chosen characters whereas space >> (the indent), period, parenthesis and the semicolon are >> syntax-reserved. We know that ab and cd are distinct tokens, because >> there is a period between them, and we know the opening parethesis >> ends the cd token. 
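The quoted broad-category sketch maps fairly directly onto General_Category values. A minimal sketch in Python, assuming the standard library's unicodedata module and an invented function name; it approximates the quoted categories and is not a conforming UAX #31 implementation:

    import unicodedata

    def broad_class(ch):
        # junk        ~ Cn (unassigned), Cs (surrogates), Co (private use)
        # whitespace  ~ remaining separators (Z*) and controls/format (C*)
        # syntax      ~ symbols (S*) and punctuation (P*)
        # identifiers ~ everything else
        cat = unicodedata.category(ch)
        if cat in ("Cn", "Cs", "Co"):
            return "junk"
        if cat[0] in ("Z", "C"):
            return "whitespace"
        if cat[0] in ("S", "P"):
            return "syntax"
        return "identifier"

    # The user-chosen tokens in "ab.cd();" come out as runs of "identifier"
    # code points separated by "syntax" code points.
    print([(ch, broad_class(ch)) for ch in "ab.cd();"])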
>> >> To illustrate what I'm explicitly _not_ talking about, I'm not talking >> about a syntax like this: >> >> ????? >> >> Here ?? and ?? are user-named variable names and ? is a user-named >> operator and the distinction between different kinds of user-named >> tokens has to be known somehow in order to be able to tell that there >> are three distinct tokens: ??, ?, and ??. >> >> My question is: >> >> When designing a syntax where tokens with the user-chosen characters >> can't occur next to each other without some syntax-reserved characters >> between them, what advantages are there from limiting the user-chosen >> characters according to UAX #31 as opposed to treating any character >> that is not a syntax-reserved character as a character that can occur >> in user-named tokens? >> >> I understand that taking the latter approach allows users to mint >> tokens that on some aesthetic measure don't make sense (e.g. minting >> tokens that consist of glyphless code points), but why is it important >> to prescribe that this is prohibited as opposed to just letting users >> choose not to mint tokens that are inconvenient for them to work with >> given the behavior that their plain text editor gives to various >> characters? That is, why is conforming to UAX #31 worth the risk of >> prohibiting the use of characters that some users might want to use? >> The introduction of XID after ID and the introduction of Extended >> Hashtag Identifiers after XID is indicative of over-restriction having >> been a problem. >> >> Limiting user-minted tokens to UAX #31 does not appear to be necessary >> for security purposes considering that HTML and CSS exist in a >> particularly adversarial environment and get away with taking the >> approach that any character that isn't a syntax-reserved character is >> collected as part of a user-minted identifier. (Informally, both treat >> non-ASCII characters the same as an ASCII underscore. HTML even treats >> non-whitespace, non-U+0000 ASCII controls that way.) >> >> -- >> Henri Sivonen >> hsivonen at hsivonen.fi >> https://hsivonen.fi/ >> > -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu Nov 22 04:27:31 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 12:27:31 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ?? wrote: > > * The Python 3.3 model mentions the disadvantages of memory usage >> cliffs but doesn't mention the associated perfomance cliffs. It would >> be good to also mention that when a string manipulation causes the >> storage to expand or contract, there's a performance impact that's not >> apparent from the nature of the operation if the programmer's >> intuition works on the assumption that the programmer is dealing with >> UTF-32. >> > > The focus was on immutable string models, but I didn't make that clear. > Added some text. > Thanks. > * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM >> text node storage in Gecko, (I believe but am not 100% sure) V8 and, >> optionally, HotSpot >> ( >> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A >> ). >> That is, text has UTF-16 semantics, but if the high half of every code >> unit in a string is zero, only the lower half is stored. 
This has >> properties analogous to the Python 3.3 model, except non-BMP doesn't >> expand to UTF-32 but uses UTF-16 surrogate pairs. >> > > Thanks, will add. > V8 source code shows it has a OneByteString storage option: https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494 . From hearsay, I'm convinced that it means Latin1, but I've failed to find a clear quotable statement from a V8 developer to that affect. > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers >> have a different type in the type system than byte buffers. To go from >> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data >> has been tagged as valid UTF-8, the validity is trusted completely so >> that iteration by code point does not have "else" branches for >> malformed sequences. If data that the type system indicates to be >> valid UTF-8 wasn't actually valid, it would be nasal demon time. The >> language has a default "safe" side and an opt-in "unsafe" side. The >> unsafe side is for performing low-level operations in a way where the >> responsibility of upholding invariants is moved from the compiler to >> the programmer. It's impossible to violate the UTF-8 validity >> invariant using the safe part of the language. >> > > Added a quote based on this; please check if it is ok. > Looks accurate. Thanks. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 05:08:30 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 22 Nov 2018 13:08:30 +0200 Subject: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences Message-ID: Context: https://github.com/whatwg/encoding/issues/115 Unicode Security Considerations say: "3.6.2 Some Output For All Input Character encoding conversion must also not simply skip an illegal input byte sequence. Instead, it must stop with an error or substitute a replacement character (such as U+FFFD ( ? ) REPLACEMENT CHARACTER) or an escape sequence in the output. (See also Section 3.5 Deletion of Code Points.) It is important to do this not only for byte sequences that encode characters, but also for unrecognized or "empty" state-change sequences. For example: [...] ISO-2022 shift sequences without text characters before the next shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants require at least one character in a text segment between shift sequences. Security software written to the formal specification may not detect malicious text (for example, "delete" with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)." (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input) The WHATWG Encoding Standard bakes this requirement by the means of "ISO-2022-JP output flag" (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its ISO-2022-JP decoder algorithm (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder). encoding_rs (https://github.com/hsivonen/encoding_rs) implements the WHATWG spec. 
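To make the "shift sequence without text characters before the next shift sequence" case concrete, here is a small sketch of mine in Python; the bundled codec is used only to produce the bytes, and the example relies on the encoder returning to the ASCII state at the end of its output, as ISO-2022-JP requires:

    # Two independently valid ISO-2022-JP encoder outputs whose concatenation
    # contains one escape sequence immediately followed by another.
    first = "\u65e5\u672c".encode("iso2022_jp")   # ESC $ B ... ESC ( B
    second = "\u8a9e".encode("iso2022_jp")        # ESC $ B ... ESC ( B

    combined = first + second
    # The shift back to ASCII at the end of `first` is immediately followed
    # by the shift to JIS X 0208 at the start of `second`, i.e. an "empty"
    # segment between two escape sequences.
    assert b"\x1b(B\x1b$B" in combined

    # Whether a decoder maps that empty segment to U+FFFD (per the WHATWG
    # ISO-2022-JP output flag) or silently accepts it is the behavior under
    # discussion; Python's own codec is used here only as a byte generator.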
After Gecko switched to encoding_rs from an implementation that didn't implement this U+FFFD generation behavior (uconv), a bug has been logged in the context of decoding Japanese email in Thunderbird: https://bugzilla.mozilla.org/show_bug.cgi?id=1508136 Ken Lunde also recalls seeing such email: https://github.com/whatwg/encoding/issues/115#issuecomment-440661403 The root problem seems to be that the requirement gives ISO-2022-JP the unusual and surprising property that concatenating two ISO-2022-JP outputs from a conforming encoder can result in a byte sequence that is non-conforming as input to a ISO-2022-JP decoder. Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape sequence is immediately followed by another ISO-2022-JP escape sequence. Chrome and Safari do, but their implementations of ISO-2022-JP aren't independent of each other. Moreover, Chrome's decoder implementations generally are informed by the Encoding Standard (though the ISO-2022-JP decoder specifically might not be yet), and I suspect that Safari's implementation (ICU) is either informed by Unicode Security Considerations or vice versa. The example given as rationale in Unicode Security Considerations, obfuscating the ASCII string "delete", could be accomplished by alternating between the ASCII and Roman states to that every other character is in the ASCII state and the rest of the Roman state. Is the requirement to generate U+FFFD when there is no content between ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII transitions or useless transitions between ASCII and Roman are not also required to generate U+FFFD? Would it even be feasible (in terms of interop with legacy encoders) to make useless transitions between ASCII and Roman generate U+FFFD? -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Thu Nov 22 05:24:49 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 22 Nov 2018 12:24:49 +0100 Subject: Unicode String Models In-Reply-To: References: Message-ID: Thanks for the review! In case you're interested, I'd also welcome feedback on Locale Identifiers Mark On Thu, Nov 22, 2018 at 11:27 AM Henri Sivonen wrote: > On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ?? wrote: > >> >> * The Python 3.3 model mentions the disadvantages of memory usage >>> cliffs but doesn't mention the associated perfomance cliffs. It would >>> be good to also mention that when a string manipulation causes the >>> storage to expand or contract, there's a performance impact that's not >>> apparent from the nature of the operation if the programmer's >>> intuition works on the assumption that the programmer is dealing with >>> UTF-32. >>> >> >> The focus was on immutable string models, but I didn't make that clear. >> Added some text. >> > > Thanks. > > >> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM >>> text node storage in Gecko, (I believe but am not 100% sure) V8 and, >>> optionally, HotSpot >>> ( >>> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A >>> ). >>> That is, text has UTF-16 semantics, but if the high half of every code >>> unit in a string is zero, only the lower half is stored. This has >>> properties analogous to the Python 3.3 model, except non-BMP doesn't >>> expand to UTF-32 but uses UTF-16 surrogate pairs. >>> >> >> Thanks, will add. 
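A toy illustration of the width decision in that model (my own sketch in Python, not how SpiderMonkey, V8, or HotSpot actually implement it):

    def storage_width(s: str) -> str:
        # UTF-16/Latin1 model: if the high byte of every UTF-16 code unit is
        # zero, one byte per unit suffices; otherwise fall back to two bytes.
        units = s.encode("utf-16-le")
        high_bytes = units[1::2]
        return "one-byte (Latin-1 range)" if not any(high_bytes) else "two-byte (UTF-16)"

    print(storage_width("caf\u00e9"))         # one-byte: U+00E9 fits in the low byte
    print(storage_width("caf\u00e9 \u20ac"))  # two-byte: U+20AC needs the high byte
    print(storage_width("\U0001F600"))        # two-byte: non-BMP becomes a surrogate pair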
>> > > V8 source code shows it has a OneByteString storage option: > https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494 > . From hearsay, I'm convinced that it means Latin1, but I've failed to find > a clear quotable statement from a V8 developer to that affect. > > >> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers >>> have a different type in the type system than byte buffers. To go from >>> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data >>> has been tagged as valid UTF-8, the validity is trusted completely so >>> that iteration by code point does not have "else" branches for >>> malformed sequences. If data that the type system indicates to be >>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The >>> language has a default "safe" side and an opt-in "unsafe" side. The >>> unsafe side is for performing low-level operations in a way where the >>> responsibility of upholding invariants is moved from the compiler to >>> the programmer. It's impossible to violate the UTF-8 validity >>> invariant using the safe part of the language. >>> >> >> Added a quote based on this; please check if it is ok. >> > > Looks accurate. Thanks. > > -- > Henri Sivonen > hsivonen at hsivonen.fi > https://hsivonen.fi/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 09:27:09 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 22 Nov 2018 16:27:09 +0100 (CET) Subject: The encoding of the Welsh flag In-Reply-To: References: <10682341.58607.1542747477117.JavaMail.defaultUser@defaultHost> <20fc44d4-2eb1-2de5-1ce8-a601a3b71bc6@att.net> <26010173.33088.1542816036570.JavaMail.defaultUser@defaultHost> <5f740475-2283-6d8f-6474-a6a4976e45a3@att.net> Message-ID: <1201142793.143999.1542900429179@ox.hosteurope.de> Mark Davis ??: > > We have gotten requests for this, but the stumbling block is the lack of an > official N. Ireland document describing what the official flag is and > should look like. Such documents are lacking for several of the RIS flag emojis as well, though, e.g. for ???? from ISO 3166-1 code `UM` (United States Outlying Islands), resulting in unknown or duplicate flags, hence confusion. The solution there would have been to exclude codes for dependent territories becoming RGI emojis. ISO 3166 provides that property. The fundamental problem of flag emojis, however, is that the most requested ones are those that have no appropriate ISO code element, simply because the people requesting them need them for representing their strive for independence from another entity, or for supranational communities. From unicode at unicode.org Thu Nov 22 03:23:11 2018 From: unicode at unicode.org (- - via Unicode) Date: Thu, 22 Nov 2018 04:23:11 -0500 (EST) Subject: Compatibility Casefold Equivalence Message-ID: <1251703928.316122.1542878591512@email.ionos.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Nov 22 13:18:48 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 22 Nov 2018 12:18:48 -0700 Subject: The encoding of the Welsh flag Message-ID: <20BCF7D39DF643869175869F333C7DC1@DougEwell> Ken Whistler replied to Michael Everson: >> What really annoys me about this is that there is no flag for >> Northern Ireland. The folks at CLDR did not think to ask either the >> UK or the Irish representatives to SC2 about this. [...] 
> If you or Andrew West or anyone else is interested in pursuing an
> emoji tag sequence for an emoji flag for Northern Ireland, then that
> should be done by submitting a proposal, with justification, to the
> Emoji Subcommittee, which *does* have jurisdiction.

There is, of course, an encoding for the flag of Northern Ireland:

1F3F4 E0067 E0062 E006E E0069 E0072 E007F

where the tag characters are "gbnir" followed by TAG CANCEL.

What I suspect Michael means is that this sequence is not RGI, or "recommended for general interchange," a status which applies for flag emoji only to England, Scotland, and Wales, and not to any of the thousands of other subdivisions worldwide. The terminology currently in UTS #51 is definitely an improvement over early drafts, which explicitly labeled such sequences "not recommended," but it still leads practically everyone, evidently including Michael, to believe the sequences are invalid or non-existent.

I would certainly like to use the flag of Colorado, whose visual appearance is very much standardized, but the vicious circle of vendor support and UTS #51 categorization means no system will offer glyph support, and some systems may even reject it as invalid.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Thu Nov 22 13:29:29 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 22 Nov 2018 12:29:29 -0700
Subject: The encoding of the Welsh flag
Message-ID: 

Christoph Päper wrote:

>> We have gotten requests for this, but the stumbling block is the lack
>> of an official N. Ireland document describing what the official flag
>> is and should look like.
>
> Such documents are lacking for several of the RIS flag emojis as well,
> though, e.g. for 🇺🇲 from ISO 3166-1 code `UM` (United States Outlying
> Islands), resulting in unknown or duplicate flags, hence confusion.
> The solution there would have been to exclude codes for dependent
> territories becoming RGI emojis. ISO 3166 provides that property.

That's neither the problem nor the solution, IMHO. Even for RIS sequences, you have no guarantee of exactly how the flag will be depicted. For flags that have been recently changed, you might get the old or the new. For UM, you might get the US flag or one of the unofficially adopted flags. For Northern Ireland (if it were RGI-blessed), you might get either the Ulster Banner or St. Patrick's Saltire.

This situation is described, and explicitly so for the UM flags, in Annex B of UTS #51 under "Caveats."

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Thu Nov 22 13:58:51 2018
From: unicode at unicode.org (Carl via Unicode)
Date: Thu, 22 Nov 2018 14:58:51 -0500 (EST)
Subject: Compatibility Casefold Equivalence
In-Reply-To: <1251703928.316122.1542878591512@email.ionos.com>
References: <1251703928.316122.1542878591512@email.ionos.com>
Message-ID: <1626926067.211518.1542916731686@email.ionos.com>

(It looks like my HTML email got scrubbed, sorry for the double post)

Hi,

In Chapter 3 Section 13, the Unicode spec defines D146:

"A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"

I am trying to understand the "if and only if" part of this. Specifically, why is the outermost NFKD necessary? Could it also be an NFKC normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?

My use case is that I am trying to store user-provided tags in a database.
I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146. However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).

Thanks,
Carl

From unicode at unicode.org Sat Nov 24 16:33:15 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sat, 24 Nov 2018 14:33:15 -0800
Subject: Compatibility Casefold Equivalence
In-Reply-To: <1626926067.211518.1542916731686@email.ionos.com>
References: <1251703928.316122.1542878591512@email.ionos.com> <1626926067.211518.1542916731686@email.ionos.com>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue Nov 27 01:46:06 2018
From: unicode at unicode.org (Carl via Unicode)
Date: Tue, 27 Nov 2018 02:46:06 -0500 (EST)
Subject: Compatibility Casefold Equivalence
In-Reply-To: 
References: <1251703928.316122.1542878591512@email.ionos.com> <1626926067.211518.1542916731686@email.ionos.com>
Message-ID: <2125255467.459496.1543304766546@email.ionos.com>

Thanks for the reply. Responses inline:

> On November 24, 2018 at 5:33 PM Asmus Freytag via Unicode wrote:
>
> On 11/22/2018 11:58 AM, Carl via Unicode wrote:
> > (It looks like my HTML email got scrubbed, sorry for the double post)
> >
> > Hi,
> >
> > In Chapter 3 Section 13, the Unicode spec defines D146:
> >
> > "A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"
> >
> > I am trying to understand the "if and only if" part of this. Specifically, why is the outermost NFKD necessary? Could it also be an NFKC normalization? Is wrapping the outer NFKD in an NFC or NFKC on both sides of the equation okay?
> >
> > My use case is that I am trying to store user-provided tags in a database. I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146. However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).
>
> Carl,
>
> you may find that some of the complications are limited to a small number of code points. In particular, classical (polytonic) Greek has some gnarly behavior wrt case; and some compatibility characters have odd edge cases.

I suspected that the number of edge cases would be small, but I lack a way of enumerating them. (i.e. I don't know what I don't know)

> I'm personally not a fan of allowing every single Unicode code point in things like usernames (or other types of identifiers). Especially, if including some code points makes the "general case" that much more complex, my personal recommendation would be to simply disallow / reject a small set of troublesome characters; especially if they aren't part of some widespread modern orthography.
>
> While Unicode is about being able to digitally represent all written text, identifiers don't follow the same rules. The main reason why people often allow "anything" is because it's easy in terms of specification. Sometimes, you may not have control over what to accept; for example if tags are generated from headers in a document, it would require some transform to handle disallowed code points.
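One possible shape for such a transform, as a sketch of mine in Python; the deny-list below is purely hypothetical and would need to be chosen per application, and it substitutes rather than silently deletes so the result still signals that something was removed:

    import unicodedata

    # Hypothetical deny-list: unassigned, surrogate, private-use and control
    # code points.
    _DENIED_CATEGORIES = {"Cn", "Cs", "Co", "Cc"}

    def transform_tag(text: str, replacement: str = "\ufffd") -> str:
        # Substitute disallowed code points instead of dropping them.
        return "".join(
            ch if unicodedata.category(ch) not in _DENIED_CATEGORIES else replacement
            for ch in text
        )

    print(transform_tag("tag\x00name"))  # 'tag\ufffdname'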
> > The identifiers doc was what I had originally planned on using, but some of the rules there are too much. For example, IIUC variation selectors are not allowed (scrubbed?), which prevents use of some emoji sequences. Also, the ID_Start and XID_Start properties are too strict (since I'm not using this in a programming language or otherwise secure environment), as they forbid leading numbers. Hashtags are close to what I want, but again, they specify a leading "#". Really the problem for me is that I don't know what liberties I can take with restricting/allowing certain characters. Being too restrictive might be culturally insensitive, but being too lax might open the system for abuse. Would it be overkill to render the tag text to a picture, hash the picture, and store that instead? It seems like it would force visually identical strings to the same set of bytes. > Case is also only one of the types of duplication you may encounter. In many South and South East Asian scripts you may encounter cases where two sequences of characters, while different, will normally render identical. Arabic also has instances of that. Finally, you may ask yourself whether your system should treat simplified and traditional Chinese ideographs as separate or as a variant not unlike the way you treat case. > > Ideally I would like the same kind of matching as my browser does when I press Ctrl-F. If simplified and traditional Chinese match, that's probably good enough. > About storing your tag data: you can obviously store them as NFC, if you like: in that case, you will have to run the operations both on the stored and on the new tag. > > > Finally, there are some cases where you can tell that two string are identical without actually carrying out the full set of operations: > > > Y = X > > > NFC(Y) = NFC(X) > > > and so on. (If these conditions are true, the full condition above must also be true). For example, let's apply > > NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) > > > on both sides of > > > NFC(Y) = NFC(X) > > > First: > > > NFD(NFC(Y)) = NFD(NFC(X)) > > > Because the two sides are equal, applying toCaseFold results in equal strings, and so on all the way to the outer NFKD. As a minor followup, TR 15 section 7 says: "NFKC(NFKD(x)) == NFKC(x)" which implies that the outer NFKD can be replaced: NFKC(toCasefold(NFKD(toCasefold(NFD(X))))) > > > In other words, you can stop the comparison at any point where the two sides are equal. From that point on, the outer operations cannot add anything. That's a good point. In my case, since one side of the equation will be stored in a DB, I believe I need to do the full transform. That said, It would be useful for in-memory comparisons. > > > A./
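For what it's worth, here is a minimal sketch of the D146 key with the early-exit idea from above; Python's unicodedata.normalize and str.casefold are used as stand-ins for the normalization forms and toCasefold, and treating str.casefold as equivalent to toCasefold in every corner case is an assumption:

    import unicodedata

    def compat_caseless_key(s: str) -> str:
        # D146: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
        s = unicodedata.normalize("NFD", s)
        s = s.casefold()
        s = unicodedata.normalize("NFKD", s)
        s = s.casefold()
        return unicodedata.normalize("NFKD", s)
        # Per the follow-up above, wrapping this result in NFKC gives an
        # equally valid and more compact stored key.

    def compat_caseless_match(x: str, y: str) -> bool:
        # Cheap early exit: if the inputs (or their NFC forms) are already
        # equal, the full condition must also hold.
        if x == y or unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y):
            return True
        return compat_caseless_key(x) == compat_caseless_key(y)

    # compat_caseless_match("\ufb01ne", "FINE") -> True (U+FB01 ligature fi)
    # For deduplicating stored tags, one option is to store NFC(tag) for
    # display alongside compat_caseless_key(tag) as the uniqueness key.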