From unicode at unicode.org  Mon Oct  1 05:23:47 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 1 Oct 2018 11:23:47 +0100 (BST)
Subject: Teletext graphics characters
Message-ID: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost>

In the minutes of the recent meeting of the Unicode Technical Committee, document http://www.unicode.org/L2/L2018/18272.htm there is the following.

quote

E.2 Proposal to add characters from legacy computers and teletext to the UCS [Ewell, et al, L2/18-275R]

On phone: Doug Ewell.

Discussion. UTC took no action at this time.

end quote

Could someone possibly say please why the teletext graphics characters have still not been encoded as the change requested to their proposed encoding by the Unicode Technical Committee had been made and a revised document submitted before at least two of the previous meetings of the Unicode Technical Committee took place?

Do the teletext graphics characters need to be resubmitted in a proposal document on their own for them to become encoded?

As teletext is a great United Kingdom invention, does it need the United Kingdom National Body to propose their inclusion directly to the International Standards Organization?

William Overington

Monday 1 October 2018


From unicode at unicode.org  Mon Oct  1 10:49:51 2018
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Mon, 1 Oct 2018 08:49:51 -0700
Subject: Teletext graphics characters
In-Reply-To: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost>
References: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost>
Message-ID: <CAH=y87Y54s9PKjVxvpWRP=yWr75pMq-sHPD0yrjmJiAyofBw=g@mail.gmail.com>

There hasn't even been a response yet from the UTC members regarding the
evidence they requested for encoding FOUR-BY-FOUR CHECKER BOARD as a
distinct character from MEDIUM SHADE. They are most likely busy with other
Unicode business and/or their personal lives. These things take time. Be
patient.

-- Rebecca Bettencourt


On Mon, Oct 1, 2018 at 8:02 AM William_J_G Overington via Unicode <
unicode at unicode.org> wrote:

> In the minutes of the recent meeting of the Unicode Technical Committee,
> document http://www.unicode.org/L2/L2018/18272.htm there is the following.
>
> quote
>
> E.2 Proposal to add characters from legacy computers and teletext to the
> UCS [Ewell, et al, L2/18-275R]
>
> On phone: Doug Ewell.
>
> Discussion. UTC took no action at this time.
>
> end quote
>
> Could someone possibly say please why the teletext graphics characters
> have still not been encoded as the change requested to their proposed
> encoding by the Unicode Technical Committee had been made and a revised
> document submitted before at least two of the previous meetings of the
> Unicode Technical Committee took place?
>
> Do the teletext graphics characters need to be resubmitted in a proposal
> document on their own for them to become encoded?
>
> As teletext is a great United Kingdom invention, does it need the United
> Kingdom National Body to propose their inclusion directly to the
> International Standards Organization?
>
> William Overington
>
> Monday 1 October 2018
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181001/c0956d6c/attachment.html>

From unicode at unicode.org  Tue Oct  2 02:45:31 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 2 Oct 2018 16:45:31 +0900
Subject: Dealing with Georgian capitalization in programming languages
Message-ID: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>

Since the last discussion on Georgian (Mtavruli) on this mailing list, I 
have been looking into how to implement it in the Programming language Ruby.

Ruby has four case-conversion operations for its class String:

upcase:   convert all characters to upper case
downcase: convert all characters to lower case
swapcase: switch upper to lower and lower to upper case
capitalize:  uppercase (or title-case) the first character of the 
string, lowercase the rest

'upcase' and 'downcase' don't pose problems. 'swapcase' doesn't cause 
problems assuming the input doesn't have any problems. The only 
operation that can cause problems is 'capitalize'.

When I say "cause problems", I mean producing mixed-case output. I 
originally thought that 'capitalize' would be fine. It is fine for 
lowercase input: I stays lowercase because Unicode Data indicates that 
titlecase for lowercase Georgian letters is the letter itself. But it 
will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.

My questions here are:
- Has this been considered when Georgian Mtavruli was discussed in the
   UTC?
- How have any other implementers (ICU,...) addressed this, in
   particular the operation that's called 'capitalize' in Ruby?

Many thanks in advance for your input,

Regards,   Martin.

From unicode at unicode.org  Tue Oct  2 07:03:22 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:03:22 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
Message-ID: <CAJ2xs_FEv0G49AskBVMXJoT18V53tEoirHpvTUr0u4ntctakjA@mail.gmail.com>

Thanks to all for comments. Just revised the text in https://goo.gl/neguxb.

Mark


On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ?? <mark at macchiato.com> wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Mark
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/7783cae5/attachment.html>

From unicode at unicode.org  Tue Oct  2 07:03:25 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:03:25 +0200
Subject: Unicode String Models
In-Reply-To: <CAD2gp_T-f4tO_xHRCTQFR0BSnj6ZwhQuG5jLOPFte_iHtwnVSA@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAD2gp_T-f4tO_xHRCTQFR0BSnj6ZwhQuG5jLOPFte_iHtwnVSA@mail.gmail.com>
Message-ID: <CAJ2xs_ELXD+9G+e15xKD3fRUbekWbC3DU0SEGrhCCVcgFUqtZQ@mail.gmail.com>

Thanks, added a quote from you on that; see if it looks ok.

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan <cowan at ccil.org> wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ?? via Unicore <
> unicore at unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/ab1e21e1/attachment.html>

From unicode at unicode.org  Tue Oct  2 07:03:38 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:03:38 +0200
Subject: Unicode String Models
In-Reply-To: <20180909085929.2d4ff0d2@JRWUBU2>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
Message-ID: <CAJ2xs_EOhQNfqAoiMZ162KYEpQCu0rDbrbfWKa4HhyHTUjHtfw@mail.gmail.com>

Mark


On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ?? via Unicode <unicode at unicode.org> wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
>
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.
>

Thanks, added a quote from you on that; see if it looks ok.


> You keep muttering that a a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units.  Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?
>

I didn't neglect that, had a [TBD] for it.

While UTF16 invalid unpaired surrogates don't complicate processing much if
they are treated as unassigned characters, allowing UTF8 invalid sequences
are more troublesome. See, for example, the convolutions needed in ICU
methods that allow ill-formed UTF8.


> You seem to keep forgetting that grapheme clusters are not how some
> people people work.  Does the English word 'caf?' contain the letter
> 'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
> one might want to look for the letter '?' in Vietnamese and find it
> whatever the associated tone mark is.
>

I'm pretty familiar with the situation, thanks for asking.

Often you want to find out more about the components of grapheme clusters,
so you always need to be able to iterate through the code points it
contains. One might think that iterating by grapheme cluster is hiding
features of the text. For example, with *fox?* (fox\u{301}) it is easy to
find that the text contains an *x* by iterating through code points. But
code points often don't reveal their components: does the word
*tambi?n* contain
the letter *e*? A reasonable question, but iterating by code point rather
than grapheme cluster doesn't help, since it is typically encoded as a
single U+00E9. And even decomposing to NFD doesn't always help, as with
cases like *r?dgr?d*.

>
> You didn't discuss substrings.


I did. But if you mean a definition of substring that lets you access
internal components of substrings, I'm afraid that is quite a specialized
usage. One could do it, but it would burden down the general use case.

> I'm interested in how subsequences of
> strings are defined, as the concept of 'substring' isn't really Unicode
> compliant.  Again, expressing '?' as a subsequence of the Vietnamese
> word 'n?ng' ought to be possible, whether one is using NFD (easier) or
> NFC.  (And there are alternative normalisations that are compatible
> with canonical equivalence.)  I'm most interested in subsequences X of a
> word W where W is the same as AXB for some strings A and B.


> Richard.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/d7ad84ae/attachment.html>

From unicode at unicode.org  Tue Oct  2 07:03:48 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:03:48 +0200
Subject: Unicode String Models
In-Reply-To: <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
Message-ID: <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>

Mark


On Sun, Sep 9, 2018 at 3:42 PM Daniel B?nzli <daniel.buenzli at erratique.ch>
wrote:

> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
> should not really matter to the end user of the string data structure. The
> API defines how the Unicode text is accessed, expressed by what is the
> result of an indexing operation on the string. The latter is really what
> matters for the end-user and what I would call the "model".
>

Because of performance and storage consideration, you need to consider the
possible internal data structures when you are looking at something as
low-level as strings. But most of the 'model's in the document are only
really distinguished by API, only the "Code Point model" discussions are
segmented by internal storage, as with "Code Point Model: UTF-32"


> I think the presentation would benefit from making a clear distinction
> between the internal representation and the API; you could then easily
> summarize them in a table which would make a nice summary of the design
> space.
>

That's an interesting suggestion, I'll mull it over.

>
> I also think you are missing one API which is the one with ECG I would
> favour: indexing returns Unicode scalar values, internally be it whatever
> you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended
> by the "Code Point Model: Internal 8/16/32" but that's not what it says,
> the distinction between code point and scalar value is an important one and
> I think it would be good to insist on it to clarify the minds in such
> documents.
>

In reality, most APIs are not even going to be in terms of code points:
they will return int32's. So not only are they not scalar values,
99.97% are not even code points. Of course, values above 10FFFF or below 0
shouldn't ever be stored in strings, but in practice treating
non-scalar-value-code-points as "permanently unassigned" characters doesn't
really cause problems in processing.


> Best,
>
> Daniel
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/c2464744/attachment.html>

From unicode at unicode.org  Tue Oct  2 07:04:09 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:04:09 +0200
Subject: Unicode String Models
In-Reply-To: <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
Message-ID: <CAJ2xs_GrNHuZwOTEJzjc90wWLU9YquWvrn2FOHCaUVtO5w=OjA@mail.gmail.com>

Mark


On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode at unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ?? via Unicode
> <unicode at unicode.org> wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
>   1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.
>

I did mention this.


>   2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly-assigned codepoints from a
> newer version of Unicode.
>

Good point. I did mention the EGC definitions changing, but should point
out that if you have a string with unassigned characters in it, they may be
clustered on future versions. Will add.


>  * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated perfomance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.
>

The focus was on immutable string models, but I didn't make that clear.
Added some text.

>
>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (
> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
> ).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.
>

Thanks, will add.

>
>  * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros
> and gaining complexity and performance cliffs as a result is a strong
> indication that UTF-8 should be recommended for newly-designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
>
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
>

Added a quote based on this; please check if it is ok.

>
>  * After working with different string models, I'd recommend the Rust
> model for newly-designed programming languages. (Not because I work
> for Mozilla but because I believe Rust's way of dealing with Unicode
> is the best I've seen.) Rust's standard library provides Unicode
> version-independent iterations over strings: by code unit and by code
> point. Iteration by extended grapheme cluster is provided by a library
> that's easy to include due to the nature of Rust package management
> (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
> buffer as a read-only byte buffer has zero run-time cost and allows
> for maximally fast guaranteed-valid-UTF-8 output.
>
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/948c7821/attachment.html>

From unicode at unicode.org  Tue Oct  2 07:04:40 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:04:40 +0200
Subject: Unicode String Models
In-Reply-To: <868t4b3v80.fsf@mimuw.edu.pl>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <868t4b3v80.fsf@mimuw.edu.pl>
Message-ID: <CAJ2xs_FrR12GDP5FP3MOGa8h=+0XAyDuNEdGOO-CWJcn-WPBcQ@mail.gmail.com>

Whether or not it is well suited, that's probably water under the bridge at
this point. Think of it as a jargon at this point; after all, there are
lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly
a hit.

Mark


On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bie? <jsbien at mimuw.edu.pl> wrote:

> On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ?? via Unicode wrote:
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> It's a good opportunity to propose a better term for "extended grapheme
> cluster", which usually are neither extended nor clusters, it's also not
> obvious that they are always graphemes.
>
> Cf.the earlier threads
>
> https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
> https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html
>
> Best regards
>
> Janusz
>
> --
>              ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/7076e419/attachment.html>

From unicode at unicode.org  Tue Oct  2 13:31:02 2018
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Tue, 2 Oct 2018 20:31:02 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
 <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
Message-ID: <etPan.5bb3b966.16798715.63d4@erratique.ch>

On 2 October 2018 at 14:03:48, Mark Davis ?? via Unicode (unicode at unicode.org) wrote:

> Because of performance and storage consideration, you need to consider the
> possible internal data structures when you are looking at something as
> low-level as strings. But most of the 'model's in the document are only
> really distinguished by API, only the "Code Point model" discussions are
> segmented by internal storage, as with "Code Point Model: UTF-32"

I guess my gripe with the presentation of that document is that it perpetuates the problem of confusing "unicode characters" (or integers, or scalar values) and their *encoding* (how to represent these integers as byte sequences) which a source of endless confusion among programmers.?

This confusion is easy lifted once you explain that there exists certain integers, the scalar values, which are your actual characters and then you have different ways of encoding your characters; one can then explain that a surrogate is not a character per se, it's a hack and there's no point in indexing them except if you want trouble.

This may also suggest another taxonomy of classification for the APIs, those in which you work directly with the character data (the scalar values) and those in which you work with an encoding of the actual character data (e.g. a JavaScript string).

> In reality, most APIs are not even going to be in terms of code points:
> they will return int32's.?

That reality depends on your programming language. If the latter supports type abstraction you can define an abstract type for scalar values (whose implementation may simply be an integer). If you always go through the constructor to create these "integers" you can maintain the invariant that a value of this type is an integer in the ranges [0x0000;0xD7FF] and [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed your "character" data to other processes like UTF-X encoders: it guarantees the correctness of their outputs regardless of what the programmer does.

Best,?

Daniel


From unicode at unicode.org  Tue Oct  2 15:12:36 2018
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Tue, 2 Oct 2018 13:12:36 -0700
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
Message-ID: <CAN49p6qMGCqKEtknZL4dLKAHPvCFa+-ppynSZnk1uQDTMjF0OA@mail.gmail.com>

On Tue, Oct 2, 2018 at 12:50 AM Martin J. D?rst via Unicode <
unicode at unicode.org> wrote:

> ... The only
> operation that can cause problems is 'capitalize'.
>
> When I say "cause problems", I mean producing mixed-case output. I
> originally thought that 'capitalize' would be fine. It is fine for
> lowercase input: I stays lowercase because Unicode Data indicates that
> titlecase for lowercase Georgian letters is the letter itself. But it
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
>    UTC?
> - How have any other implementers (ICU,...) addressed this, in
>    particular the operation that's called 'capitalize' in Ruby?
>

By default, ICU toTitle() functions titlecase at word boundaries (with
adjustment) and lowercase all else.
That is, we implement Unicode chapter 3.13 Default Case Conversions R3
toTitlecase(x), except that we modified the default boundary adjustment.

You can customize the boundaries (e.g., only the start of the string).
We have options for whether and how to adjust the boundaries (e.g., adjust
to the next cased letter) and for copying, not lowercasing, the other
characters.
See C++ and Java class CaseMap and the relevant options.

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/dab3cefc/attachment.html>

From unicode at unicode.org  Tue Oct  2 16:07:56 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 2 Oct 2018 23:07:56 +0200
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <CAN49p6qMGCqKEtknZL4dLKAHPvCFa+-ppynSZnk1uQDTMjF0OA@mail.gmail.com>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
 <CAN49p6qMGCqKEtknZL4dLKAHPvCFa+-ppynSZnk1uQDTMjF0OA@mail.gmail.com>
Message-ID: <CAGa7JC2XWP79XBHDLefqJjfqHh_Rzo8_VY43xRgRGJPxiiX1xQ@mail.gmail.com>

I see no easy way to convert ALL UPPERCASE text with consistant casing as
there's no rule, except by using dictionnary lookups.
In reality data should be input using default casing (as in dictionnary
entries), independantly of their position in sentences, paragraphs or
titles, and the contextual conversion of some or all characters to
uppercase being done algorithmically (this is safe for conversion to ALL
UPPERCASE, and quite reliable for conversion to Tile Case, with just a few
dictionnary lookups for a small set of knows words per language.

Note that title casing works differently in English (which is most often
abusing by putting capitales on every word), while most other languages
capitalize only selected words, or just the first selected word in French
(in addition to the possible first letter of non-selected words such as
definite and indefinite articles at start of the sentence). Capitalization
of initials on every word is wrong in German which uses capitalisation even
more strictly than French or Italian: when in doubts, do not perform any
titlecasing, and allow data to provide the actual capitalization of titles
directly (it is OK and even recommanded in German to have section headings,
or even book titles, written as if they were in the middle of sentences,
and you capitalize only titles and headings that are full sentences
grammatically, but not simple nominal groups.

So title casing should not even be promoted by the UCD standard (where it
is in fact using only very basic, simplistic rules) and applicable only in
some applications for some languages and in specific technical or rendering
contexts.


Le mar. 2 oct. 2018 ? 22:21, Markus Scherer via Unicode <unicode at unicode.org>
a ?crit :

> On Tue, Oct 2, 2018 at 12:50 AM Martin J. D?rst via Unicode <
> unicode at unicode.org> wrote:
>
>> ... The only
>> operation that can cause problems is 'capitalize'.
>>
>> When I say "cause problems", I mean producing mixed-case output. I
>> originally thought that 'capitalize' would be fine. It is fine for
>> lowercase input: I stays lowercase because Unicode Data indicates that
>> titlecase for lowercase Georgian letters is the letter itself. But it
>> will produce the apparently undesirable Mixed Case for ALL UPPERCASE
>> input.
>>
>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>>    UTC?
>> - How have any other implementers (ICU,...) addressed this, in
>>    particular the operation that's called 'capitalize' in Ruby?
>>
>
> By default, ICU toTitle() functions titlecase at word boundaries (with
> adjustment) and lowercase all else.
> That is, we implement Unicode chapter 3.13 Default Case Conversions R3
> toTitlecase(x), except that we modified the default boundary adjustment.
>
> You can customize the boundaries (e.g., only the start of the string).
> We have options for whether and how to adjust the boundaries (e.g., adjust
> to the next cased letter) and for copying, not lowercasing, the other
> characters.
> See C++ and Java class CaseMap and the relevant options.
>
> markus
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181002/4b01d3be/attachment.html>

From unicode at unicode.org  Tue Oct  2 16:43:27 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 2 Oct 2018 14:43:27 -0700
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
Message-ID: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>


On 10/2/2018 12:45 AM, Martin J. D?rst via Unicode wrote:
> capitalize: uppercase (or title-case) the first character of the 
> string, lowercase the rest
>
>
> When I say "cause problems", I mean producing mixed-case output. I 
> originally thought that 'capitalize' would be fine. It is fine for 
> lowercase input: I stays lowercase because Unicode Data indicates that 
> titlecase for lowercase Georgian letters is the letter itself. But it 
> will produce the apparently undesirable Mixed Case for ALL UPPERCASE 
> input.
>
> My questions here are:
> - Has this been considered when Georgian Mtavruli was discussed in the
> ? UTC?
>
Not explicitly, that I recall. The whole issue of titlecasing came up 
very late in the preparation of case mapping tables for Mtavruli and 
Mkhedruli for 11.0.

But it seems to me that the problem you are citing can be avoided if you 
simply rethink what your "capitalize" means. It really should be 
conceived of as first lowercasing the *entire* string, and then 
titlecasing the *eligible* letters -- i.e., usually the first letter. 
(Note that this allows for the concept that titlecasing might then be 
localized on a per-writing-system basis -- the issue would devolve to 
determining what the rules are for "eligible" letters.) But the simple 
default would just be to titlecase the initial letter of each "word" 
segment of a string.

Note that conceived this way, for the Georgian mappings, where the 
titlecase mapping for Mkhedruli is simply the letter itself, this 
approach ends up with:

capitalize(mkhedrulistring) --> mkhedrulistring

capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
mkhedrulistring

Thus avoiding any mixed case.

--Ken


From unicode at unicode.org  Wed Oct  3 02:17:10 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 3 Oct 2018 09:17:10 +0200
Subject: Unicode String Models
In-Reply-To: <etPan.5bb3b966.16798715.63d4@erratique.ch>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
 <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
 <etPan.5bb3b966.16798715.63d4@erratique.ch>
Message-ID: <CAJ2xs_ETNEzVYG04+R2eev9Ygky2APUGX0s4Wk6=EQXu=FJKAg@mail.gmail.com>

Mark


On Tue, Oct 2, 2018 at 8:31 PM Daniel B?nzli <daniel.buenzli at erratique.ch>
wrote:

> On 2 October 2018 at 14:03:48, Mark Davis ?? via Unicode (
> unicode at unicode.org) wrote:
>
> > Because of performance and storage consideration, you need to consider
> the
> > possible internal data structures when you are looking at something as
> > low-level as strings. But most of the 'model's in the document are only
> > really distinguished by API, only the "Code Point model" discussions are
> > segmented by internal storage, as with "Code Point Model: UTF-32"
>
> I guess my gripe with the presentation of that document is that it
> perpetuates the problem of confusing "unicode characters" (or integers, or
> scalar values) and their *encoding* (how to represent these integers as
> byte sequences) which a source of endless confusion among programmers.
>
> This confusion is easy lifted once you explain that there exists certain
> integers, the scalar values, which are your actual characters and then you
> have different ways of encoding your characters; one can then explain that
> a surrogate is not a character per se, it's a hack and there's no point in
> indexing them except if you want trouble.
>
> This may also suggest another taxonomy of classification for the APIs,
> those in which you work directly with the character data (the scalar
> values) and those in which you work with an encoding of the actual
> character data (e.g. a JavaScript string).
>

Thanks for the feedback. It is worth adding a discussion of the issues,
perhaps something like:

A code-point-based API takes and returns int32's, although only a small
subset of the values are valid code points, namely 0x0..0x10FFFF. (In
practice some APIs may support returning -1 to signal an error or
termination, such as before or after the end of a string.) A surrogate code
point is one in U+D800..U+DFFF; these reflect a range of special code units
used in pairs in UTF-16 for representing code points above U+FFFF. A scalar
value is a code point that is not a surrogate.

A scalar-value API for immutable strings requires that no surrogate code
points are ever returned. In practice, the main advantage of that API is
that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked
surrogate code point is relatively harmless: Unicode properties are devised
so that clients can essentially treat them as (permanently) unassigned
characters. Warning: an iterator should *never* avoid returning surrogate
code points by skipping them; that can cause security problems; see
https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences
and
https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters.

There are two main choices for a scalar-value API:

   1. Guarantee that the storage never contains surrogates. This is the
   simplest model.
   2. Substitute U+FFFD for surrogates when the API returns code
   points. This can be done where #1 is not feasible, such as where the API is
   a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units
   that are not guaranteed to be UTF-16. The cost is extra tests on every code
   point access.


> > In reality, most APIs are not even going to be in terms of code points:
> > they will return int32's.
>
> That reality depends on your programming language. If the latter supports
> type abstraction you can define an abstract type for scalar values (whose
> implementation may simply be an integer). If you always go through the
> constructor to create these "integers" you can maintain the invariant that
> a value of this type is an integer in the ranges [0x0000;0xD7FF] and
> [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you
> feed your "character" data to other processes like UTF-X encoders: it
> guarantees the correctness of their outputs regardless of what the
> programmer does.
>

If the programming language provides for such a primitive datatype, that is
possible. That would mean at a minimum that casting/converting to that
datatype from other numerical datatypes would require bounds-checking and
throwing an exception for values outside of [0x0000..0xD7FF
0xE000..0x10FFFF]. Most common-use programming languages that I know of
don't support that for primitives; the API would have to use a class, which
would be so very painful for performance/storage. If you (or others) know
of languages that do have such a cheap primitive datatype, that would be
worth mentioning!


> Best,
>
> Daniel
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181003/478c2365/attachment.html>

From unicode at unicode.org  Wed Oct  3 08:01:15 2018
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Wed, 3 Oct 2018 15:01:15 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_ETNEzVYG04+R2eev9Ygky2APUGX0s4Wk6=EQXu=FJKAg@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
 <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
 <etPan.5bb3b966.16798715.63d4@erratique.ch>
 <CAJ2xs_ETNEzVYG04+R2eev9Ygky2APUGX0s4Wk6=EQXu=FJKAg@mail.gmail.com>
Message-ID: <etPan.5bb4bd9b.71998083.63d4@erratique.ch>

On 3 October 2018 at 09:17:10, Mark Davis ?? via Unicode (unicode at unicode.org) wrote:

> There are two main choices for a scalar-value API:
>  
> 1. Guarantee that the storage never contains surrogates. This is the
> simplest model.
> 2. Substitute U+FFFD for surrogates when the API returns code
> points. This can be done where #1 is not feasible, such as where the API is
> a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units
> that are not guaranteed to be UTF-16. The cost is extra tests on every code
> point access.

I'm not sure 2. really makes sense in pratice: it would mean you can't access scalar values?
which needs surrogates to be encoded.?

Also regarding 1. you can always defines an API that has this property regardless of the actual storage, it's only that your indexing operations might be costly as they do not directly map to the underlying storage array.

That being said I don't think direct indexing/iterating for Unicode text is such an interesting operation due of course to the normalization/segmentation issues. Basically if your API provides them I only see these indexes as useful ways to define substrings. APIs that identify/iterate boundaries (and thus substrings) are more interesting due to the nature of Unicode text.

> If the programming language provides for such a primitive datatype, that is
> possible. That would mean at a minimum that casting/converting to that
> datatype from other numerical datatypes would require bounds-checking and
> throwing an exception for values outside of [0x0000..0xD7FF
> 0xE000..0x10FFFF].?

Yes. But note that in practice if you are in 1. above you usually perform this only at the point of decoding where you are already performing a lot of other checks. Once done you no longer need to check anything as long as the operations you perform on the values preserve the invariant.?Also converting back to an integer if you need one is a no-op: it's the identity function.?

The OCaml Uchar module does this. This is the interface:?

??https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli

which defines the type t as abstract and here is the implementation:?

??https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml

which defines the implementation of type t = int which means values of this type are an *unboxed* OCaml integer (and will be stored as such in say an OCaml array). However since the module system enforces type abstraction the only way of creating such values is to use the constants or the constructors (e.g. of_int) which all maintain the scalar value invariant (if you disregard the unsafe_* functions).?

Note that it would perfectly be possible to adopt a similar approach in C via a typedef though given C's rather loose type system a little bit more discipline would be required from the programmer (always go through the constructor functions to create values of the type).

Best,?

Daniel


From unicode at unicode.org  Wed Oct  3 08:41:42 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Wed, 3 Oct 2018 15:41:42 +0200
Subject: Unicode String Models
In-Reply-To: <etPan.5bb4bd9b.71998083.63d4@erratique.ch>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
 <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
 <etPan.5bb3b966.16798715.63d4@erratique.ch>
 <CAJ2xs_ETNEzVYG04+R2eev9Ygky2APUGX0s4Wk6=EQXu=FJKAg@mail.gmail.com>
 <etPan.5bb4bd9b.71998083.63d4@erratique.ch>
Message-ID: <CAJ2xs_EX-E-JbPG-10rDq6m97suHceuGj=ifPdirenRWoOvxDg@mail.gmail.com>

Mark


On Wed, Oct 3, 2018 at 3:01 PM Daniel B?nzli <daniel.buenzli at erratique.ch>
wrote:

> On 3 October 2018 at 09:17:10, Mark Davis ?? via Unicode (
> unicode at unicode.org) wrote:
>
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code
> > points. This can be done where #1 is not feasible, such as where the API
> is
> > a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code
> units
> > that are not guaranteed to be UTF-16. The cost is extra tests on every
> code
> > point access.
>
> I'm not sure 2. really makes sense in pratice: it would mean you can't
> access scalar values
> which needs surrogates to be encoded.
>

Let me clear that up; I meant that "the underlying storage never contains
something that would need to be represented as a surrogate code point." Of
course, UTF-16 does need surrogate code units. What #1 would be excluding
in the case of UTF-16 would be unpaired surrogates. That is, suppose the
underlying storage is UTF-16 code units that don't satisfy #1.

0061 D83D DC7D 0061 D83D

A code point API would return for those a sequence of 4 values, the last of
which would be a surrogate code point.

00000061, 0001F47D, 00000061, 0000D83D

A scalar value API would return for those also 4 values, but since we
aren't in #1, it would need to remap.

00000061, 0001F47D, 00000061, 0000FFFD

>
> Also regarding 1. you can always defines an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.


> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.
>

I agree that iteration is a very common case. But quite often
implementations need to have at least opaque indexes (as discussed).

>
> > If the programming language provides for such a primitive datatype, that
> is
> > possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF
> > 0xE000..0x10FFFF].
>
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.
>

If it is a real datatype, with strong guarantees that it *never* contains
values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
from number will require checking. And in my experience, without a strong
guarantee the datatype is in practice pretty useless.


>
> The OCaml Uchar module does this. This is the interface:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli
>
> which defines the type t as abstract and here is the implementation:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml
>
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
>
> Note that it would perfectly be possible to adopt a similar approach in C
> via a typedef though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).


That's the C motto: "requiring a 'bit more' discipline from programmers"

>


> Best,
>
> Daniel
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181003/96efdec0/attachment.html>

From unicode at unicode.org  Wed Oct  3 09:15:55 2018
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Wed, 3 Oct 2018 16:15:55 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_EX-E-JbPG-10rDq6m97suHceuGj=ifPdirenRWoOvxDg@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <etPan.5b95233b.6c02cdbb.efff@erratique.ch>
 <CAJ2xs_HF6R+OG-819i5qMhE8bxW7jyAMBTh_8M_ymiZL=WsZ+A@mail.gmail.com>
 <etPan.5bb3b966.16798715.63d4@erratique.ch>
 <CAJ2xs_ETNEzVYG04+R2eev9Ygky2APUGX0s4Wk6=EQXu=FJKAg@mail.gmail.com>
 <etPan.5bb4bd9b.71998083.63d4@erratique.ch>
 <CAJ2xs_EX-E-JbPG-10rDq6m97suHceuGj=ifPdirenRWoOvxDg@mail.gmail.com>
Message-ID: <etPan.5bb4cf1b.3312f258.63d4@erratique.ch>

On 3 October 2018 at 15:41:42, Mark Davis ?? via Unicode (unicode at unicode.org) wrote:
?
> Let me clear that up; I meant that "the underlying storage never contains
> something that would need to be represented as a surrogate code point." Of
> course, UTF-16 does need surrogate code units. What #1 would be excluding
> in the case of UTF-16 would be unpaired surrogates. That is, suppose the
> underlying storage is UTF-16 code units that don't satisfy #1.
>  
> 0061 D83D DC7D 0061 D83D
>  
> A code point API would return for those a sequence of 4 values, the last of
> which would be a surrogate code point.
>  
> 00000061, 0001F47D, 00000061, 0000D83D
>  
> A scalar value API would return for those also 4 values, but since we
> aren't in #1, it would need to remap.
>  
> 00000061, 0001F47D, 00000061, 0000FFFD

Ok understood. But I think that if you go to the length of providing a scalar-value API you would also prevent the construction of strings that have such anomalities in the first place (e.g. by erroring in the constructor if you provide it with malformed UTF-X data), i.e. maintain 1. From a programmer's perspective I really don't get anything from 2. except confusion.

> If it is a real datatype, with strong guarantees that it *never* contains
> values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion
> from number will require checking. And in my experience, without a strong
> guarantee the datatype is in practice pretty useless.

Sure. My point was that the places where you perform this check are few in practice. Namely mainly at the IO boundary of your program where you actually need to deal with encodings and, additionally, whenever you define scalar value constants (a check that could actually be performed by your compiler if your language provides a literal notation for values of this type).

Best,?

Daniel


From unicode at unicode.org  Thu Oct  4 04:37:25 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Thu, 4 Oct 2018 18:37:25 +0900
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
 <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
Message-ID: <df3984b1-fbd3-5b33-737c-9723bf0c03a9@it.aoyama.ac.jp>

Ken, Markus,

Many thanks for your ideas, which I noted at
https://bugs.ruby-lang.org/issues/14839.

Regards,   Martin.

On 2018/10/03 06:43, Ken Whistler wrote:
> 
> On 10/2/2018 12:45 AM, Martin J. D?rst via Unicode wrote:

>> My questions here are:
>> - Has this been considered when Georgian Mtavruli was discussed in the
>> ? UTC?
>>
> Not explicitly, that I recall. The whole issue of titlecasing came up 
> very late in the preparation of case mapping tables for Mtavruli and 
> Mkhedruli for 11.0.
> 
> But it seems to me that the problem you are citing can be avoided if you 
> simply rethink what your "capitalize" means. It really should be 
> conceived of as first lowercasing the *entire* string, and then 
> titlecasing the *eligible* letters -- i.e., usually the first letter. 
> (Note that this allows for the concept that titlecasing might then be 
> localized on a per-writing-system basis -- the issue would devolve to 
> determining what the rules are for "eligible" letters.) But the simple 
> default would just be to titlecase the initial letter of each "word" 
> segment of a string.
> 
> Note that conceived this way, for the Georgian mappings, where the 
> titlecase mapping for Mkhedruli is simply the letter itself, this 
> approach ends up with:
> 
> capitalize(mkhedrulistring) --> mkhedrulistring
> 
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
> mkhedrulistring
> 
> Thus avoiding any mixed case.


From unicode at unicode.org  Thu Oct  4 10:40:16 2018
From: unicode at unicode.org (Rick McGowan via Unicode)
Date: Thu, 04 Oct 2018 08:40:16 -0700
Subject: Unicode CLDR 34 beta available for testing
Message-ID: <5BB63460.3040102@unicode.org>

The *beta* version of Unicode CLDR 34 
<http://cldr.unicode.org/index/downloads/cldr-34> is available for 
testing. The final release is expected on October 12.

CLDR 34 provides an update to the key building blocks for software 
supporting the world?s languages. This data is used by all major 
software systems <http://cldr.unicode.org/index#TOC-Who-uses-CLDR-> for 
their software internationalization and localization, adapting software 
to the conventions of different languages for such common software tasks.

CLDR 34 included a full Survey Tool data collection phase. Other 
enhancements include several changes to prepare for the new Japanese 
calendar era starting 2019-05-01; updated emoji names, annotations, 
collation and grouping; and other specific fixes. The draft release page 
at http://cldr.unicode.org/index/downloads/cldr-34 lists the major 
features, and has pointers to the newest data and charts. It will be 
fleshed out over the coming weeks with more details, migration issues, 
known problems, and so on. Particularly useful for review are:

    * Delta Charts <http://unicode.org/cldr/charts/34/delta/index.html>
      - the data that changed during the release
    * By-Type Charts
      <http://unicode.org/cldr/charts/34/by_type/index.html> - a
      side-by-side comparison of data from different locales
    * Annotation Charts
      <http://unicode.org/cldr/charts/34/annotations/index.html> - new
      emoji names and keywords

Please report any problems that you find using a CLDR ticket 
<http://unicode.org/cldr/trac/newticket>. We?d also appreciate it if 
programmatic users of CLDR data download the xml files and do a trial 
integration to see if any problems arise.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181004/01e05431/attachment.html>

From unicode at unicode.org  Tue Oct  9 02:47:14 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 9 Oct 2018 16:47:14 +0900
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
 <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
Message-ID: <b1b4667c-8f8d-0715-7b44-ff0f9903cb11@it.aoyama.ac.jp>

Hello Ken, others,

On 2018/10/03 06:43, Ken Whistler wrote:

> But it seems to me that the problem you are citing can be avoided if you 
> simply rethink what your "capitalize" means. It really should be 
> conceived of as first lowercasing the *entire* string, and then 
> titlecasing the *eligible* letters -- i.e., usually the first letter. 
> (Note that this allows for the concept that titlecasing might then be 
> localized on a per-writing-system basis -- the issue would devolve to 
> determining what the rules are for "eligible" letters.) But the simple 
> default would just be to titlecase the initial letter of each "word" 
> segment of a string.
> 
> Note that conceived this way, for the Georgian mappings, where the 
> titlecase mapping for Mkhedruli is simply the letter itself, this 
> approach ends up with:
> 
> capitalize(mkhedrulistring) --> mkhedrulistring
> 
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> 
> mkhedrulistring
> 
> Thus avoiding any mixed case.

I have been thinking through this. It seems quite appealing.

But I'm concerned there may be some edge cases. I have been able to come 
up with two so far:

- Applying this to a string starting with upper-case SZ (U+1E9E).
   This may change SZ ? ? ? Ss.
- Using the 'capitalize' method to (try to) get the titlecase
   property of a MTAVRULI character. (There's no other way
   currently in Ruby to get the titlecase property.)

There may be others. If you have some ideas, I'd appreciate to know 
about them.

This lets me wonder why the UTC didn't simply declare the titlecase 
property of MTAVRULI to be mkhedruli. Was this considered or not? The 
way things are currently set up, there seems to be no benefit of 
MTAVRULI being its own titlecase, because in actual use, that requires 
additional processing.

Regards,   Martin.

From unicode at unicode.org  Tue Oct  9 03:22:25 2018
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Tue, 9 Oct 2018 10:22:25 +0200
Subject: Aw: Re: Dealing with Georgian capitalization in programming languages
In-Reply-To: <b1b4667c-8f8d-0715-7b44-ff0f9903cb11@it.aoyama.ac.jp>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
 <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
 <b1b4667c-8f8d-0715-7b44-ff0f9903cb11@it.aoyama.ac.jp>
Message-ID: <trinity-523a93cb-5b0d-43ca-ad84-9e481247ceb9-1539073345196@3c-app-webde-bap18>

The capital ? (U+1E9E) has been officially approved by the Council for the German Language since July 2018. However, there is no word starting with ?, that means the character is only relevant for full-capitalized words. It may only stand alone in spaced type, when there is no available italic font-style.

In the Ruby bug tracker that there is also an issue with Dutch ij ? IJ. The dedicated ligatures ??(U+0133) and ? (U+0133) are not recommended and thus never used, but leading ij must always be capitalized to IJ, as in IJSBERG ? ijsberg ? IJsberg. The actual problem is that the current capitalization algorithm is based on a regular grammar (type 3). It has to be adjusted for a context-sensitive (type 1) grammar. 

Regards,

Marius

?

On 2018/10/09 09:47, Martin J. D?rst wrote:

> I have been thinking through this. It seems quite appealing.
> 
> But I'm concerned there may be some edge cases. I have been able to come
> up with two so far:
> 
> - Applying this to a string starting with upper-case SZ (U+1E9E).
> This may change SZ ? ? ? Ss.
> - Using the 'capitalize' method to (try to) get the titlecase
> property of a MTAVRULI character. (There's no other way
> currently in Ruby to get the titlecase property.)
> 
> There may be others. If you have some ideas, I'd appreciate to know
> about them.
> 
> This lets me wonder why the UTC didn't simply declare the titlecase
> property of MTAVRULI to be mkhedruli. Was this considered or not? The
> way things are currently set up, there seems to be no benefit of
> MTAVRULI being its own titlecase, because in actual use, that requires
> additional processing.
> 
> Regards, Martin.


From unicode at unicode.org  Tue Oct  9 14:49:09 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 9 Oct 2018 12:49:09 -0700
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <b1b4667c-8f8d-0715-7b44-ff0f9903cb11@it.aoyama.ac.jp>
References: <ac668503-2dc7-3736-7a00-1e9c1db6eca1@it.aoyama.ac.jp>
 <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
 <b1b4667c-8f8d-0715-7b44-ff0f9903cb11@it.aoyama.ac.jp>
Message-ID: <f2054476-09a2-ba74-b952-6e5fdf7415bc@att.net>

Martin,

On 10/9/2018 12:47 AM, Martin J. D?rst via Unicode wrote:
> - Using the 'capitalize' method to (try to) get the titlecase
> ? property of a MTAVRULI character. (There's no other way
> ? currently in Ruby to get the titlecase property.)
>
> There may be others. If you have some ideas, I'd appreciate to know 
> about them.
>
> This lets me wonder why the UTC didn't simply declare the titlecase 
> property of MTAVRULI to be mkhedruli. Was this considered or not? The 
> way things are currently set up, there seems to be no benefit of 
> MTAVRULI being its own titlecase, because in actual use, that requires 
> additional processing.

Titlecasing for Georgian was not completely thought through before 
Mtavruli was added. As I noted in my earlier comment on this thread, the 
titlecase mapping values for Mkhredruli were added late in the process, 
when it became clear that not doing so would result in inappropriate 
outcomes for existing Mkhredruli text.

I don't think there is a fully-worked out position on this, but adding a 
Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect, 
just further muddy waters for implementers, because it would be in 
effect saying that an uppercase letter titlecases by shifting to its 
lowercase mapping. A headscratcher, at the very least.

Note that with the current mappings as they are, Changes_When_Titlecased 
is False for all Mkhedruli and for all Mtavruli characters, which I 
think is the desired state of affairs. A titlecasing string operation of 
Mtavruli that does something other than just leave the string alone 
should, IMO, be documented as doing something extra and *should* have to 
do additional processing.

--Ken


From unicode at unicode.org  Wed Oct 10 03:14:12 2018
From: unicode at unicode.org (arno.schmitt via Unicode)
Date: Wed, 10 Oct 2018 09:14:12 +0100
Subject: Unicode Arabic Mark Rendering UTR #53 Now Published
In-Reply-To: <5BBD0263.2040006@unicode.org>
References: <5BBD0263.2040006@unicode.org>
Message-ID: <63e9abb7-394b-ed3c-84d6-a39969359f34@gmx.net>

The paper adopted treats the word shown (fa-?ul??ika) writing
with an unkown letter +  kasra below + hamza below.
I thought, in Unicode I should use 'ARABIC LETTER YEH WITH HAMZA ABOVE' 
(U+0626) or its phonological equivilant 'ARABIC LETTER YEH WITH HAMZA 
BELOW' (U+0826) or the basic letter 'ARABIC LETTER YEH WITH HAMZA' 
(U+0825). My error or an inconsistency in Unicode?

Am 09.10.2018 um 21:32 schrieb announcements at unicode.org:
> exampleThe combining classes of Arabic combining characters in Unicode 
> are different than combining classes in most other scripts. They are a 
> mixture of special classes for specific marks plus two more generalized 
> classes for all the other marks. This has resulted in inconsistent 
> and/or incorrect rendering for sequences with multiple combining marks 
> since Unicode 2.0.
> 
> 
> The Arabic Mark Transient Reordering Algorithm (AMTRA) described in UTR 
> #53 <http://www.unicode.org/reports/tr53/> is the recommended solution 
> to achieving correct and consistent rendering of Arabic combining mark 
> sequences. This algorithm provides results that match user expectations 
> and assures that canonically equivalent sequences are rendered 
> identically, independent of the order of the combining marks.
> 
> 
> The concepts in this algorithm were first proposed four years ago by 
> Roozbeh Pournader. We are pleased it has now been published as an 
> official Technical Report.
> 

From unicode at unicode.org  Fri Oct 12 05:54:57 2018
From: unicode at unicode.org (Costello, Roger L. via Unicode)
Date: Fri, 12 Oct 2018 10:54:57 +0000
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
Message-ID: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>

Hi Unicode Experts,

Suppose base64 encoding is applied to m to yield base64 text t. 

Next, suppose base64 encoding is applied to m' to yield base64 text t'.

If m is not equal to m', then t will not equal t'.

In other words, given different inputs, base64 encoding always yields different base64 texts.

True or false?

How about the opposite direction: If m is base64 encoded to yield t and then t is base64 decoded to yield n, will it always be the case that m equals n?

/Roger


From unicode at unicode.org  Fri Oct 12 06:08:40 2018
From: unicode at unicode.org (J Decker via Unicode)
Date: Fri, 12 Oct 2018 04:08:40 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
Message-ID: <CAA2GJqVaVGmW4YVFY8_1qQsL2yL6PHTsygN7Kw+g1Q_ZkQh9UQ@mail.gmail.com>

On Fri, Oct 12, 2018 at 3:57 AM Costello, Roger L. via Unicode <
unicode at unicode.org> wrote:

> Hi Unicode Experts,
>
> Suppose base64 encoding is applied to m to yield base64 text t.
>
> Next, suppose base64 encoding is applied to m' to yield base64 text t'.
>
> If m is not equal to m', then t will not equal t'.
>
> In other words, given different inputs, base64 encoding always yields
> different base64 texts.
>
> True or false?
>
true.  base64 to and from is always the same thing.

>
> How about the opposite direction: If m is base64 encoded to yield t and
> then t is base64 decoded to yield n, will it always be the case that m
> equals n?
>
False.
Canonical translation may occur which the different base64 may be the same
sort of string...

https://en.wikipedia.org/wiki/Unicode_equivalence
https://en.wikipedia.org/wiki/Canonical_form


> /Roger
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181012/84cfe83e/attachment.html>

From unicode at unicode.org  Fri Oct 12 11:17:59 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Fri, 12 Oct 2018 09:17:59 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or =?UTF-8?Q?false=3F?=
Message-ID: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>

J Decker wrote:

>> How about the opposite direction: If m is base64 encoded to yield t
>> and then t is base64 decoded to yield n, will it always be the case
>> that m equals n?
>
> False.
> Canonical translation may occur which the different base64 may be the
> same sort of string...

Base64 is a binary-to-text encoding. Neither encoding nor decoding
should presume any special knowledge of the meaning of the binary data,
or do anything extra based on that presumption.

Converting Unicode text to and from base64 should not perform any sort
of Unicode normalization, convert between UTFs, insert or remove BOMs,
etc. This is like saying that converting a JPEG image to and from base64
should not resize or rescale the image, change its color depth, convert
it to another graphic format, etc.

So I'd say "true" to Roger's question.

I touched on this a little bit in UTN #14, from the standpoint of trying
to improve compression by normalizing the Unicode text first.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Fri Oct 12 11:29:29 2018
From: unicode at unicode.org (J Decker via Unicode)
Date: Fri, 12 Oct 2018 09:29:29 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>
References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>
Message-ID: <CAA2GJqWPK4wehmLnjo9dGP7+RCfMCTb=YJ7keupKfZV-wVL_ng@mail.gmail.com>

On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode <unicode at unicode.org>
wrote:

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
On the first side (X to base64) definitely true.

But there is potential that text resulting from some decoded buffer is
translated, resulting in a 'congruent' string that's not exactly the
same... and the base64 will be different.

Comparing some base64 string with some other base64 string shows a binary
difference, but may be still the 'same' string.


>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181012/ac0ca3bd/attachment.html>

From unicode at unicode.org  Fri Oct 12 14:26:45 2018
From: unicode at unicode.org (Tex via Unicode)
Date: Fri, 12 Oct 2018 12:26:45 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAA2GJqWPK4wehmLnjo9dGP7+RCfMCTb=YJ7keupKfZV-wVL_ng@mail.gmail.com>
References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>
 <CAA2GJqWPK4wehmLnjo9dGP7+RCfMCTb=YJ7keupKfZV-wVL_ng@mail.gmail.com>
Message-ID: <007601d46261$8454d990$8cfe8cb0$@xencraft.com>

I agree with Doug. Base64 maps each byte of the source string to unique bytes in the destination string. Decoding is also a unique mapping.

If the encoded string is ?translated? in some way by additional processes, canonical or otherwise, then all bets are off.

 
If you disagree, please offer an example or additional details of how 2 base64 strings might be equivalent.

 
Tex

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker via Unicode
Sent: Friday, October 12, 2018 9:29 AM
To: doug at ewellic.org
Cc: Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

 
On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode <unicode at unicode.org> wrote:

J Decker wrote:

>> How about the opposite direction: If m is base64 encoded to yield t
>> and then t is base64 decoded to yield n, will it always be the case
>> that m equals n?
>
> False.
> Canonical translation may occur which the different base64 may be the
> same sort of string...

Base64 is a binary-to-text encoding. Neither encoding nor decoding
should presume any special knowledge of the meaning of the binary data,
or do anything extra based on that presumption.

Converting Unicode text to and from base64 should not perform any sort
of Unicode normalization, convert between UTFs, insert or remove BOMs,
etc. This is like saying that converting a JPEG image to and from base64
should not resize or rescale the image, change its color depth, convert
it to another graphic format, etc.

So I'd say "true" to Roger's question.

On the first side (X to base64) definitely true.

 
But there is potential that text resulting from some decoded buffer is translated, resulting in a 'congruent' string that's not exactly the same... and the base64 will be different.

 
Comparing some base64 string with some other base64 string shows a binary difference, but may be still the 'same' string. 

 
I touched on this a little bit in UTN #14, from the standpoint of trying
to improve compression by normalizing the Unicode text first.

--
Doug Ewell | Thornton, CO, US | ewellic.org

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181012/997b5c00/attachment.html>

From unicode at unicode.org  Fri Oct 12 20:12:40 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 13 Oct 2018 03:12:40 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>
References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com>
Message-ID: <CAGa7JC1dH92A0UEgD=i+9Fkr7xdjnb9HkNJBbfR1gCbJwUid-g@mail.gmail.com>

I also think the reverse is also true !

Decoding a Base64 entity does not warranty it will return valid text in any
known encoding. So Unicode normalization of the output cannot apply.

Even if it represents text, nothing indicates that the result will be
encoded with some Unicode encoding form (unless this is tagged separately,
like in MIME).

If you use Base64 for decoding MIME contents (e.g. for emails), the Base-64
decoding itself will not transform the encoding, but then the email parser
will have to ensure that the text encoding is valid, at which time it will
have to transform it (possibly replace some invalid sequences or truncate
it), and then only it may apply normalization to help render that text. But
these transforms are part of the MIME application and independant of whever
you used Base-64 or any another binary encoding or transport syntax.

In other words: "If m is not equal to m', then t will not equal t'" is
reversible, but nothing indicates that m or m' Base64-decoded are texts,
they are just opaque binary objects which are still equal in value like
their t or t' Base64-encodings.

Note: some Base64 envelope formats (like MIME) allow multiple
representations t and t' from the same message m, by adding paddings or
transport syntaxes like line-splitting (with varaible length). Base64 alone
does not allow that variation (it normally uses a static alphabet), but
there are variants that accept decoding extended alphabets as binary
equivalent. So you may have two MIME-encoded texts that have different
encodings (with Base64 or Quopted-Printable, with variable line lengths)
but that represent the same source binary object, and decoding these
different encoded messages will yeld the same binary object: this does not
depend on Base64 but on the permissivity/flexibility of decoders for these
envelope formats (using **extensions** of Base64 specific to the envelope
format).


Le ven. 12 oct. 2018 ? 18:27, Doug Ewell via Unicode <unicode at unicode.org>
a ?crit :

> J Decker wrote:
>
> >> How about the opposite direction: If m is base64 encoded to yield t
> >> and then t is base64 decoded to yield n, will it always be the case
> >> that m equals n?
> >
> > False.
> > Canonical translation may occur which the different base64 may be the
> > same sort of string...
>
> Base64 is a binary-to-text encoding. Neither encoding nor decoding
> should presume any special knowledge of the meaning of the binary data,
> or do anything extra based on that presumption.
>
> Converting Unicode text to and from base64 should not perform any sort
> of Unicode normalization, convert between UTFs, insert or remove BOMs,
> etc. This is like saying that converting a JPEG image to and from base64
> should not resize or rescale the image, change its color depth, convert
> it to another graphic format, etc.
>
> So I'd say "true" to Roger's question.
>
> I touched on this a little bit in UTN #14, from the standpoint of trying
> to improve compression by normalizing the Unicode text first.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181013/c989a24f/attachment.html>

From unicode at unicode.org  Sat Oct 13 09:16:59 2018
From: unicode at unicode.org (Costello, Roger L. via Unicode)
Date: Sat, 13 Oct 2018 14:16:59 +0000
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
Message-ID: <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>

Hi Folks,

Thank you for your outstanding responses! 

Below is a summary of what I learned. Are there any errors in the summary? Is there anything you would add? Please let me know of anything that is not clear.   /Roger

1. While base64 encoding is usually applied to binary, it is also sometimes applied to text, such as Unicode text.

Note: Since base64 encoding may be applied to both binary and text, in the following bullets I use the more generic term "data". For example, "Data d is base64-encoded to yield ..."

2. Neither base64 encoding nor decoding should presume any special knowledge of the meaning of the data or do anything extra based on that presumption. 

For example, converting Unicode text to and from base64 should not perform any sort of Unicode normalization, convert between UTFs, insert or remove BOMs, etc. This is like saying that converting a JPEG image to and from base64 should not resize or rescale the image, change its color depth, convert it to another graphic format, etc.

If you use base64 for encoding MIME content (e.g. emails), the base64 decoding will not transform the content. The email parser must ensure that the content is valid, so the parser might have to transform the content (possibly replacing some invalid sequences or truncating), and then apply Unicode normalization to render the text. These transforms are part of the MIME application and are independent of whether you use base64 or any another encoding or transport syntax.

3. If data d is different than d', then the base64 text resulting from encoding d is different than the base64 text resulting from encoding d'.

4. If base64 text t is different than t', then the data resulting from decoding t is different than the data resulting from decoding t'.

5. For every data d there is exactly one base64 encoding t.

6. Every base64 text t is an encoding of exactly one data d.

7. For all data d, Base64_Decode[Base64_Encode[d]] = d


From unicode at unicode.org  Sat Oct 13 09:45:10 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 13 Oct 2018 16:45:10 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
Message-ID: <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>

You forget that Base64 (as used in MIME) does not follow these rules as it
allows multiple different encodings for the same source binary. MIME
actually splits a binary object into multiple fragments at random
positions, and then encodes these fragments separately. Also MIME uses an
extension of Base64 where it allows some variations in the encoding
alphabet (so even the same fragment of the same length may have two disting
encodings).

Base64 in MIME is different from standard Base64 (which never splits the
binary object before encoding it, and uses a strict alphabet of 64 ASCII
characters, allowing no variation). So MIME requires special handling: the
assumpton that a binary message is encoded the same is wrong, but MIME
still requires that this non unique Base64 encoding will be decoded back to
the same initial (unsplitted) binary object (independantly of its size and
independantly of the splitting boundaries used in the transport, which may
change during the transport).

This also applies to the Base64 encoding used in HTTP transport syntax, and
notably in the HTTP/1.1 streaming feature where fragment sizes are also
variable.


Le sam. 13 oct. 2018 ? 16:27, Costello, Roger L. via Unicode <
unicode at unicode.org> a ?crit :

> Hi Folks,
>
> Thank you for your outstanding responses!
>
> Below is a summary of what I learned. Are there any errors in the summary?
> Is there anything you would add? Please let me know of anything that is not
> clear.   /Roger
>
> 1. While base64 encoding is usually applied to binary, it is also
> sometimes applied to text, such as Unicode text.
>
> Note: Since base64 encoding may be applied to both binary and text, in the
> following bullets I use the more generic term "data". For example, "Data d
> is base64-encoded to yield ..."
>
> 2. Neither base64 encoding nor decoding should presume any special
> knowledge of the meaning of the data or do anything extra based on that
> presumption.
>
> For example, converting Unicode text to and from base64 should not perform
> any sort of Unicode normalization, convert between UTFs, insert or remove
> BOMs, etc. This is like saying that converting a JPEG image to and from
> base64 should not resize or rescale the image, change its color depth,
> convert it to another graphic format, etc.
>
> If you use base64 for encoding MIME content (e.g. emails), the base64
> decoding will not transform the content. The email parser must ensure that
> the content is valid, so the parser might have to transform the content
> (possibly replacing some invalid sequences or truncating), and then apply
> Unicode normalization to render the text. These transforms are part of the
> MIME application and are independent of whether you use base64 or any
> another encoding or transport syntax.
>
> 3. If data d is different than d', then the base64 text resulting from
> encoding d is different than the base64 text resulting from encoding d'.
>
> 4. If base64 text t is different than t', then the data resulting from
> decoding t is different than the data resulting from decoding t'.
>
> 5. For every data d there is exactly one base64 encoding t.
>
> 6. Every base64 text t is an encoding of exactly one data d.
>
> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181013/6d52d609/attachment.html>

From unicode at unicode.org  Sat Oct 13 09:51:50 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 13 Oct 2018 16:51:50 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
Message-ID: <CAGa7JC2to4VWLO=EJTcef0z=qZNZFSWZJYrRBv0PywcbbEL3YQ@mail.gmail.com>

In summary, two disating implementations are allowed to return different
values t and t' of Base64_Encode(d) from the same message d, but both
Base64_Decode(t') and  Base64_Decode(t) will be equal and will MUST return
d exactly.

There's an allowed choice of implementation for Base64_Encode() but
Base64_Decode() must then be updated to be permissive/flexible and ensure
that in all cases,
Base64_Decode[Base64_Encode[d]] = d, for every value of d.

The reverse is not true because of this flexibility (needed for various
transport protocols that have different requirements, notably on the
allowed set of characters, and on their maximum line lengths):
Base64_Encode[Base64_Decode[t]] = t may be false.


Le sam. 13 oct. 2018 ? 16:45, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> You forget that Base64 (as used in MIME) does not follow these rules as it
> allows multiple different encodings for the same source binary. MIME
> actually splits a binary object into multiple fragments at random
> positions, and then encodes these fragments separately. Also MIME uses an
> extension of Base64 where it allows some variations in the encoding
> alphabet (so even the same fragment of the same length may have two disting
> encodings).
>
> Base64 in MIME is different from standard Base64 (which never splits the
> binary object before encoding it, and uses a strict alphabet of 64 ASCII
> characters, allowing no variation). So MIME requires special handling: the
> assumpton that a binary message is encoded the same is wrong, but MIME
> still requires that this non unique Base64 encoding will be decoded back to
> the same initial (unsplitted) binary object (independantly of its size and
> independantly of the splitting boundaries used in the transport, which may
> change during the transport).
>
> This also applies to the Base64 encoding used in HTTP transport syntax,
> and notably in the HTTP/1.1 streaming feature where fragment sizes are also
> variable.
>
>
> Le sam. 13 oct. 2018 ? 16:27, Costello, Roger L. via Unicode <
> unicode at unicode.org> a ?crit :
>
>> Hi Folks,
>>
>> Thank you for your outstanding responses!
>>
>> Below is a summary of what I learned. Are there any errors in the
>> summary? Is there anything you would add? Please let me know of anything
>> that is not clear.   /Roger
>>
>> 1. While base64 encoding is usually applied to binary, it is also
>> sometimes applied to text, such as Unicode text.
>>
>> Note: Since base64 encoding may be applied to both binary and text, in
>> the following bullets I use the more generic term "data". For example,
>> "Data d is base64-encoded to yield ..."
>>
>> 2. Neither base64 encoding nor decoding should presume any special
>> knowledge of the meaning of the data or do anything extra based on that
>> presumption.
>>
>> For example, converting Unicode text to and from base64 should not
>> perform any sort of Unicode normalization, convert between UTFs, insert or
>> remove BOMs, etc. This is like saying that converting a JPEG image to and
>> from base64 should not resize or rescale the image, change its color depth,
>> convert it to another graphic format, etc.
>>
>> If you use base64 for encoding MIME content (e.g. emails), the base64
>> decoding will not transform the content. The email parser must ensure that
>> the content is valid, so the parser might have to transform the content
>> (possibly replacing some invalid sequences or truncating), and then apply
>> Unicode normalization to render the text. These transforms are part of the
>> MIME application and are independent of whether you use base64 or any
>> another encoding or transport syntax.
>>
>> 3. If data d is different than d', then the base64 text resulting from
>> encoding d is different than the base64 text resulting from encoding d'.
>>
>> 4. If base64 text t is different than t', then the data resulting from
>> decoding t is different than the data resulting from decoding t'.
>>
>> 5. For every data d there is exactly one base64 encoding t.
>>
>> 6. Every base64 text t is an encoding of exactly one data d.
>>
>> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181013/2efd932e/attachment.html>

From unicode at unicode.org  Sat Oct 13 11:50:19 2018
From: unicode at unicode.org (Steffen Nurpmeso via Unicode)
Date: Sat, 13 Oct 2018 18:50:19 +0200
Subject: Base64 encoding applied to different unicode texts always
 yields different base64 texts ... true or false?
In-Reply-To: <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
Message-ID: <20181013165019.sxGzV%steffen@sdaoden.eu>

Philippe Verdy via Unicode wrote in <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+\
w9+jEARW4Ghyk8hg at mail.gmail.com>:
 |You forget that Base64 (as used in MIME) does not follow these rules \
 |as it allows multiple different encodings for the same source binary. \
 |MIME actually 
 |splits a binary object into multiple fragments at random positions, \
 |and then encodes these fragments separately. Also MIME uses an extension \
 |of Base64 
 |where it allows some variations in the encoding alphabet (so even the \
 |same fragment of the same length may have two disting encodings).
 |
 |Base64 in MIME is different from standard Base64 (which never splits \
 |the binary object before encoding it, and uses a strict alphabet of \
 |64 ASCII 
 |characters, allowing no variation). So MIME requires special handling: \
 |the assumpton that a binary message is encoded the same is wrong, but \
 |MIME still 
 |requires that this non unique Base64 encoding will be decoded back \
 |to the same initial (unsplitted) binary object (independantly of its \
 |size and 
 |independantly of the splitting boundaries used in the transport, which \
 |may change during the transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and encodes any data
transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
(the authors commemorate that) text.
When decoding it reverts this representation into its original form.
Ok, there is the CRLF newline problem, as below.
What do you mean by "splitting"?

...
The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form.  In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding.  The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.

This is MIME, it specifies (in the same RFC):

  2.10.  Lines

   "Lines" are defined as sequences of octets separated by a CRLF
   sequences.  This is consistent with both RFC 821 and RFC 822.
   "Lines" only refers to a unit of data in a message, which may or may
   not correspond to something that is actually displayed by a user
   agent.

and furthermore

  6.5.  Translating Encodings

   The quoted-printable and base64 encodings are designed so that
   conversion between them is possible.  The only issue that arises in
   such a conversion is the handling of hard line breaks in quoted-
   printable encoding output. When converting from quoted-printable to
   base64 a hard line break in the quoted-printable form represents a
   CRLF sequence in the canonical form of the data. It must therefore be
   converted to a corresponding encoded CRLF in the base64 form of the
   data.  Similarly, a CRLF sequence in the canonical form of the data
   obtained after base64 decoding must be converted to a quoted-
   printable hard line break, but ONLY when converting text data.

So we go over

  6.6.  Canonical Encoding Model

   There was some confusion, in the previous versions of this RFC,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system, and the relationship
   between content-transfer-encodings and character sets.  A canonical
   model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

         For example, in the case of text/plain data, the text
          must be converted to a supported character set and
          lines must be delimited with CRLF delimiters in
          accordance with RFC 822.  Note that the restriction on
          line lengths implied by RFC 822 is eliminated if the
          next step employs either quoted-printable or base64
          encoding.

and, later

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

and, later

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.

Whoever understands this emojibake.
My MUA still gnaws at antiquated structures (i am too lazy), but
in quoted-printable we encode CRLF in the raw text to "=0D=0A=",
i.e., a trailing soft line break so that data is decoded as plain
CRLF again.  Something like that it should be i think.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From unicode at unicode.org  Sat Oct 13 18:37:35 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 14 Oct 2018 01:37:35 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <20181013165019.sxGzV%steffen@sdaoden.eu>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
Message-ID: <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>

Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
unicode at unicode.org> a ?crit :

> Philippe Verdy via Unicode wrote in <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+\
> w9+jEARW4Ghyk8hg at mail.gmail.com>:
>  |You forget that Base64 (as used in MIME) does not follow these rules \
>  |as it allows multiple different encodings for the same source binary. \
>  |MIME actually
>  |splits a binary object into multiple fragments at random positions, \
>  |and then encodes these fragments separately. Also MIME uses an extension
> \
>  |of Base64
>  |where it allows some variations in the encoding alphabet (so even the \
>  |same fragment of the same length may have two disting encodings).
>  |
>  |Base64 in MIME is different from standard Base64 (which never splits \
>  |the binary object before encoding it, and uses a strict alphabet of \
>  |64 ASCII
>  |characters, allowing no variation). So MIME requires special handling: \
>  |the assumpton that a binary message is encoded the same is wrong, but \
>  |MIME still
>  |requires that this non unique Base64 encoding will be decoded back \
>  |to the same initial (unsplitted) binary object (independantly of its \
>  |size and
>  |independantly of the splitting boundaries used in the transport, which \
>  |may change during the transport).
>
> Base64 is defined in RFC 2045 (Multipurpose Internet Mail
> Extensions (MIME) Part One: Format of Internet Message Bodies).
> It is a content-transfer-encoding and encodes any data
> transparently into a 7 bit clean ASCII _and_ EBCDIC compatible
> (the authors commemorate that) text.
> When decoding it reverts this representation into its original form.
> Ok, there is the CRLF newline problem, as below.
> What do you mean by "splitting"?
>
> ...
> The only variance is described as:
>
>   Care must be taken to use the proper octets for line breaks if base64
>   encoding is applied directly to text material that has not been
>   converted to canonical form.  In particular, text line breaks must be
>   converted into CRLF sequences prior to base64 encoding.  The
>   important thing to note is that this may be done directly by the
>   encoder rather than in a prior canonicalization step in some
>   implementations.
>
> This is MIME, it specifies (in the same RFC):


I've not spoken aboutr the encoding of new lines **in the actual encoded
text**:
-  if their existing text-encoding ever gets converted to Base64 as if the
whole text was an opaque binary object, their initial text-encoding will be
preserved (so yes it will preserve the way these embedded newlines are
encoded as CR, LF, CR+LF, NL...)

I spoke about newlines used in the transport syntax to split the initial
binary object (which may actually contain text but it does not matter).
MIME defines this operation and even requires splitting the binary object
in fragments with maximum binary size so that these binary fragments can be
converted with Base64 into lines with maximum length. In the MIME Base64
representation you can insert newlines anywhere between fragments encoded
separately.

The maximum size of fragment is not fixed (it is usually about 60 binary
octets, that are converted to lines of 80 ASCII characters, followed by a
newline (CR+LF is strongly suggested for MIME, but it is admitted to use
other newline sequences). Email forwarding agents frequently needed these
line lengths to process the mail properly (not just the MIME headers but as
well the content body, where they want at least some whitespace or newline
in the middle where they can freely rearrange the line lines by compressing
whitespaces or splitting lines to shorter length as necessary to their
processing; this is much less frequent today because most mail agents are
8-bit clean and allow arbitrary line lengths... except in MIME headers).

In MIME headers the situation is different, there's really a maximum
line-length there, and if a header is too long, it has to be split on
multiple lines (using continuation sequences, i.e. a newline (CR+LF is
standard here) followed by at least one space (this
insertion/change/removal of whitespaces is permitted everywhere in the MIME
header after the header type, but even before the colon that follows the
header type). So a MIME header value whose included text gets encoded with
Base64 will be split using "=?" sequences starting the indication that the
fragment is Base64 encoded (instead of being QuotedPrintable-encoded) and
then a separator and the encapsulated Base-64 encoding of a fragment, and a
single header may have multiple Base64-encoded fragments in the same header
value, and there's large freedom about where to split the value to isolate
fragments with convenient size that satisfies the MIME requirements. These
multiple fragemetns may then occur on the same line (separated by
whitespace) or on multiple line (separated by continuation sequences).

In that case, the same initial text can have multiple valid representation
in a MIME envelope format using Base64: it is not Base64 itself that splits
the message, but the MIME transport syntax (which itself does not alter the
initial text-encoding of the initial text... except in parts that are NOT
binary-encoded using Base64 or QuotedPrintable).

We are in a case where Base64 is not applied uniquely, because it is driven
not by the actual transported text, but by the MIME transport syntax, and
MIME allows freely changing the Base64 fragment sizes (or even switch to
another encoding) as long as it preserves the binary value of the embedded
object, and also to change the text-encoding (UTF-8, ISO 8859-*, etc.) if
encoded fragments are identified to actually contain text (this does not
apply to content bodies, unless they are declared with a "text/*" MIME type
in the headers; but this applies for known headers whose value is
necessarily a text type (such as in headers with types "From:", "To:",
"Cc:", "Subject:", "Date:" ...)

MIME defines two distinct syntaxes, one for declaration headers, another
for content bodies. Each one can use Base64 encoding and split the content
(but differently).

HTTP also has a mechanism for splitting a large body into fragments (this
allows notably to create streaming protocols where fragments can be easily
multiplexed with parallel streams, or to include digital fingerprints or
security signatures for individual fragments to secure the stream. This
fragmentation is independant of the network transport (generally TCP, but
not only) which has its own transparent MTUs at session layer, link layers,
and also can be itself be encapsulated through tunnels transported by other
means with different MTUs and fragmentation : HTTP does not have to manage
that lower layer).

Both MIME (for mails) and HTTP define allowed transformations to drive how
Base64 will be used. Both have enough flexibility to allow variable
fragment sizes, and even allow them to be changed as needed for the
transport (this is challending for data signatures of the exchanged
contents, but both MIME and HTTP can safely preserve the content without
breaking these signatures in the middle): the recipient may not recieve
exactly the same Base-64 encoded message, but it will get the same message
content (once it is Base64 decoded)

Base64 is used exactly to support this flexibility in transport (or
storage) without altering any bit of the initial content once it is decoded.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181014/2803c1ce/attachment.html>

From unicode at unicode.org  Sat Oct 13 19:02:59 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 14 Oct 2018 01:02:59 +0100
Subject: Fallback for Sinhala Consonant  Clusters
Message-ID: <20181014010259.4fb5436a@JRWUBU2>

Are there fallback rules for Sinhala consonant clusters?  There are
fallback rules for Devanagari, but I'm not sure if they read across.

The problem I am seeing is that the Pali syllable 'ndhe' ????? <U+0DB1
NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9
KOMBUVA> is being rendered identically to a hypothetical Sinhalese
'n?dha' ??? <U+0DB1, U+0DDA DIGA KOMBUVA, U+0DB0>,  which in NFD is
<U+0DB1, U+0DD9, U+0DCA, U+0DB0>, when I use a font that lacks the
conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
preference would lead to a fallback rendering as ????  (Sinhalese
'ndhe'), which is encoded as <U+0DB1 NAYANNA, U+0DCA AL-LAKUNA, U+0DB0
MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>.  Is the rendering I am getting
technically wrong, or is it merely undesirable?

The ambiguity arises in part because, like the Brahmi script, the
Sinhala script uses its virama character as a vowel length indicator.

Missing touching consonants are being rendered almost as though there
were no ZWJ, but the combination of consonant and al-lakuna is being
rendered badly.

Richard.


From unicode at unicode.org  Sat Oct 13 20:39:04 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Sun, 14 Oct 2018 03:39:04 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
Message-ID: <20181014013904.idfomqt5s65wnqro@angband.pl>

On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> unicode at unicode.org> a ?crit :
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken aboutr the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"     is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


????
-- 
??????? 
??????? 10 people enter a bar: 1 who understands binary,
??????? 1 who doesn't, D who prefer to write it as hex,
??????? and 1 who narrowly avoided an off-by-one error.

From unicode at unicode.org  Sun Oct 14 03:15:26 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Sun, 14 Oct 2018 17:15:26 +0900
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <20181014010259.4fb5436a@JRWUBU2>
References: <20181014010259.4fb5436a@JRWUBU2>
Message-ID: <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp>

Hello Richard,

On 2018/10/14 09:02, Richard Wordingham via Unicode wrote:
> Are there fallback rules for Sinhala consonant clusters?  There are
> fallback rules for Devanagari, but I'm not sure if they read across.
> 
> The problem I am seeing is that the Pali syllable 'ndhe' ????? <U+0DB1
> NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9
> KOMBUVA>

Let's label this as (1)

> is being rendered identically to a hypothetical Sinhalese
> 'n?dha' ??? <U+0DB1, U+0DDA DIGA KOMBUVA, U+0DB0>,

It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1).

Your mail is written as if you are speaking about a general phenomenon, 
but I guess there are differences depending on the font and rendering stack.

> which in NFD is
> <U+0DB1, U+0DD9, U+0DCA, U+0DB0>, when I use a font that lacks the
> conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
> preference would lead to a fallback rendering as ????  (Sinhalese
> 'ndhe'),

Here, this (3) looks like it has the same three components as (2), but 
the first two are exchanged, so that the piece that looks like @ is now 
in the middle (it was at the left in (1) and (2)).

Hope this helps.  Regards,    Martin.

> which is encoded as <U+0DB1 NAYANNA, U+0DCA AL-LAKUNA, U+0DB0
> MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>.  Is the rendering I am getting
> technically wrong, or is it merely undesirable?
> 
> The ambiguity arises in part because, like the Brahmi script, the
> Sinhala script uses its virama character as a vowel length indicator.
> 
> Missing touching consonants are being rendered almost as though there
> were no ZWJ, but the combination of consonant and al-lakuna is being
> rendered badly.
> 
> Richard.
> 
> .
> 

-- 
Prof. Dr.sc. Martin J. D?rst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From unicode at unicode.org  Sun Oct 14 03:41:28 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 14 Oct 2018 10:41:28 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <20181014013904.idfomqt5s65wnqro@angband.pl>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
Message-ID: <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>

Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
enough to indicate the end of an octets-span. The extra = after it do not
add any other octet. and as well you're allowed to insert whitespaces
anywhere in the encoded stream (this is what ensures that the
Base64-encoded octets-stream will not be altered if line breaks are forced
anywhere (notably within the body of emails).

So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
LF, NEL) in the middle is non-significant and ignorable on decoding (their
"encoded" bit length is 0 and they don't terminate an octets-span, unlike
"=" which discards extra bits remaining from the encoded stream before that
are not on 8-bit boundaries).

Also:
- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
before "=" can vary in its 4 lowest bits (which are then ignored/discarded
by the "=" symbol)
- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol
before "=" can vary in its 2 lowest bits (which are then ignored/discarded
by the "=" symbol)

So you can use Base64 by encoding each octet in separate pieces, as one
Base64 symbol followed by an "=" symbol, and even insert any number of
whitespaces between them: there's a infinite number of valid Base64
encodings for representing the same octets-stream payload.

Base64 allows encoding any octets streams but not directly any bits-streams
: it assumes that the effective bits-stream has a binary length multiple of
8. To encode a bits-stream with an exact number of bits (not multiple of
8), you need to encode an extra payload to indicate the effective number of
bits to keep at end of the encoded octets-stream (or at start):
- Base64 does not specify how you convert a bitstream of arbitrary length
to an octets-stream;
- for that purpose, you may need to pad the bits-stream at start or at end
with 1 to 6 bits (so that it the resulting bitstream has a length multiple
of 8, then encodable with Base64 which takes only octets on input).
- these extra padding bits are not significant for the original bitstream,
but are significant for the Base64 encoder/decoder, they will be discarded
by the bitstream decoder built on top of the Base64 decoder, but not by the
Base64 decoder itself.

You need to encode somewhere with the bitstream encoder how many padding
bits (0 to 7) are present at start or end of the octets-stream; this can be
done:
- as a separate payload (not encoded by Base64), or
- by prepending 3 bits at start of the bits-stream then padded at end with
1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
encoding.
- by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
Finally your bits-stream decoder will be able to use this padding count to
discard these random padding bits (and possibly realign the stream on
different byte-boundaries when the effective bitlength bits-stream payload
is not a multiple of 8 and padding bits were added)

Base64 also does not specify how bits of the original bits-stream payload
are packed into the octets-stream input suitable for Base64-encoding,
notably it does not specify their order and endian-ness. The same remark
applies as well for MIME, HTTP. So lot of network protocols and file
formats need to how to properly encode which possible option is used to
encode bits-streams of arbitrary length, or need to specify which default
choice to apply if this option is not encoded, or which option must be used
(with no possible variation). And this also adds to the number of distinct
encodings that are possible but are still equivalent for the same effective
bits-stream payload.

All these allowed variations are from the encoder perspective. For
interoperability, the decoder has to be flexible and to support various
options to be compatible with different implementations of the encoder,
notably when the encoder was run on a different system. And this is the
case for the MIME transport by mail, or for HTTP and FTP transports, or
file/media storage formats even if the file is stored on the same system,
because it may actually be a copy stored locally but coming from another
system where the file was actually encoded).

Now if we come back to the encoding of plain-text payloads, Unicode just
specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
points (it actually does not mandate an exact bit-length because the range
does not fully fit exactly to 21 bits and an encoder can still pack
multiple code points together into more compact code units.

However Unicode provides and standardizes several encodings (UTF-8/16/32)
which use code units whose size is directly suitable as input for an
octets-stream, so that they are directly encodable with Base64, without
having to specify an extra layer for the bits-stream encoder/decoder.

But many other encodings are still possible (and can be conforming to
Unicode, provided they preserve each Unicode scalar value, or at least the
code point identity because an encoder/decoder is not required to support
non-character code points such as surrogates or U+FFFE), where Base64 may
be used for internally generated octets-streams.


Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <unicode at unicode.org>
a ?crit :

> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> > unicode at unicode.org> a ?crit :
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken aboutr the encoding of new lines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if
> the
> > whole text was an opaque binary object, their initial text-encoding will
> be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > in fragments with maximum binary size so that these binary fragments can
> be
> > converted with Base64 into lines with maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but
> still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which cases it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow"     is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ????
> --
> ???????
> ??????? 10 people enter a bar: 1 who understands binary,
> ??????? 1 who doesn't, D who prefer to write it as hex,
> ??????? and 1 who narrowly avoided an off-by-one error.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181014/011ac0d6/attachment.html>

From unicode at unicode.org  Sun Oct 14 06:44:56 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 14 Oct 2018 12:44:56 +0100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp>
References: <20181014010259.4fb5436a@JRWUBU2>
 <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp>
Message-ID: <20181014124456.459cdef0@JRWUBU2>

On Sun, 14 Oct 2018 17:15:26 +0900
"Martin J. D?rst via Unicode" <unicode at unicode.org> wrote:

> Hello Richard,
> 
> On 2018/10/14 09:02, Richard Wordingham via Unicode wrote:
> > Are there fallback rules for Sinhala consonant clusters?  There are
> > fallback rules for Devanagari, but I'm not sure if they read across.
> > 
> > The problem I am seeing is that the Pali syllable 'ndhe' ?????
> > <U+0DB1 NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA
> > DAYANNA, U+0DD9  
> > KOMBUVA>  
> 
> Let's label this as (1)
> 
> > is being rendered identically to a hypothetical Sinhalese
> > 'n?dha' ??? <U+0DB1, U+0DDA DIGA KOMBUVA, U+0DB0>,  
> 
> It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1).
> 
> Your mail is written as if you are speaking about a general
> phenomenon, but I guess there are differences depending on the font
> and rendering stack.

The critical one is whether the font has the conjunct.  The default
Sinhala font on supported Windows, Iskoola Pota, has the conjunct. For
an example that should illustrate my points with that font (at least,
as on Windows 7) and the HarfBuzz renderer (as I believe in
Thunderbird), we have

1') Pali thve ????? <U+0DAE MAHAAPRAANA TAYANNA, U+0DCA
AL-LAKUNA, 200D ZWJ, U+0DC0 VAYANNA, U+0DD9 KOMBUVA>

It's a very rare syllable - it only occurs in sandhi, and I have only
a single example.  Iskoola Pota has neither the conjunct nor the
touching form; I would actually expect it to be the touching form
that exists.

2') Misleading look-alike th?va ??? <U+0DAE, U+0DDA DIGA KOMBUVA,
U+0DC0>

3') Preferred fallback appearance thve ????  <U+0DAE, U+0DCA, U+0DC0,
U+0DD9>.

My question is, 'What should a rendering stack that claims to support
the Sinhala script display when it lacks the conjunct in the font
being used?'

Now what does get displayed does depend on the rendering stack.
HarfBuzz (e.g. Firefox, Google Chrome, LibreOffice, and most Linux) and
Notepad on Windows 7 move the vowel to the left and display al-lakuna,
the display I object to. iPhone and Notepad on Windows 10 display
the vowel in the middle and display al-lakuna (possibly ligated), which
is the solution I prefer.

> Hope this helps.

Well, it has prompted me to find a 'me-too' argument for improving the
rendering.  I wanted a standards-based argument.

>> Missing touching consonants are being rendered almost as though
>> there were no ZWJ, but the combination of consonant and al-lakuna
>> is being rendered badly.

This looks like a common font problem.  Iskoola Pota does not suffer
from it.

Richard.


From unicode at unicode.org  Sun Oct 14 09:55:24 2018
From: unicode at unicode.org (Harshula via Unicode)
Date: Mon, 15 Oct 2018 01:55:24 +1100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <20181014010259.4fb5436a@JRWUBU2>
References: <20181014010259.4fb5436a@JRWUBU2>
Message-ID: <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>

Hi Richard,

1) From a pronunciation perspective, your first and third examples will
be similar. Your second example will be pronounced very differently. I
did some quick testing on Linux and reproduced the behaviour that you
observed.

2) Going back more than a decade, the state tables used by some
layout/shaping engines used the same 'virama' rules for North Indian
scripts and Sinhala. This resulted in undesirable *implicit* conjuncts
being created for Sinhala consonant clusters. That then resulted in
undesirable positioning of dependent vowels. e.g.
https://bugzilla.gnome.org/show_bug.cgi?id=161981

3) However, what you have observed is an issue with *explicit* conjunct
creation. After the segmentation is completed, the layout/shaping engine
needs to first check if there is a corresponding lookup for the explicit
conjunct, if not, then it needs to remove the ZWJ and redo the
segmentation and lookup(s). Perhaps that is not happening in Harfbuzz.

4) I've been out of the loop for many years, so I have CC'd Ruvan &
Harsha who may already be aware of what you have observed.

cya,
#

On 14/10/18 11:02 am, Richard Wordingham via Unicode wrote:
> Are there fallback rules for Sinhala consonant clusters?  There are
> fallback rules for Devanagari, but I'm not sure if they read across.
> 
> The problem I am seeing is that the Pali syllable 'ndhe' ????? <U+0DB1
> NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9
> KOMBUVA> is being rendered identically to a hypothetical Sinhalese
> 'n?dha' ??? <U+0DB1, U+0DDA DIGA KOMBUVA, U+0DB0>,  which in NFD is
> <U+0DB1, U+0DD9, U+0DCA, U+0DB0>, when I use a font that lacks the
> conjunct.  (Most fonts lack the conjunct.)  The Devanagari rules and my
> preference would lead to a fallback rendering as ????  (Sinhalese
> 'ndhe'), which is encoded as <U+0DB1 NAYANNA, U+0DCA AL-LAKUNA, U+0DB0
> MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>.  Is the rendering I am getting
> technically wrong, or is it merely undesirable?
> 
> The ambiguity arises in part because, like the Brahmi script, the
> Sinhala script uses its virama character as a vowel length indicator.
> 
> Missing touching consonants are being rendered almost as though there
> were no ZWJ, but the combination of consonant and al-lakuna is being
> rendered badly.
> 
> Richard.
> 

From unicode at unicode.org  Sun Oct 14 14:10:45 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sun, 14 Oct 2018 13:10:45 -0600
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
Message-ID: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>

Steffen Nurpmeso wrote:

> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> (MIME) Part One: Format of Internet Message Bodies).

Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data 
Encodings." RFC 2045 defines a particular implementation of base64, 
specific to transporting Internet mail in a 7-bit environment.

RFC 4648 discusses many of the "higher-level protocol" topics that some 
people are focusing on, such as separating the base64-encoded output 
into lines of length 72 (or other), alternative target code unit sets or 
"alphabets," and padding characters. It would be helpful for everyone to 
read this particular RFC before concluding that these topics have not 
been considered, or that they compromise round-tripping or other 
characteristics of base64.

I had assumed that when Roger asked about "base64 encoding," he was 
asking about the basic definition of base64.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Sun Oct 14 16:50:52 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 14 Oct 2018 23:50:52 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
Message-ID: <CAGa7JC3_=aEqdFagzyVuyTZVqzjfzmqg3W8_Ycppzmn-2PPFFQ@mail.gmail.com>

It's also interesting to look at https://tools.ietf.org/html/rfc3501
- which defines (for IMAP v4) another "BASE64" encoding,
- and also defines a "Modified UTF-7" encoding using it, deviating from
Unicode's definition of UTF-7,
- and adding other requirements (which forbids alternate encodings
permitted in UTF-7 and all other Base64 variants, including those used in
MIME/RFC 2045 or SMTP, used in strong relations with IMAP !).

And nothing in RFC 4648 is clear about the fact that it only covers the
encoding of "octets streams" and not "bits streams". It also does not
discuss the adaptation for "Base64" for transport and storage (needed for
MIME, IMAP, but also in HTTP, and in several file/data formats including
XML, or digital signatures).

That RFC 4648 is only superficial, and does not cover everything (even
Unicode has its own definition for UTF-7 and also allows variations).

As we are on this Unicode list, the definition used by Unicode (more in
line with MIME), does not follow at all those in RFC 4648.
Most uses of Base64 encodings are based on the original MIME definition,
and all of them perform new adaptations. (Even the definition of "Base16"
in RFC4648 contradicts most other definitions).


Le dim. 14 oct. 2018 ? 21:21, Doug Ewell via Unicode <unicode at unicode.org>
a ?crit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>
> RFC 4648 discusses many of the "higher-level protocol" topics that some
> people are focusing on, such as separating the base64-encoded output
> into lines of length 72 (or other), alternative target code unit sets or
> "alphabets," and padding characters. It would be helpful for everyone to
> read this particular RFC before concluding that these topics have not
> been considered, or that they compromise round-tripping or other
> characteristics of base64.
>
> I had assumed that when Roger asked about "base64 encoding," he was
> asking about the basic definition of base64.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181014/75d82a42/attachment.html>

From unicode at unicode.org  Sun Oct 14 16:59:05 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 14 Oct 2018 23:59:05 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
Message-ID: <CAGa7JC2KTrMSE03QMe=347O9NgbDa75Co3+WcjMKmRd4_164MA@mail.gmail.com>

Le dim. 14 oct. 2018 ? 21:21, Doug Ewell via Unicode <unicode at unicode.org>
a ?crit :

> Steffen Nurpmeso wrote:
>
> > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
> > (MIME) Part One: Format of Internet Message Bodies).
>
> Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
> Encodings." RFC 2045 defines a particular implementation of base64,
> specific to transporting Internet mail in a 7-bit environment.
>

Wrong, this is "specific" to transporting Internet mail in any 7 bit or 8
bit environment (today almost all mail agents are operating in 8 bit), and
then it is referenced directly by HTTP (and its HTTPS variant).

So this is no so "specific". MIME is extremely popular, RFC 4648 is
extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
specific as it is now a very popular protocol, widely used as well). MIME
is so frequently used, that almost all people refer to it when they look
for Base64, or do not explicitly state that another definition (found in an
exotic RFC) is explicitly used.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181014/9ad69c22/attachment.html>

From unicode at unicode.org  Sun Oct 14 20:56:15 2018
From: unicode at unicode.org (Tex via Unicode)
Date: Sun, 14 Oct 2018 18:56:15 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
Message-ID: <000601d4642a$4274ec70$c75ec550$@xencraft.com>

Philippe,

 
Where is the use of whitespace or the idea that 1-byte pieces do not need all the equal sign paddings documented?

I read the rfc 3501 you pointed at, I don?t see it there.

 
Are these part of any standards? Or are you claiming these are practices despite the standards? If so, are these just tolerated by parsers, or are they actually generated by encoders?

 
What would be the rationale for supporting unnecessary whitespace? If linebreaks are forced at some line length they can presumably be removed at that length and not treated as part of the encoding.

Maybe we differ on define where the encoding begins and ends, and where higher level protocols prescribe how they are embedded within the protocol.

 
Tex

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

 
Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails).

 
So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries).

 
Also:

- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol)

- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol)

 
So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload.

 
Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start):

- Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream;

- for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input).

- these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself.

 
You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done:

- as a separate payload (not encoded by Base64), or

- by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding.

- by appending 3 bits at end of the  bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.

Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added)

 
Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload.

 
All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and to support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. And this is the case for the MIME transport by mail, or for HTTP and FTP transports, or file/media storage formats even if the file is stored on the same system, because it may actually be a copy stored locally but coming from another system where the file was actually encoded).

 
Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it actually does not mandate an exact bit-length because the range does not fully fit exactly to 21 bits and an encoder can still pack multiple code points together into more compact code units.

 
However Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder.

 
But many other encodings are still possible (and can be conforming to Unicode, provided they preserve each Unicode scalar value, or at least the code point identity because an encoder/decoder is not required to support non-character code points such as surrogates or U+FFFE), where Base64 may be used for internally generated octets-streams.

 
Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <unicode at unicode.org> a ?crit :

On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> unicode at unicode.org> a ?crit :
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken aboutr the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"     is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


????
-- 
??????? 
??????? 10 people enter a bar: 1 who understands binary,
??????? 1 who doesn't, D who prefer to write it as hex,
??????? and 1 who narrowly avoided an off-by-one error.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181014/1b52abfe/attachment.html>

From unicode at unicode.org  Mon Oct 15 02:53:59 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 15 Oct 2018 08:53:59 +0100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
Message-ID: <20181015085359.339c5747@JRWUBU2>

On Mon, 15 Oct 2018 01:55:24 +1100
Harshula via Unicode <unicode at unicode.org> wrote:

> 3) However, what you have observed is an issue with *explicit*
> conjunct creation. After the segmentation is completed, the
> layout/shaping engine needs to first check if there is a
> corresponding lookup for the explicit conjunct, if not, then it needs
> to remove the ZWJ and redo the segmentation and lookup(s). Perhaps
> that is not happening in Harfbuzz.

This indeed seems to be the problem with HarfBuzz and with Windows 7
Uniscribe.  Curiously, they almost adopt this behaviour when touching
letters are not available.  (The ZWJ seems not to be completely removed
- in HarfBuzz at least it can result in the al-lakuna not interacting
properly with the base character.)

But where is this usually useful behaviour specified?

1.  There may be nothing but time and money to stop fallbacks being
built into the font.  For example, what prohibits the rendering of a
conjunct falling back to touching letters or a missing glyph symbol?

2. One could argue that the current behaviour falls back to a <consonant,
al-lakuna, consonant> display; Pali in Thai script does use sequences
of <left matra, consonant, vowel-killer, consonant>. The problem is
that al-lakuna also acts as a vowel modifier.

3. What stops one arguing that a conjunct is an abstract
character and that to render it with a sequence using a visible
al-lakuna would violate its identity? 

Richard.

From unicode at unicode.org  Mon Oct 15 06:13:41 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 15 Oct 2018 13:13:41 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <000601d4642a$4274ec70$c75ec550$@xencraft.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
Message-ID: <CAGa7JC2ZnswnE7otZLXiPKrDFjw8qAL_poSs0KH91TXEK+hhxQ@mail.gmail.com>

Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st
sentence, it is explicitly stated :

In some circumstances, the use of padding ("=") in base-encoded data
is not required or used.


Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don?t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> formats need to how to properly encode which possible option is used to
> encode bits-streams of arbitrary length, or need to specify which default
> choice to apply if this option is not encoded, or which option must be used
> (with no possible variation). And this also adds to the number of distinct
> encodings that are possible but are still equivalent for the same effective
> bits-stream payload.
>
>
>
> All these allowed variations are from the encoder perspective. For
> interoperability, the decoder has to be flexible and to support various
> options to be compatible with different implementations of the encoder,
> notably when the encoder was run on a different system. And this is the
> case for the MIME transport by mail, or for HTTP and FTP transports, or
> file/media storage formats even if the file is stored on the same system,
> because it may actually be a copy stored locally but coming from another
> system where the file was actually encoded).
>
>
>
> Now if we come back to the encoding of plain-text payloads, Unicode just
> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
> points (it actually does not mandate an exact bit-length because the range
> does not fully fit exactly to 21 bits and an encoder can still pack
> multiple code points together into more compact code units.
>
>
>
> However Unicode provides and standardizes several encodings (UTF-8/16/32)
> which use code units whose size is directly suitable as input for an
> octets-stream, so that they are directly encodable with Base64, without
> having to specify an extra layer for the bits-stream encoder/decoder.
>
>
>
> But many other encodings are still possible (and can be conforming to
> Unicode, provided they preserve each Unicode scalar value, or at least the
> code point identity because an encoder/decoder is not required to support
> non-character code points such as surrogates or U+FFFE), where Base64 may
> be used for internally generated octets-streams.
>
>
>
>
>
> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <
> unicode at unicode.org> a ?crit :
>
> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> > unicode at unicode.org> a ?crit :
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken aboutr the encoding of new lines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if
> the
> > whole text was an opaque binary object, their initial text-encoding will
> be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > in fragments with maximum binary size so that these binary fragments can
> be
> > converted with Base64 into lines with maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but
> still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which cases it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow"     is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ????
> --
> ???????
> ??????? 10 people enter a bar: 1 who understands binary,
> ??????? 1 who doesn't, D who prefer to write it as hex,
> ??????? and 1 who narrowly avoided an off-by-one error.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/5462df58/attachment.html>

From unicode at unicode.org  Mon Oct 15 06:24:38 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 15 Oct 2018 13:24:38 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <000601d4642a$4274ec70$c75ec550$@xencraft.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
Message-ID: <CAGa7JC2tCx_gfWaEySr_TytwToidS8LKzvqHr35uifttaBdAjA@mail.gmail.com>

Also the rationale for supporting "unnecessary" whitespace is found in
MIME's version of Base64, also in RFCs describing encoding formats for
digital certificates, or for exchanging public keys in encryption
algorithms like PGP (notably, but not only, as texts in the body of emails
or in documentations and websites).

Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :

> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don?t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> formats need to how to properly encode which possible option is used to
> encode bits-streams of arbitrary length, or need to specify which default
> choice to apply if this option is not encoded, or which option must be used
> (with no possible variation). And this also adds to the number of distinct
> encodings that are possible but are still equivalent for the same effective
> bits-stream payload.
>
>
>
> All these allowed variations are from the encoder perspective. For
> interoperability, the decoder has to be flexible and to support various
> options to be compatible with different implementations of the encoder,
> notably when the encoder was run on a different system. And this is the
> case for the MIME transport by mail, or for HTTP and FTP transports, or
> file/media storage formats even if the file is stored on the same system,
> because it may actually be a copy stored locally but coming from another
> system where the file was actually encoded).
>
>
>
> Now if we come back to the encoding of plain-text payloads, Unicode just
> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
> points (it actually does not mandate an exact bit-length because the range
> does not fully fit exactly to 21 bits and an encoder can still pack
> multiple code points together into more compact code units.
>
>
>
> However Unicode provides and standardizes several encodings (UTF-8/16/32)
> which use code units whose size is directly suitable as input for an
> octets-stream, so that they are directly encodable with Base64, without
> having to specify an extra layer for the bits-stream encoder/decoder.
>
>
>
> But many other encodings are still possible (and can be conforming to
> Unicode, provided they preserve each Unicode scalar value, or at least the
> code point identity because an encoder/decoder is not required to support
> non-character code points such as surrogates or U+FFFE), where Base64 may
> be used for internally generated octets-streams.
>
>
>
>
>
> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <
> unicode at unicode.org> a ?crit :
>
> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> > unicode at unicode.org> a ?crit :
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken aboutr the encoding of new lines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if
> the
> > whole text was an opaque binary object, their initial text-encoding will
> be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > in fragments with maximum binary size so that these binary fragments can
> be
> > converted with Base64 into lines with maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but
> still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which cases it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow"     is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ????
> --
> ???????
> ??????? 10 people enter a bar: 1 who understands binary,
> ??????? 1 who doesn't, D who prefer to write it as hex,
> ??????? and 1 who narrowly avoided an off-by-one error.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/7398da88/attachment.html>

From unicode at unicode.org  Mon Oct 15 06:57:14 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 15 Oct 2018 13:57:14 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC2tCx_gfWaEySr_TytwToidS8LKzvqHr35uifttaBdAjA@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
 <CAGa7JC2tCx_gfWaEySr_TytwToidS8LKzvqHr35uifttaBdAjA@mail.gmail.com>
Message-ID: <CAGa7JC17WjAO-GvxzmEX85H5bYDA77QJ2c7pA0j=Vh5RmDm1ig@mail.gmail.com>

If you want an example where padding with "=" is not used at all,
- look into URL-shortening schemes
- look into database fields or data input forms and numerous data formats
where the "=" sign is restricted (just like in URLs and file paths, or in
identifiers)
Padding is not used anywhere in the middle of the binary encoding or even
at end, only the 64 symbols of the encoding alphabet are needed and the
extra 2 or 4 lowest bits that may be encoded in the last character of the
encoded sequence are discarded by the decoder (these extra bits are not
necessarily set to 0 by encoders in the last symbol, even if this is the
canonical form recommanded in encoders, their value is simply ignored by
decoders).
Some Base64 encoders do not necessarily encode binary octets-streams, but
bits-streams whose length in bits is not necessarily multiple of 8, in
which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
symbol of the encoded sequence.
Other encoders use streams of binary code units that are larger than 8
bits, and may want to encode more padding symbols to force the alignment of
data required in their associated decoders, or will choose to not use any
padding at all, letting the decoder discard the trailing bits themselves at
end of the encoded stream.

Le lun. 15 oct. 2018 ? 13:24, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> Also the rationale for supporting "unnecessary" whitespace is found in
> MIME's version of Base64, also in RFCs describing encoding formats for
> digital certificates, or for exchanging public keys in encryption
> algorithms like PGP (notably, but not only, as texts in the body of emails
> or in documentations and websites).
>
> Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :
>
>> Philippe,
>>
>>
>>
>> Where is the use of whitespace or the idea that 1-byte pieces do not need
>> all the equal sign paddings documented?
>>
>> I read the rfc 3501 you pointed at, I don?t see it there.
>>
>>
>>
>> Are these part of any standards? Or are you claiming these are practices
>> despite the standards? If so, are these just tolerated by parsers, or are
>> they actually generated by encoders?
>>
>>
>>
>> What would be the rationale for supporting unnecessary whitespace? If
>> linebreaks are forced at some line length they can presumably be removed at
>> that length and not treated as part of the encoding.
>>
>> Maybe we differ on define where the encoding begins and ends, and where
>> higher level protocols prescribe how they are embedded within the protocol.
>>
>>
>>
>> Tex
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
>> Verdy via Unicode
>> *Sent:* Sunday, October 14, 2018 1:41 AM
>> *To:* Adam Borowski
>> *Cc:* unicode Unicode Discussion
>> *Subject:* Re: Base64 encoding applied to different unicode texts always
>> yields different base64 texts ... true or false?
>>
>>
>>
>> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
>> enough to indicate the end of an octets-span. The extra = after it do not
>> add any other octet. and as well you're allowed to insert whitespaces
>> anywhere in the encoded stream (this is what ensures that the
>> Base64-encoded octets-stream will not be altered if line breaks are forced
>> anywhere (notably within the body of emails).
>>
>>
>>
>> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB,
>> CR, LF, NEL) in the middle is non-significant and ignorable on decoding
>> (their "encoded" bit length is 0 and they don't terminate an octets-span,
>> unlike "=" which discards extra bits remaining from the encoded stream
>> before that are not on 8-bit boundaries).
>>
>>
>>
>> Also:
>>
>> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X"
>> symbol before "=" can vary in its 4 lowest bits (which are then
>> ignored/discarded by the "=" symbol)
>>
>> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
>> symbol before "=" can vary in its 2 lowest bits (which are then
>> ignored/discarded by the "=" symbol)
>>
>>
>>
>> So you can use Base64 by encoding each octet in separate pieces, as one
>> Base64 symbol followed by an "=" symbol, and even insert any number of
>> whitespaces between them: there's a infinite number of valid Base64
>> encodings for representing the same octets-stream payload.
>>
>>
>>
>> Base64 allows encoding any octets streams but not directly any
>> bits-streams : it assumes that the effective bits-stream has a binary
>> length multiple of 8. To encode a bits-stream with an exact number of bits
>> (not multiple of 8), you need to encode an extra payload to indicate the
>> effective number of bits to keep at end of the encoded octets-stream (or at
>> start):
>>
>> - Base64 does not specify how you convert a bitstream of arbitrary length
>> to an octets-stream;
>>
>> - for that purpose, you may need to pad the bits-stream at start or at
>> end with 1 to 6 bits (so that it the resulting bitstream has a length
>> multiple of 8, then encodable with Base64 which takes only octets on input).
>>
>> - these extra padding bits are not significant for the original
>> bitstream, but are significant for the Base64 encoder/decoder, they will be
>> discarded by the bitstream decoder built on top of the Base64 decoder, but
>> not by the Base64 decoder itself.
>>
>>
>>
>> You need to encode somewhere with the bitstream encoder how many padding
>> bits (0 to 7) are present at start or end of the octets-stream; this can be
>> done:
>>
>> - as a separate payload (not encoded by Base64), or
>>
>> - by prepending 3 bits at start of the bits-stream then padded at end
>> with 1 to 7 random bits to get a bit-length multiple of 8 suitable for
>> Base64 encoding.
>>
>> - by appending 3 bits at end of the  bits-stream, just after 1 to 7
>> random bits needed to get a bit-length multiple of 8 suitable for Base64
>> encoding.
>>
>> Finally your bits-stream decoder will be able to use this padding count
>> to discard these random padding bits (and possibly realign the stream on
>> different byte-boundaries when the effective bitlength bits-stream payload
>> is not a multiple of 8 and padding bits were added)
>>
>>
>>
>> Base64 also does not specify how bits of the original bits-stream payload
>> are packed into the octets-stream input suitable for Base64-encoding,
>> notably it does not specify their order and endian-ness. The same remark
>> applies as well for MIME, HTTP. So lot of network protocols and file
>> formats need to how to properly encode which possible option is used to
>> encode bits-streams of arbitrary length, or need to specify which default
>> choice to apply if this option is not encoded, or which option must be used
>> (with no possible variation). And this also adds to the number of distinct
>> encodings that are possible but are still equivalent for the same effective
>> bits-stream payload.
>>
>>
>>
>> All these allowed variations are from the encoder perspective. For
>> interoperability, the decoder has to be flexible and to support various
>> options to be compatible with different implementations of the encoder,
>> notably when the encoder was run on a different system. And this is the
>> case for the MIME transport by mail, or for HTTP and FTP transports, or
>> file/media storage formats even if the file is stored on the same system,
>> because it may actually be a copy stored locally but coming from another
>> system where the file was actually encoded).
>>
>>
>>
>> Now if we come back to the encoding of plain-text payloads, Unicode just
>> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
>> points (it actually does not mandate an exact bit-length because the range
>> does not fully fit exactly to 21 bits and an encoder can still pack
>> multiple code points together into more compact code units.
>>
>>
>>
>> However Unicode provides and standardizes several encodings (UTF-8/16/32)
>> which use code units whose size is directly suitable as input for an
>> octets-stream, so that they are directly encodable with Base64, without
>> having to specify an extra layer for the bits-stream encoder/decoder.
>>
>>
>>
>> But many other encodings are still possible (and can be conforming to
>> Unicode, provided they preserve each Unicode scalar value, or at least the
>> code point identity because an encoder/decoder is not required to support
>> non-character code points such as surrogates or U+FFFE), where Base64 may
>> be used for internally generated octets-streams.
>>
>>
>>
>>
>>
>> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <
>> unicode at unicode.org> a ?crit :
>>
>> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode
>> wrote:
>> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
>> > unicode at unicode.org> a ?crit :
>> > > The only variance is described as:
>> > >
>> > >   Care must be taken to use the proper octets for line breaks if
>> base64
>> > >   encoding is applied directly to text material that has not been
>> > >   converted to canonical form.  In particular, text line breaks must
>> be
>> > >   converted into CRLF sequences prior to base64 encoding.  The
>> > >   important thing to note is that this may be done directly by the
>> > >   encoder rather than in a prior canonicalization step in some
>> > >   implementations.
>> > >
>> > > This is MIME, it specifies (in the same RFC):
>> >
>> > I've not spoken aboutr the encoding of new lines **in the actual encoded
>> > text**:
>> > -  if their existing text-encoding ever gets converted to Base64 as if
>> the
>> > whole text was an opaque binary object, their initial text-encoding
>> will be
>> > preserved (so yes it will preserve the way these embedded newlines are
>> > encoded as CR, LF, CR+LF, NL...)
>> >
>> > I spoke about newlines used in the transport syntax to split the initial
>> > binary object (which may actually contain text but it does not matter).
>> > MIME defines this operation and even requires splitting the binary
>> object
>> > in fragments with maximum binary size so that these binary fragments
>> can be
>> > converted with Base64 into lines with maximum length. In the MIME Base64
>> > representation you can insert newlines anywhere between fragments
>> encoded
>> > separately.
>>
>> There's another kind of fragmentation that can make the encoding differ
>> (but
>> still decode to the same payload):
>>
>> The data stream gets split into 3-byte internal, 4-byte external packets.
>> Any packet may contain less than those 3 bytes, in which cases it is
>> padded
>> with = characters:
>> 3 bytes XXXX
>> 2 bytes XXX=
>> 1 byte  XX==
>>
>> Usually, such smaller packets happen only at the end of a message, but to
>> support encoding a stream piecewise, they are allowed at any point.
>>
>> For example:
>> "meow"     is bWVvdw==
>> "me""ow"   is bWU=b3c=
>> yet both carry the same payload.
>>
>> > Base64 is used exactly to support this flexibility in transport (or
>> > storage) without altering any bit of the initial content once it is
>> > decoded.
>>
>> Right, any such variations are in packaging only.
>>
>>
>> ????
>> --
>> ???????
>> ??????? 10 people enter a bar: 1 who understands binary,
>> ??????? 1 who doesn't, D who prefer to write it as hex,
>> ??????? and 1 who narrowly avoided an off-by-one error.
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/b5d9df9e/attachment.html>

From unicode at unicode.org  Mon Oct 15 07:11:58 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 15 Oct 2018 14:11:58 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC17WjAO-GvxzmEX85H5bYDA77QJ2c7pA0j=Vh5RmDm1ig@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
 <CAGa7JC2tCx_gfWaEySr_TytwToidS8LKzvqHr35uifttaBdAjA@mail.gmail.com>
 <CAGa7JC17WjAO-GvxzmEX85H5bYDA77QJ2c7pA0j=Vh5RmDm1ig@mail.gmail.com>
Message-ID: <CAGa7JC1xvoe_hAZo+2Bsm-iXxxDFrTEE+7zAu4=n4d-XgtXpvg@mail.gmail.com>

Note that all these discussion about padding applies to all other base-N
encodings, including base-10.

For example to represent numbers of arbitrary precision: padding does not
require a separate symbol but can use the "0" digit which is part of the
10-symbols alphabet, or encoders can discard them on the left, or on the
right if there's a decimal dot; when the precision is less than a integral
number of decimal digits, the extra bits or fractional bits of information
in the last digit of the encoded sequence does not matter, encoders may
choose to not set them to 0 but may prefer to use rounding which may
conditionally set these bits to 1, depedning on the value of the last
significant bits or fractional bits of maximum precision.

As well the same decoders may want to use extra whitespaces (notably to
limit line lengths at arbitrary lengths, notably for embedding the encoded
sequences in printed documents or documents with a page layout and rendered
with a readable font size suitable for the page width, or for presentation
purpose by grouping symbols).

In summary, padding is not required at all by all Base-N encoders/decoders,
and non significant whitespace is frequently needed.


Le lun. 15 oct. 2018 ? 13:57, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> If you want an example where padding with "=" is not used at all,
> - look into URL-shortening schemes
> - look into database fields or data input forms and numerous data formats
> where the "=" sign is restricted (just like in URLs and file paths, or in
> identifiers)
> Padding is not used anywhere in the middle of the binary encoding or even
> at end, only the 64 symbols of the encoding alphabet are needed and the
> extra 2 or 4 lowest bits that may be encoded in the last character of the
> encoded sequence are discarded by the decoder (these extra bits are not
> necessarily set to 0 by encoders in the last symbol, even if this is the
> canonical form recommanded in encoders, their value is simply ignored by
> decoders).
> Some Base64 encoders do not necessarily encode binary octets-streams, but
> bits-streams whose length in bits is not necessarily multiple of 8, in
> which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last
> symbol of the encoded sequence.
> Other encoders use streams of binary code units that are larger than 8
> bits, and may want to encode more padding symbols to force the alignment of
> data required in their associated decoders, or will choose to not use any
> padding at all, letting the decoder discard the trailing bits themselves at
> end of the encoded stream.
>
> Le lun. 15 oct. 2018 ? 13:24, Philippe Verdy <verdy_p at wanadoo.fr> a
> ?crit :
>
>> Also the rationale for supporting "unnecessary" whitespace is found in
>> MIME's version of Base64, also in RFCs describing encoding formats for
>> digital certificates, or for exchanging public keys in encryption
>> algorithms like PGP (notably, but not only, as texts in the body of emails
>> or in documentations and websites).
>>
>> Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :
>>
>>> Philippe,
>>>
>>>
>>>
>>> Where is the use of whitespace or the idea that 1-byte pieces do not
>>> need all the equal sign paddings documented?
>>>
>>> I read the rfc 3501 you pointed at, I don?t see it there.
>>>
>>>
>>>
>>> Are these part of any standards? Or are you claiming these are practices
>>> despite the standards? If so, are these just tolerated by parsers, or are
>>> they actually generated by encoders?
>>>
>>>
>>>
>>> What would be the rationale for supporting unnecessary whitespace? If
>>> linebreaks are forced at some line length they can presumably be removed at
>>> that length and not treated as part of the encoding.
>>>
>>> Maybe we differ on define where the encoding begins and ends, and where
>>> higher level protocols prescribe how they are embedded within the protocol.
>>>
>>>
>>>
>>> Tex
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
>>> Verdy via Unicode
>>> *Sent:* Sunday, October 14, 2018 1:41 AM
>>> *To:* Adam Borowski
>>> *Cc:* unicode Unicode Discussion
>>> *Subject:* Re: Base64 encoding applied to different unicode texts
>>> always yields different base64 texts ... true or false?
>>>
>>>
>>>
>>> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
>>> enough to indicate the end of an octets-span. The extra = after it do not
>>> add any other octet. and as well you're allowed to insert whitespaces
>>> anywhere in the encoded stream (this is what ensures that the
>>> Base64-encoded octets-stream will not be altered if line breaks are forced
>>> anywhere (notably within the body of emails).
>>>
>>>
>>>
>>> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB,
>>> CR, LF, NEL) in the middle is non-significant and ignorable on decoding
>>> (their "encoded" bit length is 0 and they don't terminate an octets-span,
>>> unlike "=" which discards extra bits remaining from the encoded stream
>>> before that are not on 8-bit boundaries).
>>>
>>>
>>>
>>> Also:
>>>
>>> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X"
>>> symbol before "=" can vary in its 4 lowest bits (which are then
>>> ignored/discarded by the "=" symbol)
>>>
>>> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
>>> symbol before "=" can vary in its 2 lowest bits (which are then
>>> ignored/discarded by the "=" symbol)
>>>
>>>
>>>
>>> So you can use Base64 by encoding each octet in separate pieces, as one
>>> Base64 symbol followed by an "=" symbol, and even insert any number of
>>> whitespaces between them: there's a infinite number of valid Base64
>>> encodings for representing the same octets-stream payload.
>>>
>>>
>>>
>>> Base64 allows encoding any octets streams but not directly any
>>> bits-streams : it assumes that the effective bits-stream has a binary
>>> length multiple of 8. To encode a bits-stream with an exact number of bits
>>> (not multiple of 8), you need to encode an extra payload to indicate the
>>> effective number of bits to keep at end of the encoded octets-stream (or at
>>> start):
>>>
>>> - Base64 does not specify how you convert a bitstream of arbitrary
>>> length to an octets-stream;
>>>
>>> - for that purpose, you may need to pad the bits-stream at start or at
>>> end with 1 to 6 bits (so that it the resulting bitstream has a length
>>> multiple of 8, then encodable with Base64 which takes only octets on input).
>>>
>>> - these extra padding bits are not significant for the original
>>> bitstream, but are significant for the Base64 encoder/decoder, they will be
>>> discarded by the bitstream decoder built on top of the Base64 decoder, but
>>> not by the Base64 decoder itself.
>>>
>>>
>>>
>>> You need to encode somewhere with the bitstream encoder how many padding
>>> bits (0 to 7) are present at start or end of the octets-stream; this can be
>>> done:
>>>
>>> - as a separate payload (not encoded by Base64), or
>>>
>>> - by prepending 3 bits at start of the bits-stream then padded at end
>>> with 1 to 7 random bits to get a bit-length multiple of 8 suitable for
>>> Base64 encoding.
>>>
>>> - by appending 3 bits at end of the  bits-stream, just after 1 to 7
>>> random bits needed to get a bit-length multiple of 8 suitable for Base64
>>> encoding.
>>>
>>> Finally your bits-stream decoder will be able to use this padding count
>>> to discard these random padding bits (and possibly realign the stream on
>>> different byte-boundaries when the effective bitlength bits-stream payload
>>> is not a multiple of 8 and padding bits were added)
>>>
>>>
>>>
>>> Base64 also does not specify how bits of the original bits-stream
>>> payload are packed into the octets-stream input suitable for
>>> Base64-encoding, notably it does not specify their order and endian-ness.
>>> The same remark applies as well for MIME, HTTP. So lot of network protocols
>>> and file formats need to how to properly encode which possible option is
>>> used to encode bits-streams of arbitrary length, or need to specify which
>>> default choice to apply if this option is not encoded, or which option must
>>> be used (with no possible variation). And this also adds to the number of
>>> distinct encodings that are possible but are still equivalent for the same
>>> effective bits-stream payload.
>>>
>>>
>>>
>>> All these allowed variations are from the encoder perspective. For
>>> interoperability, the decoder has to be flexible and to support various
>>> options to be compatible with different implementations of the encoder,
>>> notably when the encoder was run on a different system. And this is the
>>> case for the MIME transport by mail, or for HTTP and FTP transports, or
>>> file/media storage formats even if the file is stored on the same system,
>>> because it may actually be a copy stored locally but coming from another
>>> system where the file was actually encoded).
>>>
>>>
>>>
>>> Now if we come back to the encoding of plain-text payloads, Unicode just
>>> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
>>> points (it actually does not mandate an exact bit-length because the range
>>> does not fully fit exactly to 21 bits and an encoder can still pack
>>> multiple code points together into more compact code units.
>>>
>>>
>>>
>>> However Unicode provides and standardizes several encodings
>>> (UTF-8/16/32) which use code units whose size is directly suitable as input
>>> for an octets-stream, so that they are directly encodable with Base64,
>>> without having to specify an extra layer for the bits-stream
>>> encoder/decoder.
>>>
>>>
>>>
>>> But many other encodings are still possible (and can be conforming to
>>> Unicode, provided they preserve each Unicode scalar value, or at least the
>>> code point identity because an encoder/decoder is not required to support
>>> non-character code points such as surrogates or U+FFFE), where Base64 may
>>> be used for internally generated octets-streams.
>>>
>>>
>>>
>>>
>>>
>>> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <
>>> unicode at unicode.org> a ?crit :
>>>
>>> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode
>>> wrote:
>>> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
>>> > unicode at unicode.org> a ?crit :
>>> > > The only variance is described as:
>>> > >
>>> > >   Care must be taken to use the proper octets for line breaks if
>>> base64
>>> > >   encoding is applied directly to text material that has not been
>>> > >   converted to canonical form.  In particular, text line breaks must
>>> be
>>> > >   converted into CRLF sequences prior to base64 encoding.  The
>>> > >   important thing to note is that this may be done directly by the
>>> > >   encoder rather than in a prior canonicalization step in some
>>> > >   implementations.
>>> > >
>>> > > This is MIME, it specifies (in the same RFC):
>>> >
>>> > I've not spoken aboutr the encoding of new lines **in the actual
>>> encoded
>>> > text**:
>>> > -  if their existing text-encoding ever gets converted to Base64 as if
>>> the
>>> > whole text was an opaque binary object, their initial text-encoding
>>> will be
>>> > preserved (so yes it will preserve the way these embedded newlines are
>>> > encoded as CR, LF, CR+LF, NL...)
>>> >
>>> > I spoke about newlines used in the transport syntax to split the
>>> initial
>>> > binary object (which may actually contain text but it does not matter).
>>> > MIME defines this operation and even requires splitting the binary
>>> object
>>> > in fragments with maximum binary size so that these binary fragments
>>> can be
>>> > converted with Base64 into lines with maximum length. In the MIME
>>> Base64
>>> > representation you can insert newlines anywhere between fragments
>>> encoded
>>> > separately.
>>>
>>> There's another kind of fragmentation that can make the encoding differ
>>> (but
>>> still decode to the same payload):
>>>
>>> The data stream gets split into 3-byte internal, 4-byte external packets.
>>> Any packet may contain less than those 3 bytes, in which cases it is
>>> padded
>>> with = characters:
>>> 3 bytes XXXX
>>> 2 bytes XXX=
>>> 1 byte  XX==
>>>
>>> Usually, such smaller packets happen only at the end of a message, but to
>>> support encoding a stream piecewise, they are allowed at any point.
>>>
>>> For example:
>>> "meow"     is bWVvdw==
>>> "me""ow"   is bWU=b3c=
>>> yet both carry the same payload.
>>>
>>> > Base64 is used exactly to support this flexibility in transport (or
>>> > storage) without altering any bit of the initial content once it is
>>> > decoded.
>>>
>>> Right, any such variations are in packaging only.
>>>
>>>
>>> ????
>>> --
>>> ???????
>>> ??????? 10 people enter a bar: 1 who understands binary,
>>> ??????? 1 who doesn't, D who prefer to write it as hex,
>>> ??????? and 1 who narrowly avoided an off-by-one error.
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/dce4647f/attachment.html>

From unicode at unicode.org  Mon Oct 15 08:02:08 2018
From: unicode at unicode.org (Tex via Unicode)
Date: Mon, 15 Oct 2018 06:02:08 -0700
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC2ZnswnE7otZLXiPKrDFjw8qAL_poSs0KH91TXEK+hhxQ@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
 <CAGa7JC2ZnswnE7otZLXiPKrDFjw8qAL_poSs0KH91TXEK+hhxQ@mail.gmail.com>
Message-ID: <002801d46487$4821e350$d865a9f0$@xencraft.com>

Philippe, quote the entire section:

 
In some circumstances, the use of padding ("=") in base-encoded data

   is not required or used.  In the general case, when assumptions about

   the size of transported data cannot be made, padding is required to

   yield correct decoded data.

 
   Implementations MUST include appropriate pad characters at the end of

   encoded data unless the specification referring to this document

   explicitly states otherwise.

 
The first para clarifies that padding is required when the length is not otherwise known. Only if the length is provided or predefined can the padding be dropped.

The second para clarifies it must be included unless the higher level protocol states otherwise, in which case it is likely using another mechanism to define length.

 
It doesn?t seem to me to be as open ended as you implied in your initial mails, but well-defined depending on whether base64 is being used as spec?d in the RFC, or being explicitly modified to suit an embedding protocol.

And certainly the first sentence in this section isn?t intended to be taken without the context of the rest of the section.

 
tex

 
From: Philippe Verdy [mailto:verdy_p at wanadoo.fr] 
Sent: Monday, October 15, 2018 4:14 AM
To: Tex Texin
Cc: Adam Borowski; unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

 
Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st sentence, it is explicitly stated :

 
In some circumstances, the use of padding ("=") in base-encoded data is not required or used.

 
Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :

Philippe,

 
Where is the use of whitespace or the idea that 1-byte pieces do not need all the equal sign paddings documented?

I read the rfc 3501 you pointed at, I don?t see it there.

 
Are these part of any standards? Or are you claiming these are practices despite the standards? If so, are these just tolerated by parsers, or are they actually generated by encoders?

 
What would be the rationale for supporting unnecessary whitespace? If linebreaks are forced at some line length they can presumably be removed at that length and not treated as part of the encoding.

Maybe we differ on define where the encoding begins and ends, and where higher level protocols prescribe how they are embedded within the protocol.

 
Tex

 
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode
Sent: Sunday, October 14, 2018 1:41 AM
To: Adam Borowski
Cc: unicode Unicode Discussion
Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

 
Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails).

 
So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries).

 
Also:

- For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol)

- For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol)

 
So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload.

 
Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start):

- Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream;

- for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input).

- these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself.

 
You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done:

- as a separate payload (not encoded by Base64), or

- by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding.

- by appending 3 bits at end of the  bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.

Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added)

 
Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload.

 
All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and to support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. And this is the case for the MIME transport by mail, or for HTTP and FTP transports, or file/media storage formats even if the file is stored on the same system, because it may actually be a copy stored locally but coming from another system where the file was actually encoded).

 
Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it actually does not mandate an exact bit-length because the range does not fully fit exactly to 21 bits and an encoder can still pack multiple code points together into more compact code units.

 
However Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder.

 
But many other encodings are still possible (and can be conforming to Unicode, provided they preserve each Unicode scalar value, or at least the code point identity because an encoder/decoder is not required to support non-character code points such as surrogates or U+FFFE), where Base64 may be used for internally generated octets-streams.

 
Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <unicode at unicode.org> a ?crit :

On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> unicode at unicode.org> a ?crit :
> > The only variance is described as:
> >
> >   Care must be taken to use the proper octets for line breaks if base64
> >   encoding is applied directly to text material that has not been
> >   converted to canonical form.  In particular, text line breaks must be
> >   converted into CRLF sequences prior to base64 encoding.  The
> >   important thing to note is that this may be done directly by the
> >   encoder rather than in a prior canonicalization step in some
> >   implementations.
> >
> > This is MIME, it specifies (in the same RFC):
> 
> I've not spoken aboutr the encoding of new lines **in the actual encoded
> text**:
> -  if their existing text-encoding ever gets converted to Base64 as if the
> whole text was an opaque binary object, their initial text-encoding will be
> preserved (so yes it will preserve the way these embedded newlines are
> encoded as CR, LF, CR+LF, NL...)
> 
> I spoke about newlines used in the transport syntax to split the initial
> binary object (which may actually contain text but it does not matter).
> MIME defines this operation and even requires splitting the binary object
> in fragments with maximum binary size so that these binary fragments can be
> converted with Base64 into lines with maximum length. In the MIME Base64
> representation you can insert newlines anywhere between fragments encoded
> separately.

There's another kind of fragmentation that can make the encoding differ (but
still decode to the same payload):

The data stream gets split into 3-byte internal, 4-byte external packets.
Any packet may contain less than those 3 bytes, in which cases it is padded
with = characters:
3 bytes XXXX
2 bytes XXX=
1 byte  XX==

Usually, such smaller packets happen only at the end of a message, but to
support encoding a stream piecewise, they are allowed at any point.

For example:
"meow"     is bWVvdw==
"me""ow"   is bWU=b3c=
yet both carry the same payload.

> Base64 is used exactly to support this flexibility in transport (or
> storage) without altering any bit of the initial content once it is
> decoded.

Right, any such variations are in packaging only.


????
-- 
??????? 
??????? 10 people enter a bar: 1 who understands binary,
??????? 1 who doesn't, D who prefer to write it as hex,
??????? and 1 who narrowly avoided an off-by-one error.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/3ae63f38/attachment-0001.html>

From unicode at unicode.org  Mon Oct 15 10:47:36 2018
From: unicode at unicode.org (Harshula via Unicode)
Date: Tue, 16 Oct 2018 02:47:36 +1100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <20181015085359.339c5747@JRWUBU2>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
Message-ID: <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>

Hi Richard,

On 15/10/18 6:53 pm, Richard Wordingham via Unicode wrote:
> On Mon, 15 Oct 2018 01:55:24 +1100
> Harshula via Unicode <unicode at unicode.org> wrote:
> 
>> 3) However, what you have observed is an issue with *explicit*
>> conjunct creation. After the segmentation is completed, the
>> layout/shaping engine needs to first check if there is a
>> corresponding lookup for the explicit conjunct, if not, then it needs
>> to remove the ZWJ and redo the segmentation and lookup(s). Perhaps
>> that is not happening in Harfbuzz.
> 
> This indeed seems to be the problem with HarfBuzz and with Windows 7
> Uniscribe.  Curiously, they almost adopt this behaviour when touching
> letters are not available.  (The ZWJ seems not to be completely removed
> - in HarfBuzz at least it can result in the al-lakuna not interacting
> properly with the base character.)
> 
> But where is this usually useful behaviour specified?
> 
> 1.  There may be nothing but time and money to stop fallbacks being
> built into the font.  For example, what prohibits the rendering of a
> conjunct falling back to touching letters or a missing glyph symbol?

I had not considered the missing glyph symbol. Perhaps that is the most
accurate solution when a font is missing a glyph during an *explicit*
conjunct lookup.

Note, touching letters are formed by <ZWJ><AL-LAKUNA>, so they should
not be displayed as a fallback for <AL-LAKUNA><ZWJ> conjuncts.

cya,
#

From unicode at unicode.org  Mon Oct 15 11:55:39 2018
From: unicode at unicode.org (Steffen Nurpmeso via Unicode)
Date: Mon, 15 Oct 2018 18:55:39 +0200
Subject: Base64 encoding applied to different unicode texts always
 yields different base64 texts ... true or false?
In-Reply-To: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
Message-ID: <20181015165539.v6fy9%steffen@sdaoden.eu>

Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554 at DougEwell>:
 |Steffen Nurpmeso wrote:
 |> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
 |> (MIME) Part One: Format of Internet Message Bodies).
 |
 |Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
 |Encodings." RFC 2045 defines a particular implementation of base64,
 |specific to transporting Internet mail in a 7-bit environment.
 |
 |RFC 4648 discusses many of the "higher-level protocol" topics that some
 |people are focusing on, such as separating the base64-encoded output
 |into lines of length 72 (or other), alternative target code unit sets or
 |"alphabets," and padding characters. It would be helpful for everyone to
 |read this particular RFC before concluding that these topics have not
 |been considered, or that they compromise round-tripping or other
 |characteristics of base64.
 |
 |I had assumed that when Roger asked about "base64 encoding," he was
 |asking about the basic definition of base64.

Sure; i have only followed the discussion superficially, and even
though everybody can read RFCs, i felt the necessity to polemicize
against the false however i look at it "MIME actually splits
a binary object into multiple fragments at random positions".
Solely my fault.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From unicode at unicode.org  Mon Oct 15 12:03:42 2018
From: unicode at unicode.org (Peter Saint-Andre via Unicode)
Date: Mon, 15 Oct 2018 11:03:42 -0600
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <CAGa7JC2KTrMSE03QMe=347O9NgbDa75Co3+WcjMKmRd4_164MA@mail.gmail.com>
References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell>
 <CAGa7JC2KTrMSE03QMe=347O9NgbDa75Co3+WcjMKmRd4_164MA@mail.gmail.com>
Message-ID: <25de2517-14f0-d05c-9ece-02e9644dad6a@mozilla.com>

On 10/14/18 3:59 PM, Philippe Verdy via Unicode wrote:
> 
> 
> Le?dim. 14 oct. 2018 ??21:21, Doug Ewell via Unicode
> <unicode at unicode.org <mailto:unicode at unicode.org>> a ?crit?:
> 
>     Steffen Nurpmeso wrote:
> 
>     > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
>     > (MIME) Part One: Format of Internet Message Bodies).
> 
>     Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
>     Encodings." RFC 2045 defines a particular implementation of base64,
>     specific to transporting Internet mail in a 7-bit environment.
> 
> 
> Wrong, this is "specific" to transporting Internet mail in any 7 bit or
> 8 bit environment (today almost all mail agents are operating in 8 bit),
> and then it is referenced directly by HTTP (and its HTTPS variant).
> 
> So this is no so "specific". MIME is extremely popular, RFC 4648 is
> extremely exotic (and RFC 4648 is wrong when saying that IMAP is very
> specific as it is now a very popular protocol, widely used as well).
> MIME is so frequently used, that almost all people refer to it when they
> look for Base64, or do not explicitly state that another definition
> (found in an exotic RFC) is explicitly used.

RFC 4648 is used in many, many Internet protocols. It's definitely not
"extremely exotic".

Peter


From unicode at unicode.org  Mon Oct 15 13:22:07 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 15 Oct 2018 20:22:07 +0200
Subject: Base64 encoding applied to different unicode texts always yields
 different base64 texts ... true or false?
In-Reply-To: <002801d46487$4821e350$d865a9f0$@xencraft.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
 <CAGa7JC2ZnswnE7otZLXiPKrDFjw8qAL_poSs0KH91TXEK+hhxQ@mail.gmail.com>
 <002801d46487$4821e350$d865a9f0$@xencraft.com>
Message-ID: <CAGa7JC05XOjm87J1k92_moRyWWgfDfcbH5piMYTYcDLtbS4VZA@mail.gmail.com>

Padding itself does not clearly indicate the length.
It's an artefact that **may** be infered only in some other layers of
protocols which specify when and how padding is needed (and how many
padding bytes are required or accepted), it works only if these upper layer
protocols are using **octets** streams, but it is still not usable for more
general bitstreams (with arbitrary bit lengths).

This RFC does not mandate/require these padding bytes and in fact many
upper layer protocols do not ever need it (including UTF-7 for example),
they are never necessary to infer a length in octets and insufficient for
specifying a length in bits.

As well the usage in MIME (where there's a requirement that lines of
headers or in the content body is limited to 1000 bytes) requires free
splitting of Base64 (there's no agreed maximum length, some sources insist
it should not be more than 72 bytes, others use 80 bytes, but mail
forwarding may add other characters at start of lines, forcing them to be
shorter (leaving for example a line of 72 bytes+CRLF and another line of 8
bytes+CRLF): this means that padding may not be used where one would expect
them, and padding can event occur in the middle of the encoded stream (not
just at end) along with other whitespaces or separators (like "> " at start
of lines in cited messages).

More generally the padding in MIME offers no benefit at all. The actual
length is infered from the whole content body, and it's just safer to
ignore/discard all padding symbols in decoders (just like they will discard
whitespaces or ">"). If one wants to get a sure indication that the stream
is not truncated and has the expected length, the encoded message must
either embed this length as part of the original binary stream itself, or
can embed secure "digital signatures", "message digests" or "hashes", or
the length can be specified separately in the unencoded MIME body, or as
part of the MIME header if the whole MIME content body is specified as
using a base64 encoding. The same applies to HTTP.

I have rarely seen RFC 4648 used alone outside of another upper layer
protocol. This statement in RFC 4648 section 3.1 is for example completely
wrong for Base16 where paddings are almost always avoided.

Various other Base-N profiles for other upper layer protocols never need
(and sometime even forbid) the presence of any padding symbol, or consider
that paddding can also be made using the bits representing 0 to pad the
original binary stream, or can be made using other ignored/discard
whitespaces or symbols, without assigning any specific role to "=" (as a
length indicator or stream terminator).


Le lun. 15 oct. 2018 ? 15:02, Tex <textexin at xencraft.com> a ?crit :

> Philippe, quote the entire section:
>
>
>
> In some circumstances, the use of padding ("=") in base-encoded data
>
>    is not required or used.  In the general case, when assumptions about
>
>    the size of transported data cannot be made, padding is required to
>
>    yield correct decoded data.
>
>
>
>    Implementations MUST include appropriate pad characters at the end of
>
>    encoded data unless the specification referring to this document
>
>    explicitly states otherwise.
>
>
>
> The first para clarifies that padding is required when the length is not
> otherwise known. Only if the length is provided or predefined can the
> padding be dropped.
>
> The second para clarifies it must be included unless the higher level
> protocol states otherwise, in which case it is likely using another
> mechanism to define length.
>
>
>
> It doesn?t seem to me to be as open ended as you implied in your initial
> mails, but well-defined depending on whether base64 is being used as spec?d
> in the RFC, or being explicitly modified to suit an embedding protocol.
>
> And certainly the first sentence in this section isn?t intended to be
> taken without the context of the rest of the section.
>
>
>
> tex
>
>
>
>
>
>
>
> *From:* Philippe Verdy [mailto:verdy_p at wanadoo.fr]
> *Sent:* Monday, October 15, 2018 4:14 AM
> *To:* Tex Texin
> *Cc:* Adam Borowski; unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st
> sentence, it is explicitly stated :
>
>
>
> In some circumstances, the use of padding ("=") in base-encoded data is not required or used.
>
>
>
> Le lun. 15 oct. 2018 ? 03:56, Tex <textexin at xencraft.com> a ?crit :
>
> Philippe,
>
>
>
> Where is the use of whitespace or the idea that 1-byte pieces do not need
> all the equal sign paddings documented?
>
> I read the rfc 3501 you pointed at, I don?t see it there.
>
>
>
> Are these part of any standards? Or are you claiming these are practices
> despite the standards? If so, are these just tolerated by parsers, or are
> they actually generated by encoders?
>
>
>
> What would be the rationale for supporting unnecessary whitespace? If
> linebreaks are forced at some line length they can presumably be removed at
> that length and not treated as part of the encoding.
>
> Maybe we differ on define where the encoding begins and ends, and where
> higher level protocols prescribe how they are embedded within the protocol.
>
>
>
> Tex
>
>
>
>
>
>
>
>
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe
> Verdy via Unicode
> *Sent:* Sunday, October 14, 2018 1:41 AM
> *To:* Adam Borowski
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Base64 encoding applied to different unicode texts always
> yields different base64 texts ... true or false?
>
>
>
> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is
> enough to indicate the end of an octets-span. The extra = after it do not
> add any other octet. and as well you're allowed to insert whitespaces
> anywhere in the encoded stream (this is what ensures that the
> Base64-encoded octets-stream will not be altered if line breaks are forced
> anywhere (notably within the body of emails).
>
>
>
> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR,
> LF, NEL) in the middle is non-significant and ignorable on decoding (their
> "encoded" bit length is 0 and they don't terminate an octets-span, unlike
> "=" which discards extra bits remaining from the encoded stream before that
> are not on 8-bit boundaries).
>
>
>
> Also:
>
> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol
> before "=" can vary in its 4 lowest bits (which are then ignored/discarded
> by the "=" symbol)
>
> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X"
> symbol before "=" can vary in its 2 lowest bits (which are then
> ignored/discarded by the "=" symbol)
>
>
>
> So you can use Base64 by encoding each octet in separate pieces, as one
> Base64 symbol followed by an "=" symbol, and even insert any number of
> whitespaces between them: there's a infinite number of valid Base64
> encodings for representing the same octets-stream payload.
>
>
>
> Base64 allows encoding any octets streams but not directly any
> bits-streams : it assumes that the effective bits-stream has a binary
> length multiple of 8. To encode a bits-stream with an exact number of bits
> (not multiple of 8), you need to encode an extra payload to indicate the
> effective number of bits to keep at end of the encoded octets-stream (or at
> start):
>
> - Base64 does not specify how you convert a bitstream of arbitrary length
> to an octets-stream;
>
> - for that purpose, you may need to pad the bits-stream at start or at end
> with 1 to 6 bits (so that it the resulting bitstream has a length multiple
> of 8, then encodable with Base64 which takes only octets on input).
>
> - these extra padding bits are not significant for the original bitstream,
> but are significant for the Base64 encoder/decoder, they will be discarded
> by the bitstream decoder built on top of the Base64 decoder, but not by the
> Base64 decoder itself.
>
>
>
> You need to encode somewhere with the bitstream encoder how many padding
> bits (0 to 7) are present at start or end of the octets-stream; this can be
> done:
>
> - as a separate payload (not encoded by Base64), or
>
> - by prepending 3 bits at start of the bits-stream then padded at end with
> 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64
> encoding.
>
> - by appending 3 bits at end of the  bits-stream, just after 1 to 7 random
> bits needed to get a bit-length multiple of 8 suitable for Base64 encoding.
>
> Finally your bits-stream decoder will be able to use this padding count to
> discard these random padding bits (and possibly realign the stream on
> different byte-boundaries when the effective bitlength bits-stream payload
> is not a multiple of 8 and padding bits were added)
>
>
>
> Base64 also does not specify how bits of the original bits-stream payload
> are packed into the octets-stream input suitable for Base64-encoding,
> notably it does not specify their order and endian-ness. The same remark
> applies as well for MIME, HTTP. So lot of network protocols and file
> formats need to how to properly encode which possible option is used to
> encode bits-streams of arbitrary length, or need to specify which default
> choice to apply if this option is not encoded, or which option must be used
> (with no possible variation). And this also adds to the number of distinct
> encodings that are possible but are still equivalent for the same effective
> bits-stream payload.
>
>
>
> All these allowed variations are from the encoder perspective. For
> interoperability, the decoder has to be flexible and to support various
> options to be compatible with different implementations of the encoder,
> notably when the encoder was run on a different system. And this is the
> case for the MIME transport by mail, or for HTTP and FTP transports, or
> file/media storage formats even if the file is stored on the same system,
> because it may actually be a copy stored locally but coming from another
> system where the file was actually encoded).
>
>
>
> Now if we come back to the encoding of plain-text payloads, Unicode just
> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code
> points (it actually does not mandate an exact bit-length because the range
> does not fully fit exactly to 21 bits and an encoder can still pack
> multiple code points together into more compact code units.
>
>
>
> However Unicode provides and standardizes several encodings (UTF-8/16/32)
> which use code units whose size is directly suitable as input for an
> octets-stream, so that they are directly encodable with Base64, without
> having to specify an extra layer for the bits-stream encoder/decoder.
>
>
>
> But many other encodings are still possible (and can be conforming to
> Unicode, provided they preserve each Unicode scalar value, or at least the
> code point identity because an encoder/decoder is not required to support
> non-character code points such as surrogates or U+FFFE), where Base64 may
> be used for internally generated octets-streams.
>
>
>
>
>
> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode <
> unicode at unicode.org> a ?crit :
>
> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote:
> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode <
> > unicode at unicode.org> a ?crit :
> > > The only variance is described as:
> > >
> > >   Care must be taken to use the proper octets for line breaks if base64
> > >   encoding is applied directly to text material that has not been
> > >   converted to canonical form.  In particular, text line breaks must be
> > >   converted into CRLF sequences prior to base64 encoding.  The
> > >   important thing to note is that this may be done directly by the
> > >   encoder rather than in a prior canonicalization step in some
> > >   implementations.
> > >
> > > This is MIME, it specifies (in the same RFC):
> >
> > I've not spoken aboutr the encoding of new lines **in the actual encoded
> > text**:
> > -  if their existing text-encoding ever gets converted to Base64 as if
> the
> > whole text was an opaque binary object, their initial text-encoding will
> be
> > preserved (so yes it will preserve the way these embedded newlines are
> > encoded as CR, LF, CR+LF, NL...)
> >
> > I spoke about newlines used in the transport syntax to split the initial
> > binary object (which may actually contain text but it does not matter).
> > MIME defines this operation and even requires splitting the binary object
> > in fragments with maximum binary size so that these binary fragments can
> be
> > converted with Base64 into lines with maximum length. In the MIME Base64
> > representation you can insert newlines anywhere between fragments encoded
> > separately.
>
> There's another kind of fragmentation that can make the encoding differ
> (but
> still decode to the same payload):
>
> The data stream gets split into 3-byte internal, 4-byte external packets.
> Any packet may contain less than those 3 bytes, in which cases it is padded
> with = characters:
> 3 bytes XXXX
> 2 bytes XXX=
> 1 byte  XX==
>
> Usually, such smaller packets happen only at the end of a message, but to
> support encoding a stream piecewise, they are allowed at any point.
>
> For example:
> "meow"     is bWVvdw==
> "me""ow"   is bWU=b3c=
> yet both carry the same payload.
>
> > Base64 is used exactly to support this flexibility in transport (or
> > storage) without altering any bit of the initial content once it is
> > decoded.
>
> Right, any such variations are in packaging only.
>
>
> ????
> --
> ???????
> ??????? 10 people enter a bar: 1 who understands binary,
> ??????? 1 who doesn't, D who prefer to write it as hex,
> ??????? and 1 who narrowly avoided an off-by-one error.
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181015/a66f6178/attachment.html>

From unicode at unicode.org  Mon Oct 15 14:26:24 2018
From: unicode at unicode.org (Steffen Nurpmeso via Unicode)
Date: Mon, 15 Oct 2018 21:26:24 +0200
Subject: Base64 encoding applied to different unicode texts always
 yields different base64 texts ... true or false?
In-Reply-To: <CAGa7JC05XOjm87J1k92_moRyWWgfDfcbH5piMYTYcDLtbS4VZA@mail.gmail.com>
References: <BN7PR09MB2546B1D4ADAEEA567E3B20BBC8E20@BN7PR09MB2546.namprd09.prod.outlook.com>
 <BN7PR09MB254617037A4C9D769CCC0C8BC8E30@BN7PR09MB2546.namprd09.prod.outlook.com>
 <CAGa7JC3UomnN+Qzr3JGhqgJY+e-y6AYFk+w9+jEARW4Ghyk8hg@mail.gmail.com>
 <20181013165019.sxGzV%steffen@sdaoden.eu>
 <CAGa7JC1MpLp2A1qY+6ae8c3NbMGS1Y1mu917cV30N_8Tz8nCEA@mail.gmail.com>
 <20181014013904.idfomqt5s65wnqro@angband.pl>
 <CAGa7JC3HaYeqz3+NFKVwib3+AFiZDEeiiCsbr1YukwD82TbF3w@mail.gmail.com>
 <000601d4642a$4274ec70$c75ec550$@xencraft.com>
 <CAGa7JC2ZnswnE7otZLXiPKrDFjw8qAL_poSs0KH91TXEK+hhxQ@mail.gmail.com>
 <002801d46487$4821e350$d865a9f0$@xencraft.com>
 <CAGa7JC05XOjm87J1k92_moRyWWgfDfcbH5piMYTYcDLtbS4VZA@mail.gmail.com>
Message-ID: <20181015192624.pY-ze%steffen@sdaoden.eu>

Philippe Verdy via Unicode wrote in <CAGa7JC05XOjm87J1k92_moRyWWgfDfcbH5\
piMYTYcDLtbS4VZA at mail.gmail.com>:
 |Padding itself does not clearly indicate the length.
 |
 |It's an artefact that **may** be infered only in some other layers \
 |of protocols which specify when and how padding is needed (and how \
 |many padding bytes 
 |are required or accepted), it works only if these upper layer protocols \
 |are using **octets** streams, but it is still not usable for more general 
 |bitstreams (with arbitrary bit lengths).
 |
 |This RFC does not mandate/require these padding bytes and in fact many \
 |upper layer protocols do not ever need it (including UTF-7 for example), \
 |they are 
 |never necessary to infer a length in octets and insufficient for specify\
 |ing a length in bits.
 |
 |As well the usage in MIME (where there's a requirement that lines of \
 |headers or in the content body is limited to 1000 bytes) requires free \
 |splitting of 
 |Base64 (there's no agreed maximum length, some sources insist it should \
 |not be more than 72 bytes, others use 80 bytes, but mail forwarding \
 |may add other 
 |characters at start of lines, forcing them to be shorter (leaving for \
 |example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): \
 |this means that 
 |padding may not be used where one would expect them, and padding can \
 |event occur in the middle of the encoded stream (not just at end) along \

That was actually a bug in my MUA.  Other MUAs were not capable of
decoding this correctly.
Sorry :-(!!

 |with other 
 |whitespaces or separators (like "> " at start of lines in cited messages).

In fact garbage bytes may be embedded explicitly says MIME.
Most handle that right, and skip (silently, maybe not right),
but some explicit base64 decoders fail miserably when such things
are seen (openssl base64, NetBSD base64 decoder (current)), others
do not (busybox base64, for example).

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From unicode at unicode.org  Mon Oct 15 14:57:25 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 15 Oct 2018 20:57:25 +0100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
 <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
Message-ID: <20181015205725.38772e05@JRWUBU2>

On Tue, 16 Oct 2018 02:47:36 +1100
Harshula via Unicode <unicode at unicode.org> wrote:

> Note, touching letters are formed by <ZWJ><AL-LAKUNA>, so they should
> not be displayed as a fallback for <AL-LAKUNA><ZWJ> conjuncts.

I don't follow that.  While the conjuncts with r-, -r and -y are very
different to pairs of touching letters, the conjuncts for tth, nd, ndr,
ndh, kv and tv would be very similar to the hypothetical corresponding
touching letters and quite different to the fallbacks with visible
al-lakuna.

Richard.

From unicode at unicode.org  Mon Oct 15 19:59:54 2018
From: unicode at unicode.org (Harshula via Unicode)
Date: Tue, 16 Oct 2018 11:59:54 +1100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <20181015205725.38772e05@JRWUBU2>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
 <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
 <20181015205725.38772e05@JRWUBU2>
Message-ID: <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au>

Hi Richard,

On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote:
> On Tue, 16 Oct 2018 02:47:36 +1100
> Harshula via Unicode <unicode at unicode.org> wrote:
> 
>> Note, touching letters are formed by <ZWJ><AL-LAKUNA>, so they should
>> not be displayed as a fallback for <AL-LAKUNA><ZWJ> conjuncts.
> 
> I don't follow that.  While the conjuncts with r-, -r and -y are very
> different to pairs of touching letters, the conjuncts for tth, nd, ndr,
> ndh, kv and tv would be very similar to the hypothetical corresponding
> touching letters and quite different to the fallbacks with visible
> al-lakuna.

If you haven't already, it's best you read SLS 1134:2011:
http://www.language.lk/en/download/standards/

or the older SLS 1134:2004:
http://unicode.org/wg2/docs/n2737.pdf

cya,
#

From unicode at unicode.org  Mon Oct 15 22:29:36 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 16 Oct 2018 04:29:36 +0100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
 <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
 <20181015205725.38772e05@JRWUBU2>
 <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au>
Message-ID: <20181016042936.21ce4fc9@JRWUBU2>

On Tue, 16 Oct 2018 11:59:54 +1100
Harshula via Unicode <unicode at unicode.org> wrote:

> Hi Richard,
> 
> On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote:
> > On Tue, 16 Oct 2018 02:47:36 +1100
> > Harshula via Unicode <unicode at unicode.org> wrote:
> >   
> >> Note, touching letters are formed by <ZWJ><AL-LAKUNA>, so they
> >> should not be displayed as a fallback for <AL-LAKUNA><ZWJ>
> >> conjuncts.  
> > 
> > I don't follow that.  While the conjuncts with r-, -r and -y are
> > very different to pairs of touching letters, the conjuncts for tth,
> > nd, ndr, ndh, kv and tv would be very similar to the hypothetical
> > corresponding touching letters and quite different to the fallbacks
> > with visible al-lakuna.  
> 
> If you haven't already, it's best you read SLS 1134:2011:
> http://www.language.lk/en/download/standards/
> 
> or the older SLS 1134:2004:
> http://unicode.org/wg2/docs/n2737.pdf

The latter actually says, in Section 5.8, that <AL-LAKUNA, ZWJ> may be
used for either!  I suspect that that is a printing error.

The Sri Lankan standard simply assumes that the rendering system can
accommodate what is requested in the backing store.  It says nothing
about fallbacks.  So, if the user specifies the the syllable ddho
written with a conjunct and encoded as ????? but the conjunct is
missing from the fonts' repertoires, why is it right to display it with
al-lakuna as though it were ???? but wrong to display it with the
touching letters encoded as ???????   There are three different
correct ways of writing 'ddho', but many systems only support one of
them (and some weirdly use a fourth method). 

Richard.


From unicode at unicode.org  Tue Oct 16 06:00:18 2018
From: unicode at unicode.org (Harshula via Unicode)
Date: Tue, 16 Oct 2018 22:00:18 +1100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <20181016042936.21ce4fc9@JRWUBU2>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
 <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
 <20181015205725.38772e05@JRWUBU2>
 <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au>
 <20181016042936.21ce4fc9@JRWUBU2>
Message-ID: <f774b974-19e8-2d01-01a3-5239d89584b9@hj.id.au>

Hi Richard,

On 16/10/18 2:29 pm, Richard Wordingham via Unicode wrote:
> On Tue, 16 Oct 2018 11:59:54 +1100
> Harshula via Unicode <unicode at unicode.org> wrote:
> 
>> Hi Richard,
>>
>> On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote:
>>> On Tue, 16 Oct 2018 02:47:36 +1100
>>> Harshula via Unicode <unicode at unicode.org> wrote:
>>>   
>>>> Note, touching letters are formed by <ZWJ><AL-LAKUNA>, so they
>>>> should not be displayed as a fallback for <AL-LAKUNA><ZWJ>
>>>> conjuncts.  
>>>
>>> I don't follow that.  While the conjuncts with r-, -r and -y are
>>> very different to pairs of touching letters, the conjuncts for tth,
>>> nd, ndr, ndh, kv and tv would be very similar to the hypothetical
>>> corresponding touching letters and quite different to the fallbacks
>>> with visible al-lakuna.  
>>
>> If you haven't already, it's best you read SLS 1134:2011:
>> http://www.language.lk/en/download/standards/
>>
>> or the older SLS 1134:2004:
>> http://unicode.org/wg2/docs/n2737.pdf
> 
> The latter actually says, in Section 5.8, that <AL-LAKUNA, ZWJ> may be
> used for either!  I suspect that that is a printing error.

The former (SLS1134:2011) has a section for Touching letters. It is
explicitly stated to use <ZWJ><AL-LAKUNA> for Touching letters.

Sorry, the file n2737.pdf hosted on unicode.org appears to be a draft.
It is not the final SLS1134:2004. The final contains a section on
Touching letters like SLS1134:2011.

> The Sri Lankan standard simply assumes that the rendering system can
> accommodate what is requested in the backing store.  It says nothing
> about fallbacks.  So, if the user specifies the the syllable ddho
> written with a conjunct and encoded as ????? but the conjunct is
> missing from the fonts' repertoires, why is it right to display it with
> al-lakuna as though it were ???? but wrong to display it with the
> touching letters encoded as ???????   There are three different
> correct ways of writing 'ddho', but many systems only support one of
> them (and some weirdly use a fourth method). 

When a font is missing a glyph during an *explicit* conjunct lookup, it
appears the most accurate solution is to display the missing glyph symbol.

cya,
#

From unicode at unicode.org  Tue Oct 16 19:04:19 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 17 Oct 2018 01:04:19 +0100
Subject: Fallback for Sinhala Consonant Clusters
In-Reply-To: <f774b974-19e8-2d01-01a3-5239d89584b9@hj.id.au>
References: <20181014010259.4fb5436a@JRWUBU2>
 <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au>
 <20181015085359.339c5747@JRWUBU2>
 <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au>
 <20181015205725.38772e05@JRWUBU2>
 <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au>
 <20181016042936.21ce4fc9@JRWUBU2>
 <f774b974-19e8-2d01-01a3-5239d89584b9@hj.id.au>
Message-ID: <20181017010419.704f5283@JRWUBU2>

On Tue, 16 Oct 2018 22:00:18 +1100
Harshula via Unicode <unicode at unicode.org> wrote:

> When a font is missing a glyph during an *explicit* conjunct lookup,
> it appears the most accurate solution is to display the missing glyph
> symbol.

However, I don't believe that that is the most useful solution, and it
certainly isn't when composing 'plain text'.  Now, if one is composing
the font that will be used to read the text as one writes the text, it
may have some benefit; it may also have some benefit if one can select
the font that will be used, and a suitable font is available.

Richard.

From unicode at unicode.org  Sat Oct 27 06:10:20 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sat, 27 Oct 2018 13:10:20 +0200
Subject: A sign/abbreviation for "magister"
Message-ID: <86tvl7tzkz.fsf@mimuw.edu.pl>


Hi!

On the over 100 years old postcard

https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6

you can see 2 occurences of a symbol which is explicitely explained (in
Polish) as meaning "Magister".

First question is: how do you interpret the symbol? For me it is
definitely the capital M followed by the superscript "r" (written in an
old style no longer used in Poland), but there is something below the
superscript. It looks like a small "z", but such an interpretation
doesn't make sense for me.

The second question is: are you familiar with such or a similar symbol?
Have you ever seen it in print?

The third and the last question is: how to encode this symbol in
Unicode?

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Sat Oct 27 07:36:59 2018
From: unicode at unicode.org (rein via Unicode)
Date: Sat, 27 Oct 2018 14:36:59 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86tvl7tzkz.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
Message-ID: <op.zrjmnxuinwijsc@desktop-fi4a3h4.fritz.box>

Janusz,

reminds me of the "numero sign "   &#8470;

I tried to read the letter but couldn't manage to all the way ;)

Droga i Kochana Wiria?ko

za?aczam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?.
co u Was wi?cej s?ycha?
?adnych jeszcze ni mam odpowiedzi ze znanych Ci miejscowoci
?adresowa?? do Staszki jak ty? chcia?a pisa? (W.Pan Mr Micha? Ga?kiewicz  
Feldspital 411 Feldpost 380.) Mr znaczy Magister. On przy tem szpitalu  
aptekarzem.
ca?uj? Ci? ze wargatkiem Mami r?czki
Tw?j Kochaj?cy W?odek
12/9 917

pozdrawiam, Rein

Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode  
<unicode at unicode.org>:

>
> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurences of a symbol which is explicitely explained (in
> Polish) as meaning "Magister".
>
> First question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense for me.
>
> The second question is: are you familiar with such or a similar symbol?
> Have you ever seen it in print?
>
> The third and the last question is: how to encode this symbol in
> Unicode?
>
> Best regards
>
> Janusz
>


-- 
Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181027/df3e453d/attachment.html>

From unicode at unicode.org  Sat Oct 27 07:58:38 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sat, 27 Oct 2018 05:58:38 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86tvl7tzkz.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
Message-ID: <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181027/2a84daec/attachment.html>

From unicode at unicode.org  Sat Oct 27 08:09:35 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sat, 27 Oct 2018 15:09:35 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <op.zrjmnxuinwijsc@desktop-fi4a3h4.fritz.box> (rein's message of
 "Sat, 27 Oct 2018 14:36:59 +0200")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <op.zrjmnxuinwijsc@desktop-fi4a3h4.fritz.box>
Message-ID: <867ei3sfhs.fsf@mimuw.edu.pl>

On Sat, Oct 27 2018 at 14:36 +0200, rein wrote:
> Janusz,
>
> reminds me of the "numero sign " &#8470;

Yes, that's definitely similar.

>
> I tried to read the letter but couldn't manage to all the way ;)

Congratulation, you have done it better than me!

>
> Droga i Kochana Wiria?ko

Rather "Wisie?ko": "Ludwika" -> "Ludwisie?ka" ->"Wisie?ka"

>
> za?aczam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?.  co
> u Was wi?cej s?ycha? ?adnych jeszcze ni mam odpowiedzi

I didn't recognized "odpowiedzi".

>ze znanych Ci miejscowoci ?adresowa??

"Adresowa?" makes sense, although some letters seem missing.

> do Staszki jak ty? chcia?a pisa?
>(W.Pan Mr Micha? Ga?kiewicz Feldspital 411 Feldpost 380.) Mr znaczy
>Magister. On przy tem szpitalu aptekarzem.  ca?uj? Ci? ze wargatkiem

I read this "wszystkiemi".

>Mami

I can't guess a word which would make sense of this phrase...

> r?czki Tw?j Kochaj?cy W?odek 12/9 917
>
> pozdrawiam, Rein

Nawzajem :-)

>
> Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode <unicode at unicode.org>:

[...]

>> The second question is: are you familiar with such or a similar symbol?
>> Have you ever seen it in print?

The postcard is from the front of the first WW written by an
Austro-Hungarian soldier. He explaines the meaning of the abbreviation
to his wife, so looks like the abbreviation was used but not very
popular.

>>
>> The third and the last question is: how to encode this symbol in
>> Unicode?

I've got a comment to this question off the list, but I'm waiting to see
more opinions.

Best regards

Janusz

P.S. I subscribe the list in the digest form but I look up the archive -
I think Asmus Freytag interpretation is the correct one (similar
interpretation was suggested also of the list).

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Sat Oct 27 09:32:49 2018
From: unicode at unicode.org (rein via Unicode)
Date: Sat, 27 Oct 2018 16:32:49 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <867ei3sfhs.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <op.zrjmnxuinwijsc@desktop-fi4a3h4.fritz.box> <867ei3sfhs.fsf@mimuw.edu.pl>
Message-ID: <op.zrjr0zp3nwijsc@desktop-fi4a3h4.fritz.box>

Janusz,

"wszystkimi m(oj)ami r?czki"  some sort of  plural instrumentalis :)

"embracing you with all my  hands/arms"

pozdrawiam, Rein

Op Sat, 27 Oct 2018 15:09:35 +0200 schreef Janusz S. Bie?  
<jsbien at mimuw.edu.pl>:

> On Sat, Oct 27 2018 at 14:36 +0200, rein wrote:
>> Janusz,
>>
>> reminds me of the "numero sign " &#8470;
>
> Yes, that's definitely similar.
>
>>
>> I tried to read the letter but couldn't manage to all the way ;)
>
> Congratulation, you have done it better than me!
>
>>
>> Droga i Kochana Wiria?ko
>
> Rather "Wisie?ko": "Ludwika" -> "Ludwisie?ka" ->"Wisie?ka"
>
>>
>> za??czam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?.  co
>> u Was wi?cej s?ycha? ?adnych jeszcze ni mam odpowiedzi
>
> I didn't recognized "odpowiedzi".
>
>> ze znanych Ci miejscowoci ?adresowa??
>
> "Adresowa?" makes sense, although some letters seem missing.
>
>> do Staszki jak ty? chcia?a pisa?
>> (W.Pan Mr Micha? Ga?kiewicz Feldspital 411 Feldpost 380.) Mr znaczy
>> Magister. On przy tem szpitalu aptekarzem.  ca?uj? Ci? ze wargatkiem
>
> I read this "wszystkiemi".
>
>> Mami
>
> I can't guess a word which would make sense of this phrase...
>
>> r?czki Tw?j Kochaj?cy W?odek 12/9 917
>>
>> pozdrawiam, Rein
>
> Nawzajem :-)
>
>>
>> Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode  
>> <unicode at unicode.org>:
>
> [...]
>
>>> The second question is: are you familiar with such or a similar symbol?
>>> Have you ever seen it in print?
>
> The postcard is from the front of the first WW written by an
> Austro-Hungarian soldier. He explaines the meaning of the abbreviation
> to his wife, so looks like the abbreviation was used but not very
> popular.
>
>>>
>>> The third and the last question is: how to encode this symbol in
>>> Unicode?
>
> I've got a comment to this question off the list, but I'm waiting to see
> more opinions.
>
> Best regards
>
> Janusz
>
> P.S. I subscribe the list in the digest form but I look up the archive -
> I think Asmus Freytag interpretation is the correct one (similar
> interpretation was suggested also of the list).
>


-- 
Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/


From unicode at unicode.org  Sat Oct 27 09:53:56 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sat, 27 Oct 2018 16:53:56 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <op.zrjr0zp3nwijsc@desktop-fi4a3h4.fritz.box> (rein's message of
 "Sat, 27 Oct 2018 16:32:49 +0200")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <op.zrjmnxuinwijsc@desktop-fi4a3h4.fritz.box>
 <867ei3sfhs.fsf@mimuw.edu.pl>
 <op.zrjr0zp3nwijsc@desktop-fi4a3h4.fritz.box>
Message-ID: <86a7mzphiz.fsf@mimuw.edu.pl>

On Sat, Oct 27 2018 at 16:32 +0200, rein wrote:
> Janusz,
>
> "wszystkimi m(oj)ami r?czki"  some sort of  plural instrumentalis :)

Rather "moimi", although still the phrase sounds strange.

> "embracing you with all my  hands/arms"

Now "kiss" (ca?owa?) and "embrace" (obejmowa?) are strictly separated,
but perhaps 100 years ago it was differently.

Bess regards

Janusz

P.S. This discussion is completely of the topic of the list, but I'm
very greatful for the help received on and off the list.

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Sat Oct 27 12:25:02 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sat, 27 Oct 2018 19:25:02 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com> (Asmus
 Freytag via Unicode's message of "Sat, 27 Oct 2018 05:58:38 -0700")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
Message-ID: <86bm7fnvyp.fsf@mimuw.edu.pl>

On Sat, Oct 27 2018 at  5:58 -0700, Asmus Freytag via Unicode wrote:

[...]

> My suspicion would be that the small "z" is rather a "=" that acquired
> a connecting stroke as part of quick handwriting.

You must be right.

In the meantime I looked up some other postcards written by the same
person i found several other abbreviation including ? 'NUMERO SIGN'
(U+2116) written in the same way, i.e. with a double instead of a single
line.

So we have a consensus about how to interpret the sign, but there are
still open questions about the scope of its usage, and its encoding.

Thanks one again to all who contributed to the discussion.

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Sat Oct 27 12:35:17 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 27 Oct 2018 19:35:17 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
Message-ID: <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>

Le sam. 27 oct. 2018 ? 15:06, Asmus Freytag via Unicode <unicode at unicode.org>
a ?crit :

> First question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense for me.
>
> My suspicion would be that the small "z" is rather a "=" that acquired a
> connecting stroke as part of quick handwriting.
>
I have the same kind of reading, the zigzagging stroek is an hnadwritten
emphasis of the uperscript r above it (explicitly noting it is terminating
the abbreviation), jut like the small underline that happens sometimes
below the superscript o in the abbreviation of "numero" (as well sometimes
there was not just one but two small underlines, including in some prints).

This sample is a perfect example of fast cursive handwritting (due to high
variability of all other letter shapes, sizes and joinings, where even the
capital M is written as two unconnected strokes), and it's not abnormal to
see in such condition this cursive joining between the two underlining
strokes so that it looks like a single zigzag.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181027/ca563044/attachment.html>

From unicode at unicode.org  Sat Oct 27 14:52:32 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 27 Oct 2018 19:52:32 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
Message-ID: <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>


Mr? / M=?

An image search for "magister symbol" finds many interesting graphics, 
but I couldn't find any resembling the abreviation shown on the post 
card.? (Magister symbol appears to be popular for certain religious and 
gaming uses.)


From unicode at unicode.org  Sat Oct 27 20:59:31 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 02:59:31 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
Message-ID: <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>

Do you speak about this one?
https://www.magisterdaire.com/magister-symbol-black-sq/
It looks like a graphic personal signature for the author of this esoteric
book, even if it looks like an interesting composition of several of our
existing Unicode symbols, glued together in a vertical ligature, rather
than a pure combining sequence.
Such technics can be used extensively to create lot of other symbols, by
gluing any kind of wellknown glyphs for standard characters.
Mathematics and technologies (but also companies for their private
corporate logos and branding marks) are constantly inventing new symbols
like this.


Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <unicode at unicode.org>
a ?crit :

>
> Mr? / M=?
>
> An image search for "magister symbol" finds many interesting graphics,
> but I couldn't find any resembling the abreviation shown on the post
> card.  (Magister symbol appears to be popular for certain religious and
> gaming uses.)
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/c46c6eab/attachment.html>

From unicode at unicode.org  Sat Oct 27 21:40:58 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 03:40:58 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
 <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
Message-ID: <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>

More interesting: the Masonic alphabet
http://tallermasonico.com/0diccio1.htm

- 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
and K), are disposed by group of 2 letters in a 3x3 square grid, whose
global outer sides are not marked on the outer border of the grid but on
lines separating columns or rows. Then letters are noted by the marked
sides of the square in which they are located, the second letter of the
group being distinguished by adding a dot in the middle of the square.
- The 4 other letters U to Z (excluding V and W) are noted by disposing
them on a 2x2 square grid (this time rotated 45 degrees), whose global
outer sides are also not marked on the outer border of the grid but on
lines separating columns or rows (only 1 letter is places by cell).
They are also noted by the marked sides of their square only.- Finally (if
needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
distinguished by adding the central dot.


   AB | CD | EF
 ------+-----+-----
   GH | I L | MN
 ------+-----+-----
   OP | QR | ST

     \  XK  /
 UJ  >  < WZ
     /  YV  \


So:
- "A" becomes approximately  "_|"
- "B" becomes approximately  "_|" with central dot
- "U" becomes approximately ">"
- "X" becomes approximately "\/"
- "J" is noted like "I" as a square, or distinctly approximately as ">"
with a central dot

The 3x3 grid had some esoterical meaning based on numerology (a legend now
propaged by scientology).


Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> Do you speak about this one?
> https://www.magisterdaire.com/magister-symbol-black-sq/
> It looks like a graphic personal signature for the author of this esoteric
> book, even if it looks like an interesting composition of several of our
> existing Unicode symbols, glued together in a vertical ligature, rather
> than a pure combining sequence.
> Such technics can be used extensively to create lot of other symbols, by
> gluing any kind of wellknown glyphs for standard characters.
> Mathematics and technologies (but also companies for their private
> corporate logos and branding marks) are constantly inventing new symbols
> like this.
>
>
> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <unicode at unicode.org>
> a ?crit :
>
>>
>> Mr? / M=?
>>
>> An image search for "magister symbol" finds many interesting graphics,
>> but I couldn't find any resembling the abreviation shown on the post
>> card.  (Magister symbol appears to be popular for certain religious and
>> gaming uses.)
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/e7266f95/attachment.html>

From unicode at unicode.org  Sat Oct 27 22:02:55 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 04:02:55 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
 <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
 <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
Message-ID: <CAGa7JC3_6H0JC2KLdAa20k_e7YWSFJZvK6K5spOmgfg6AohyBw@mail.gmail.com>

I must add that the Masonic 3x3 grid alphabet has been proposed as an
alternative to Braille, easier to learn and memoize, easier and faster to
draw with a pen on paper without any physical guide, and easier also to
recognize using only tactile contact by a finger tip, but more difficult to
form without cutting the sheet of paper while tracing the strokes. But it
was seen on some manufactured Masonic objects.

To note digits with the same shapes (like does Braille with its 2x3 dots
grid), the same 3x3 grid is used for digits 1 to 9 (digit 0 uses the same
square where it is significant as 5, but with a central dot, or use a
space), but additional symbols "+" and "-" are used (without central dot)
to switch between letters and digits. The placement of digits 1 to 9
(except 0 and 5) on the 3x3 grid varies (horizontally first, or vertically
first).

Le dim. 28 oct. 2018 ? 03:40, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> More interesting: the Masonic alphabet
> http://tallermasonico.com/0diccio1.htm
>
> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
> global outer sides are not marked on the outer border of the grid but on
> lines separating columns or rows. Then letters are noted by the marked
> sides of the square in which they are located, the second letter of the
> group being distinguished by adding a dot in the middle of the square.
> - The 4 other letters U to Z (excluding V and W) are noted by disposing
> them on a 2x2 square grid (this time rotated 45 degrees), whose global
> outer sides are also not marked on the outer border of the grid but on
> lines separating columns or rows (only 1 letter is places by cell).
> They are also noted by the marked sides of their square only.- Finally (if
> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
> distinguished by adding the central dot.
>
>
>    AB | CD | EF
>  ------+-----+-----
>    GH | I L | MN
>  ------+-----+-----
>    OP | QR | ST
>
>      \  XK  /
>  UJ  >  < WZ
>      /  YV  \
>
>
> So:
> - "A" becomes approximately  "_|"
> - "B" becomes approximately  "_|" with central dot
> - "U" becomes approximately ">"
> - "X" becomes approximately "\/"
> - "J" is noted like "I" as a square, or distinctly approximately as ">"
> with a central dot
>
> The 3x3 grid had some esoterical meaning based on numerology (a legend now
> propaged by scientology).
>
>
> Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy <verdy_p at wanadoo.fr> a
> ?crit :
>
>> Do you speak about this one?
>> https://www.magisterdaire.com/magister-symbol-black-sq/
>> It looks like a graphic personal signature for the author of this
>> esoteric book, even if it looks like an interesting composition of several
>> of our existing Unicode symbols, glued together in a vertical ligature,
>> rather than a pure combining sequence.
>> Such technics can be used extensively to create lot of other symbols, by
>> gluing any kind of wellknown glyphs for standard characters.
>> Mathematics and technologies (but also companies for their private
>> corporate logos and branding marks) are constantly inventing new symbols
>> like this.
>>
>>
>> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <unicode at unicode.org>
>> a ?crit :
>>
>>>
>>> Mr? / M=?
>>>
>>> An image search for "magister symbol" finds many interesting graphics,
>>> but I couldn't find any resembling the abreviation shown on the post
>>> card.  (Magister symbol appears to be popular for certain religious and
>>> gaming uses.)
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/588c42f3/attachment.html>

From unicode at unicode.org  Sat Oct 27 22:12:06 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sat, 27 Oct 2018 20:12:06 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
 <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
 <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
Message-ID: <CA+p4_H0bGXznduFpnsd0-+OJ-x2_cGR+aYZ_2kXFb1h6rDLNnQ@mail.gmail.com>

I learned that one as a kid, as the "pigpen cipher". I'm not aware of any
numerological significance (which is easy enough to "find" in anything).

On Sat, Oct 27, 2018 at 7:43 PM Philippe Verdy via Unicode <
unicode at unicode.org> wrote:

> More interesting: the Masonic alphabet
> http://tallermasonico.com/0diccio1.htm
>
> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
> global outer sides are not marked on the outer border of the grid but on
> lines separating columns or rows. Then letters are noted by the marked
> sides of the square in which they are located, the second letter of the
> group being distinguished by adding a dot in the middle of the square.
> - The 4 other letters U to Z (excluding V and W) are noted by disposing
> them on a 2x2 square grid (this time rotated 45 degrees), whose global
> outer sides are also not marked on the outer border of the grid but on
> lines separating columns or rows (only 1 letter is places by cell).
> They are also noted by the marked sides of their square only.- Finally (if
> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
> distinguished by adding the central dot.
>
>
>    AB | CD | EF
>  ------+-----+-----
>    GH | I L | MN
>  ------+-----+-----
>    OP | QR | ST
>
>      \  XK  /
>  UJ  >  < WZ
>      /  YV  \
>
>
> So:
> - "A" becomes approximately  "_|"
> - "B" becomes approximately  "_|" with central dot
> - "U" becomes approximately ">"
> - "X" becomes approximately "\/"
> - "J" is noted like "I" as a square, or distinctly approximately as ">"
> with a central dot
>
> The 3x3 grid had some esoterical meaning based on numerology (a legend now
> propaged by scientology).
>
>
> Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy <verdy_p at wanadoo.fr> a
> ?crit :
>
>> Do you speak about this one?
>> https://www.magisterdaire.com/magister-symbol-black-sq/
>> It looks like a graphic personal signature for the author of this
>> esoteric book, even if it looks like an interesting composition of several
>> of our existing Unicode symbols, glued together in a vertical ligature,
>> rather than a pure combining sequence.
>> Such technics can be used extensively to create lot of other symbols, by
>> gluing any kind of wellknown glyphs for standard characters.
>> Mathematics and technologies (but also companies for their private
>> corporate logos and branding marks) are constantly inventing new symbols
>> like this.
>>
>>
>> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <unicode at unicode.org>
>> a ?crit :
>>
>>>
>>> Mr? / M=?
>>>
>>> An image search for "magister symbol" finds many interesting graphics,
>>> but I couldn't find any resembling the abreviation shown on the post
>>> card.  (Magister symbol appears to be popular for certain religious and
>>> gaming uses.)
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181027/f3fc9ae0/attachment.html>

From unicode at unicode.org  Sat Oct 27 22:16:55 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 04:16:55 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
 <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
 <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
Message-ID: <CAGa7JC1LOhSzd4sMzazOnrEZgtyyyqWcCP+R-gz9+TeaPs9VxQ@mail.gmail.com>

So in summary this Masonic "alphabet" uses 13 square "letters" and a single
combining mark (the central dot), possibly extended with the minus and plus
signs and space. It's possible that the central dot is used as a spacing
mark to note a punctuation.
The assignment of Latin (or Hebrew) letters to this alphabet varies (just
like Braille symbols depending on languages/scripts)
It may have extensions (like Braille outside its basic 2x3 patterns of
dots), such as a second dot in squares, horizontally as "??" or vertically
as ":"

Le dim. 28 oct. 2018 ? 03:40, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> More interesting: the Masonic alphabet
> http://tallermasonico.com/0diccio1.htm
>
> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
> global outer sides are not marked on the outer border of the grid but on
> lines separating columns or rows. Then letters are noted by the marked
> sides of the square in which they are located, the second letter of the
> group being distinguished by adding a dot in the middle of the square.
> - The 4 other letters U to Z (excluding V and W) are noted by disposing
> them on a 2x2 square grid (this time rotated 45 degrees), whose global
> outer sides are also not marked on the outer border of the grid but on
> lines separating columns or rows (only 1 letter is places by cell).
> They are also noted by the marked sides of their square only.- Finally (if
> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
> distinguished by adding the central dot.
>
>
>    AB | CD | EF
>  ------+-----+-----
>    GH | I L | MN
>  ------+-----+-----
>    OP | QR | ST
>
>      \  XK  /
>  UJ  >  < WZ
>      /  YV  \
>
>
> So:
> - "A" becomes approximately  "_|"
> - "B" becomes approximately  "_|" with central dot
> - "U" becomes approximately ">"
> - "X" becomes approximately "\/"
> - "J" is noted like "I" as a square, or distinctly approximately as ">"
> with a central dot
>
> The 3x3 grid had some esoterical meaning based on numerology (a legend now
> propaged by scientology).
>
>
> Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy <verdy_p at wanadoo.fr> a
> ?crit :
>
>> Do you speak about this one?
>> https://www.magisterdaire.com/magister-symbol-black-sq/
>> It looks like a graphic personal signature for the author of this
>> esoteric book, even if it looks like an interesting composition of several
>> of our existing Unicode symbols, glued together in a vertical ligature,
>> rather than a pure combining sequence.
>> Such technics can be used extensively to create lot of other symbols, by
>> gluing any kind of wellknown glyphs for standard characters.
>> Mathematics and technologies (but also companies for their private
>> corporate logos and branding marks) are constantly inventing new symbols
>> like this.
>>
>>
>> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <unicode at unicode.org>
>> a ?crit :
>>
>>>
>>> Mr? / M=?
>>>
>>> An image search for "magister symbol" finds many interesting graphics,
>>> but I couldn't find any resembling the abreviation shown on the post
>>> card.  (Magister symbol appears to be popular for certain religious and
>>> gaming uses.)
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/83597312/attachment.html>

From unicode at unicode.org  Sat Oct 27 22:29:26 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 04:29:26 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CA+p4_H0bGXznduFpnsd0-+OJ-x2_cGR+aYZ_2kXFb1h6rDLNnQ@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <CAGa7JC3G5Xjj+e=7RVYA7kAo+ycDJsxNBzcmz5_Koor8EFE+dg@mail.gmail.com>
 <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com>
 <CAGa7JC2rGW1wArXy+yjkNRg2_yz1kFMqfcR_ax1zGXdFUf9Peg@mail.gmail.com>
 <CAGa7JC2AoHRd_mzqgSZBvgG47qkTjc4roqzcf-nt7gORY386Vw@mail.gmail.com>
 <CA+p4_H0bGXznduFpnsd0-+OJ-x2_cGR+aYZ_2kXFb1h6rDLNnQ@mail.gmail.com>
Message-ID: <CAGa7JC058b9iUzgskwLHdeOH594sJAJfMA-sc8LrY12SSGOZvQ@mail.gmail.com>

If it was encoded in Unicode, it would use a single column and the encoding
seems evident:

x0 = MASONIC SQUARE SPACE
x1 = MASONIC SYMBOL A B OR ONE
x2 = MASONIC SYMBOL C D OR TWO
x3 = MASONIC SYMBOL E F OR THREE
x4 = MASONIC SYMBOL G H OR FOUR
x5 = MASONIC SYMBOL I L OR ZERO FIVE
x6 = MASONIC SYMBOL M N OR SIX
x7 = MASONIC SYMBOL O P OR SEVEN
x8 = MASONIC SYMBOL Q R OR EIGHT
x9 = MASONIC SYMBOL S T OR NINE
xA = MASONIC SYMBOL U J
xB = MASONIC SYMBOL X K
xC = MASONIC SYMBOL Y V
xD = MASONIC SYMBOL Z W
xE = MASONIC COMBINING DOT
xF = MASONIC COMBINING DOUBLE DOT (?)


Le dim. 28 oct. 2018 ? 04:21, Garth Wallace via Unicode <unicode at unicode.org>
a ?crit :

> I learned that one as a kid, as the "pigpen cipher". I'm not aware of any
> numerological significance (which is easy enough to "find" in anything).
>
> On Sat, Oct 27, 2018 at 7:43 PM Philippe Verdy via Unicode <
> unicode at unicode.org> wrote:
>
>> More interesting: the Masonic alphabet
>> http://tallermasonico.com/0diccio1.htm
>>
>> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J
>> and K), are disposed by group of 2 letters in a 3x3 square grid, whose
>> global outer sides are not marked on the outer border of the grid but on
>> lines separating columns or rows. Then letters are noted by the marked
>> sides of the square in which they are located, the second letter of the
>> group being distinguished by adding a dot in the middle of the square.
>> - The 4 other letters U to Z (excluding V and W) are noted by disposing
>> them on a 2x2 square grid (this time rotated 45 degrees), whose global
>> outer sides are also not marked on the outer border of the grid but on
>> lines separating columns or rows (only 1 letter is places by cell).
>> They are also noted by the marked sides of their square only.- Finally (if
>> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are
>> distinguished by adding the central dot.
>>
>>
>>    AB | CD | EF
>>  ------+-----+-----
>>    GH | I L | MN
>>  ------+-----+-----
>>    OP | QR | ST
>>
>>      \  XK  /
>>  UJ  >  < WZ
>>      /  YV  \
>>
>>
>> So:
>> - "A" becomes approximately  "_|"
>> - "B" becomes approximately  "_|" with central dot
>> - "U" becomes approximately ">"
>> - "X" becomes approximately "\/"
>> - "J" is noted like "I" as a square, or distinctly approximately as ">"
>> with a central dot
>>
>> The 3x3 grid had some esoterical meaning based on numerology (a legend
>> now propaged by scientology).
>>
>>
>> Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy <verdy_p at wanadoo.fr> a
>> ?crit :
>>
>>> Do you speak about this one?
>>> https://www.magisterdaire.com/magister-symbol-black-sq/
>>> It looks like a graphic personal signature for the author of this
>>> esoteric book, even if it looks like an interesting composition of several
>>> of our existing Unicode symbols, glued together in a vertical ligature,
>>> rather than a pure combining sequence.
>>> Such technics can be used extensively to create lot of other symbols, by
>>> gluing any kind of wellknown glyphs for standard characters.
>>> Mathematics and technologies (but also companies for their private
>>> corporate logos and branding marks) are constantly inventing new symbols
>>> like this.
>>>
>>>
>>> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode <
>>> unicode at unicode.org> a ?crit :
>>>
>>>>
>>>> Mr? / M=?
>>>>
>>>> An image search for "magister symbol" finds many interesting graphics,
>>>> but I couldn't find any resembling the abreviation shown on the post
>>>> card.  (Magister symbol appears to be popular for certain religious and
>>>> gaming uses.)
>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/99055765/attachment.html>

From unicode at unicode.org  Sun Oct 28 03:13:26 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 28 Oct 2018 08:13:26 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
Message-ID: <20181028081326.264dc079@JRWUBU2>

On Sat, 27 Oct 2018 05:58:38 -0700
Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 10/27/2018 4:10 AM, Janusz S. Bie? via Unicode wrote:

>> you can see 2 occurences of a symbol which is explicitely explained
>> (in Polish) as meaning "Magister".
 
>> First question is: how do you interpret the symbol? For me it is
>> definitely the capital M followed by the superscript "r" (written in
>> an old style no longer used in Poland), but there is something below
>> the superscript. It looks like a small "z", but such an
>> interpretation
>> doesn't make sense for me.
 
>> The second question is: are you familiar with such or a similar
>> symbol? Have you ever seen it in prin> 
>> The third and the last question is: how to encode this symbol in
>> Unicode?

> My suspicion would be that the small "z" is rather a "=" that
> acquired a connecting stroke as part of quick handwriting.

The notation is a quite widespread format for abbreviations.  the
first letter is normal sized, and the subsequent letter is written in
some variety of superscript with a squiggle underneath so that it
doesn't get overlooked.  I have deduced that this is not plain text
because there is no encoding mechanism for it.  For example, our
lecturers would frequently use this treatment to abbreviate function
as 'fn' with the 'n' superscript and supported by a squiggle below
sitting on the baseline.  The squiggle below has meaning; it marks the
word as an abbreviation.

Richard. 


From unicode at unicode.org  Sun Oct 28 03:32:11 2018
From: unicode at unicode.org (arno.schmitt via Unicode)
Date: Sun, 28 Oct 2018 09:32:11 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181028081326.264dc079@JRWUBU2>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
Message-ID: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>

Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode:
> The notation is a quite widespread format for abbreviations.  the
> first letter is normal sized, and the subsequent letter is written in
> some variety of superscript with a squiggle underneath so that it
> doesn't get overlooked.  I have deduced that this is not plain text
> because there is no encoding mechanism for it.  For example, our
> lecturers would frequently use this treatment to abbreviate function
> as 'fn' with the 'n' superscript and supported by a squiggle below
> sitting on the baseline.  The squiggle below has meaning; it marks the
> word as an abbreviation.
> 
> Richard.

Looks to me like  U+2116 ? NUMERO SIGN
which perhaps should not have encoded,
since we have both U+004E LATIN CAPITAL LETTER N and
U+00BA ? MASCULINE ORDINAL INDICATOR

Arn0

From unicode at unicode.org  Sun Oct 28 09:19:50 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 15:19:50 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
Message-ID: <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>

Given the "squiggle" below letters are actually gien distinctive semantics,
I think it should be encoded a combining character (to be written not after
a "superscript" but after any normal base letter, possibly with other
combining characters, or CGJ if needed because of the compatibility
equivalence.
That "squiggle" (which may look like an underscore) would haver the effect
of implicity making the base letter superscript (smaller and elevated). It
would have probably a "combining below" class.

In that case U+2116 ? is perfectly encodable, but still distinct from
<N,o,COMBINING ABBREVIATION MARK> because "?" does not require this mark
(so there's no problem of stability with canonical equivalences, even if
this creates new possible confusable pairs when either the mark is used
after a normal letter: the risk of confusion only exists for "?" which is a
legacy non-decomposable ligature but that has an existing compatibility
equivalence, just like all other subscript letters).

In that case we have other ways to note *semantically* any abbreviations
using distinctive final letters (including for N<os> abbreviating
"Numeros", M<me> for "Madame", M<le> for "Mademoiselle", M<rs>, M<gr> for
Monseigneur, P<r> abbreviating "Professor"/"Professeur", or f<n>
abbreviating "function").

Notes:
* The <o> and <os> are also used in French, instead of <n> or <ns> to
abbreviate a "-tion" or "-tions" suffix (which derives from Latin "-tio" or
"-tios"). But I've also seen other abbreviation marks used for "-tion" and
"-tions".
* we also have in Unicode distinctive codes for dots used as abbreviation
marks (they are not combining, but still encoded distinctly from the
regular punctuation full stop), and for the mathematical binary dot
operator, or the decimal separator, or for implicit mathematical operators
that don't mark anything (i.e. invisible and zero-wdth) but that only break
grapheme clusters and prohibit formation of discretionary ligatures).

Medieval books or mails contained lot of abbreviation marks due to the cost
of paper (or parchment): texts were then frequently "packed" using
combining abbreviation marks in various positions (generally above or
below). The Germanic "Fraktur e" was a remnant of this old practice,
inherited from phonetic annotations added on top of Greek, Hebrew and
Arabic, which later turned into an "umlaut" that Unicode unified with the
diaeresis, even if it breaks the historic link to the letter Latin "e" used
like an abreviation mark or Hebrew vowel point in Fraktur (I think that the
history of the "Germanic Fraktur e" is highly linked to the influence of
Hebrew in today's Germany, or Greek in today's Eastern and Southern Europe
with some Slavic traditions in Cyrillic connected to religious traditions
in Greek).
The introduction of interlinear annotations in Greek was also margely
influenced by Hebrew and Arabic (which however did not turn these marks
into plain letters and avoided the formation of complex ligatures like in
Indian Brahmic scripts), but was the base of the interlinear notation of
actual phonetic.
Even the combining accents in French were created after an initial step
using ligatures of plain letters, before people started to replace these
ligatures by some unstable combining marks (initially not distinguished)
then turned them into plain distinctive accents which became the de facto
standard (made the offical orthography only very late: before that there
was a wide variation between those that wanted to distinguish phonetics,
using different accents, but now French tends to simplify this set: the
circumflkex in French was an abreviation mark for the unwritten letter "s"
which initially was more like the tilde, i.e. a turned small "s"). The
German umlaut written like a diaeresis is also very new (only after the
abandonment of the Fraktut alphabet where the "e" just looked like two
thick vertical strokes


Le dim. 28 oct. 2018 ? 10:41, arno.schmitt via Unicode <unicode at unicode.org>
a ?crit :

> Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode:
> > The notation is a quite widespread format for abbreviations.  the
> > first letter is normal sized, and the subsequent letter is written in
> > some variety of superscript with a squiggle underneath so that it
> > doesn't get overlooked.  I have deduced that this is not plain text
> > because there is no encoding mechanism for it.  For example, our
> > lecturers would frequently use this treatment to abbreviate function
> > as 'fn' with the 'n' superscript and supported by a squiggle below
> > sitting on the baseline.  The squiggle below has meaning; it marks the
> > word as an abbreviation.
> >
> > Richard.
>
> Looks to me like  U+2116 ? NUMERO SIGN
> which perhaps should not have encoded,
> since we have both U+004E LATIN CAPITAL LETTER N and
> U+00BA ? MASCULINE ORDINAL INDICATOR
>
> Arn0
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/06d50dc7/attachment.html>

From unicode at unicode.org  Sun Oct 28 12:28:24 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sun, 28 Oct 2018 18:28:24 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 (Philippe Verdy via Unicode's message of "Sun, 28 Oct 2018 15:19:50
 +0100")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
Message-ID: <86in1mgevb.fsf@mimuw.edu.pl>

On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
> Given the "squiggle" below letters are actually gien distinctive
> semantics, I think it should be encoded a combining character (to be
> written not after a "superscript" but after any normal base letter,
> possibly with other combining characters, or CGJ if needed because of
> the compatibility equivalence.  That "squiggle" (which may look like
> an underscore) would haver the effect of implicity making the base
> letter superscript (smaller and elevated). It would have probably a
> "combining below" class.

Seems to me an elegant solution.

[...]

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> Mr? / M=?

For me only the latter seems acceptable. Using COMBINING LATIN SMALL
LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
the base character. However in the lack of a better solution I can live
with it :-)

An alternative would be to use SMALL EQUALS SIGN, but looks like fonts
supporting it are rather rare. 

>
> Le dim. 28 oct. 2018 ? 10:41, arno.schmitt via Unicode <unicode at unicode.org> a ?crit :

[...]

>  Looks to me like U+2116 ? NUMERO SIGN
>  which perhaps should not have encoded,
>  since we have both U+004E LATIN CAPITAL LETTER N and
>  U+00BA ? MASCULINE ORDINAL INDICATOR    

I'm rather sure it is inherited from a character set used for the
round-trip test.

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Sun Oct 28 12:54:34 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 18:54:34 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86in1mgevb.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
Message-ID: <CAGa7JC3V_rPKkQxP8xJcQAO1QzdUi2oQ=Cq4m=p7tFnaTDrh3A@mail.gmail.com>

Le dim. 28 oct. 2018 ? 18:28, Janusz S. Bie? <jsbien at mimuw.edu.pl> a ?crit :

> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
> > Given the "squiggle" below letters are actually gien distinctive
> > semantics, I think it should be encoded a combining character (to be
> > written not after a "superscript" but after any normal base letter,
> > possibly with other combining characters, or CGJ if needed because of
> > the compatibility equivalence.  That "squiggle" (which may look like
> > an underscore) would haver the effect of implicity making the base
> > letter superscript (smaller and elevated). It would have probably a
> > "combining below" class.
>
> Seems to me an elegant solution.
>
> [...]
>
> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
> > Mr? / M=?
>
> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However in the lack of a better solution I can live
> with it :-)
>

There's a third alternative, that uses the superscript letter r, followed
by the combining double underline, instead of the normal letter r followed
by the same combining double underline.
However it is still not very elegant if we stil need to use only the
limited set of superscript letters (this still reduces the number of
abbreviations, such as those commonly used in French that needs a
superscript "?")
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/c059b661/attachment.html>

From unicode at unicode.org  Sun Oct 28 13:18:29 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 28 Oct 2018 19:18:29 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <CAGa7JC3V_rPKkQxP8xJcQAO1QzdUi2oQ=Cq4m=p7tFnaTDrh3A@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <CAGa7JC3V_rPKkQxP8xJcQAO1QzdUi2oQ=Cq4m=p7tFnaTDrh3A@mail.gmail.com>
Message-ID: <CAGa7JC0gyid_qXn4688u-V=P6yZB3cK6d7zPko5NaJ8K7kCY=w@mail.gmail.com>

Also if the "combining abbreviation mark" is used only at end of a
combining sequence to transform it, we can avoid all needs of CGJ for that
mark, if the mark is itself assigned the combining class 0.
So
- abbreviating "Mister" as "M<r>" (without the underscore below "r") becomes
  <M><r,COMBINING ABBREVIATION MARK>
- abbreviating "Monseigneur" as "M<gs>" (without the underscore below "g"
and "r") becomes
  <M><g,COMBINING ABBREVIATION MARK><s,COMBINING ABBREVIATION MARK>
- abbreviating "Ditto" as "D<to>" (without the underscore below "to")
becomes
  <D><t,COMBINING ABBREVIATION MARK><o,COMBINING ABBREVIATION MARK>
- abbreviating "Operation" as "Op<tn> (without the underscore below "to")
becomes
  <O><p><t,COMBINING ABBREVIATION MARK><n,COMBINING ABBREVIATION MARK>
- abbreviating "constitutionalit?" as "C<t?> (without the underscore below
"t?") becomes
  <C><t,COMBINING ABBREVIATION MARK><?,COMBINING ABBREVIATION MARK> or
  <C><t,COMBINING ABBREVIATION MARK><e,COMBINING ACUTE><COMBINING
ABBREVIATION MARK>
- abbreviating "Num?ro" as "N<o>" (without the underscore below "o") becomes
  <N><o,COMBINING UNDERLINE,COMBINING ABBREVIATION MARK>
- abbreviating "Magister" as "M<r>" (with the double underscore below "r")
becomes
  <M><r,COMBINING DOUBLE UNDERLINE,COMBINING ABBREVIATION MARK>

It is quite easy for text renderers to infer the selection of a small
superscript for the base (and its other combining characters or extenders
when they support these combinations), before applying the new combiner
mark. If not, they can still render the leading base (and its other
supported combining characters or extenders), followed by some dotted mark
(e.g. a small dotted circle).
Renderers that do not recognize the new combining abbreviation mark will
just render it at end of the sequence as a usual square or rectangular
"tofu"; those that recognize it as a combining character but no support for
it, will render the usual dotted square (meaning "unsupported combining
mark", to distinguish from the meaning as if there was a "missing base
character" to apply before a known combining mark or extender)


Le dim. 28 oct. 2018 ? 18:54, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> Le dim. 28 oct. 2018 ? 18:28, Janusz S. Bie? <jsbien at mimuw.edu.pl> a
> ?crit :
>
>> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote:
>> > Given the "squiggle" below letters are actually gien distinctive
>> > semantics, I think it should be encoded a combining character (to be
>> > written not after a "superscript" but after any normal base letter,
>> > possibly with other combining characters, or CGJ if needed because of
>> > the compatibility equivalence.  That "squiggle" (which may look like
>> > an underscore) would haver the effect of implicity making the base
>> > letter superscript (smaller and elevated). It would have probably a
>> > "combining below" class.
>>
>> Seems to me an elegant solution.
>>
>> [...]
>>
>> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:
>> > Mr? / M=?
>>
>> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
>> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
>> the base character. However in the lack of a better solution I can live
>> with it :-)
>>
>
> There's a third alternative, that uses the superscript letter r, followed
> by the combining double underline, instead of the normal letter r followed
> by the same combining double underline.
> However it is still not very elegant if we stil need to use only the
> limited set of superscript letters (this still reduces the number of
> abbreviations, such as those commonly used in French that needs a
> superscript "?")
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/b63362ae/attachment.html>

From unicode at unicode.org  Sun Oct 28 15:12:27 2018
From: unicode at unicode.org (Garth Wallace via Unicode)
Date: Sun, 28 Oct 2018 13:12:27 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
Message-ID: <CA+p4_H3u-VE1QNMgVTVawWHv5iv3vYk3Y9kqACO9TJvAn1MB2Q@mail.gmail.com>

On Sun, Oct 28, 2018 at 2:34 AM arno.schmitt via Unicode <
unicode at unicode.org> wrote:

> Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode:
> > The notation is a quite widespread format for abbreviations.  the
> > first letter is normal sized, and the subsequent letter is written in
> > some variety of superscript with a squiggle underneath so that it
> > doesn't get overlooked.  I have deduced that this is not plain text
> > because there is no encoding mechanism for it.  For example, our
> > lecturers would frequently use this treatment to abbreviate function
> > as 'fn' with the 'n' superscript and supported by a squiggle below
> > sitting on the baseline.  The squiggle below has meaning; it marks the
> > word as an abbreviation.
> >
> > Richard.
>
> Looks to me like  U+2116 ? NUMERO SIGN
> which perhaps should not have encoded,
> since we have both U+004E LATIN CAPITAL LETTER N and
> U+00BA ? MASCULINE ORDINAL INDICATOR
>

AIUI, ? was encoded as a compatibility character because it appears in some
East Asian character sets
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181028/54b41a47/attachment.html>

From unicode at unicode.org  Sun Oct 28 15:42:04 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 28 Oct 2018 20:42:04 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86in1mgevb.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
Message-ID: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>

This is no different the Irish name McCoy which can be written M?Coy where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative. 

I would encode this as M? if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=? or anything else like that, because the ?r? is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr?, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.)

I like palaeographic renderings of text very much indeed, and in fact remain in conflict with members of the UTC (who still, alas, do NOT communicate directly about such matters, but only in duelling ballot comments) about some actually salient representations required for medievalist use. The squiggle in your sample, Janusz, does not indicate anything; it is only a decoration, and the abbreviation is the same without it.

Michael Everson

> On 28 Oct 2018, at 17:28, Janusz S. Bie? via Unicode <unicode at unicode.org> wrote:
> 
> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However in the lack of a better solution I can live
> with it :-)


From unicode at unicode.org  Sun Oct 28 16:47:44 2018
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Sun, 28 Oct 2018 21:47:44 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <b1391d307e21842c78aa1a17f4926617@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <b1391d307e21842c78aa1a17f4926617@mail.gmail.com>
Message-ID: <8063207F-0BAF-495E-A95B-BFAAAE4BBAE4@evertype.com>

I think that it is the _superscription_ that indicates the fact that it is an abbreviation. 

In English ??e" was written ?ye? and and ?y?? ?y?? and the last of these might have a dot or a line or a squiggle underneath it, or not, and in no case was that dot or line or squiggle either _meaningful_ or necessary.

Michael Everson

> On 28 Oct 2018, at 21:43, Piotr Karocki <pkar at ieee.org> wrote:
> 
>> The squiggle in your sample, Janusz, does not indicate anything; it is only a decoration, and the abbreviation is the same without it.
> 
> I disagreee. This squiggle means "warning, this is abbreviation", and is
> present in many abbreviations in many centuries (sometimes, although,
> 'abbrev symbol' is rendered differently). So yes, it is important symbol and
> shouldn't be lost in transliteration.
> 
> Piotr Karocki


From unicode at unicode.org  Sun Oct 28 18:57:06 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sun, 28 Oct 2018 23:57:06 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
Message-ID: <e5f9859d-fedc-43d9-6e01-64e984545cde@gmail.com>


The umlauts in the band name "M?tley Cr?e" are decorative, yet the 
difference between "M?tley Cr?e" and "Motley Crue" is one of spelling.? 
Although the tilde in the place name "Rancho Pe?asquitos" is *not* 
decorative, "Rancho Pe?asquitos" vs. "Rancho Penasquitos" is still a 
spelling difference.

Dingbats are both decorative and representable in computer plain text.? 
(??????)

Conventions exist in computer plain text for distinguishing *bold* and 
/italic/ text strings, why not a convention for abbreviation 
superscripts & squiggles?? (At least until something better comes along, 
such as a direct encoding along the lines of Philippe Verdy's earlier 
suggestion.)

"M=?" might render properly (or not, Notepad using Lucida Console fails 
here), but it wouldn't easily accommodate needed superscripted Latin 
small diacriticized letters.

"Mr?" for display purposes may look as daft as "/italics/", but it 
captures the elements of the text of the original manuscript.? And it 
would allow preservation of abbreviations such as for 
"constitutionalit?" ? "Ct???".

If "Mccoy" vs. "McCoy" vs. "MCCOY" vs. "MC COY" represent spelling 
differences, then so do "McCoy" vs "M?Coy".? It's a matter of opinion, 
and opinions often differ.


From unicode at unicode.org  Mon Oct 29 00:21:57 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Mon, 29 Oct 2018 06:21:57 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> (Michael
 Everson's message of "Sun, 28 Oct 2018 20:42:04 +0000")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
Message-ID: <86efc91g5m.fsf@mimuw.edu.pl>

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson wrote:
> This is no different the Irish name McCoy which can be written M?Coy
> where the raising of the c is actually just decorative, though perhaps
> it was once an abbreviation for Mac. In some styles you can see a line
> or a dot under the raised c. This is purely decorative.
>
> I would encode this as M? if you wanted to make sure your data
> contained the abbreviation mark. 
[...]

> The squiggle in your sample, Janusz, does not indicate anything; it is
> only a decoration, and the abbreviation is the same without it.

I have received off the list even more radical suggestion:

>>>  The third and the last question is: how to encode this symbol in
>>>  Unicode?
>
> Why would you need to? Its plain text content is adequately
> represented by "Mr"

On Sun, Oct 28 2018 at 23:57 GMT, James Kass wrote:
> The umlauts in the band name "M?tley Cr?e" are decorative, yet the
> difference between "M?tley Cr?e" and "Motley Crue" is one of
> spelling.? Although the tilde in the place name "Rancho Pe?asquitos"
> is *not* decorative, "Rancho Pe?asquitos" vs. "Rancho Penasquitos" is
> still a spelling difference.

[...]

> If "Mccoy" vs. "McCoy" vs. "MCCOY" vs. "MC COY" represent spelling
> differences, then so do "McCoy" vs "M?Coy".? It's a matter of opinion,
> and opinions often differ.

Well said, but I make the claim stronger; it depends on the purpose of
the encoding and intended applications.

Handwriting recognition (HWR) is no longer just an abstract possibility,
it's a facility present to everybody e.g. in Transkribus
(https://transkribus.eu/) which I actually use for transcribing the
texts of interest. Do you claim that in the ground-truth for HWR the
squiggle and raising doesn't matter?

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Mon Oct 29 01:50:11 2018
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Mon, 29 Oct 2018 06:50:11 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
Message-ID: <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>

On 2018/10/29 05:42, Michael Everson via Unicode wrote:
> This is no different the Irish name McCoy which can be written M?Coy where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative.
> 
> I would encode this as M? if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=? or anything else like that, because the ?r? is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr?, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.)

I think this may depend on actual writing practice. In German at least, 
it is customary to have dots (periods) at the end of abbreviations, and 
using any other symbol, or not using the dot, would be considered an error.

The question of how to encode that dot is fortunately an easy one, but 
even if it were not, German-writing people would find a sentence such as 
"The dot or ... has no meaning at all." extremely weird. The dot is 
there (and in German, has to be there) because it's an abbreviation.

Regards,   Martin.


From unicode at unicode.org  Mon Oct 29 02:57:45 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 29 Oct 2018 07:57:45 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <86efc91g5m.fsf@mimuw.edu.pl>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <86efc91g5m.fsf@mimuw.edu.pl>
Message-ID: <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com>


Janusz S. Bie? asked,

 > Do you claim that in the ground-truth for HWR the
 > squiggle and raising doesn't matter?

Not me!? "McCoy", "M=?Coy", and "M-?Coy" are three different ways of 
writing the same surname.? If I were entering plain text data from an 
old post card, I'd try to keep the data as close to the source as 
possible.? Because that would be my purpose.? Others might have 
different purposes.? As you state, it depends on the intention. But, if 
there were an existing plain text convention I'd be inclined to use it.? 
Conventions allow for the possibility of interchange, direct encoding 
would ensure it.


From unicode at unicode.org  Mon Oct 29 05:43:57 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Mon, 29 Oct 2018 11:43:57 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com> (James Kass's
 message of "Mon, 29 Oct 2018 07:57:45 +0000")
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <86efc91g5m.fsf@mimuw.edu.pl>
 <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com>
Message-ID: <86bm7duj6a.fsf@mimuw.edu.pl>

On Mon, Oct 29 2018 at  7:57 GMT, James Kass wrote:
> Janusz S. Bie? asked,
>
>> Do you claim that in the ground-truth for HWR the
>> squiggle and raising doesn't matter?
>
> Not me!

I know, sorry if my previous mail was confusing.

> "McCoy", "M=?Coy", and "M-?Coy" are three different ways of
> writing the same surname.? If I were entering plain text data from an
> old post card, I'd try to keep the data as close to the source as
> possible.? Because that would be my purpose.? Others might have
> different purposes.? As you state, it depends on the intention. But,
> if there were an existing plain text convention I'd be inclined to use
> it.? Conventions allow for the possibility of interchange, direct
> encoding would ensure it.

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Mon Oct 29 06:36:04 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Mon, 29 Oct 2018 04:36:04 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
Message-ID: <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181029/5c1528ab/attachment.html>

From unicode at unicode.org  Mon Oct 29 06:53:01 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 29 Oct 2018 11:53:01 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
Message-ID: <20181029115301.664e62e1@JRWUBU2>

On Sun, 28 Oct 2018 20:42:04 +0000
Michael Everson via Unicode <unicode at unicode.org> wrote:

> I like palaeographic renderings of text very much indeed, and in fact
> remain in conflict with members of the UTC (who still, alas, do NOT
> communicate directly about such matters, but only in duelling ballot
> comments) about some actually salient representations required for
> medievalist use. The squiggle in your sample, Janusz, does not
> indicate anything; it is only a decoration, and the abbreviation is
> the same without it.

I think this is one of the few cases where Multicode may have
advantages over Unicode.  In a mathematical contest, a? would be
interpreted as _a_ applied _n_ times.  As to "f?", ambiguity may be
avoided by the superscript being inappropriate for an exponent.  What
is redundant in one context may be significant in another.

Richard. 


From unicode at unicode.org  Mon Oct 29 14:20:49 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Mon, 29 Oct 2018 12:20:49 -0700
Subject: A sign/abbreviation for "magister"
Message-ID: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>

Richard Wordingham wrote:
 
>> I like palaeographic renderings of text very much indeed, and in fact
>> remain in conflict with members of the UTC (who still, alas, do NOT
>> communicate directly about such matters, but only in duelling ballot
>> comments) about some actually salient representations required for
>> medievalist use. The squiggle in your sample, Janusz, does not
>> indicate anything; it is only a decoration, and the abbreviation is
>> the same without it.
>
> I think this is one of the few cases where Multicode may have
> advantages over Unicode. In a mathematical contest, a? would be
> interpreted as _a_ applied _n_ times. As to "f?", ambiguity may be
> avoided by the superscript being inappropriate for an exponent. What
> is redundant in one context may be significant in another.
 
Are you referring to the encoding described in the 1997 paper by
Mudawwar, which "address[es] Unicode's principal drawbacks" by switching
between language-specific character sets? Kind of like ISO 2022, but
less extensible?
 
ObMagister: I agree that trying to reflect every decorative nuance of
handwriting is not what plain text is all about. (I also disagree with
those who insist that superscripted abbreviations are required for
correct spelling in certain languages, and I expect to draw swift
flamage for that stance.) The abbreviation in the postcard, rendered in
plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion just
fuels the recurring demands for every Latin letter (and eventually those
in other scripts) to be duplicated in subscript and superscript, ? la
L2/18-206.

Back into my hole now.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Mon Oct 29 15:20:36 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 29 Oct 2018 21:20:36 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
Message-ID: <397615514.10318.1540844437188.JavaMail.www@wwinf2209>

On 29/10/18 20:29, Doug Ewell via Unicode wrote:
[?]
> ObMagister: I agree that trying to reflect every decorative nuance of
> handwriting is not what plain text is all about.

Agreed.

> (I also disagree with
> those who insist that superscripted abbreviations are required for
> correct spelling in certain languages, and I expect to draw swift
> flamage for that stance.)

It all (no ?flamage?, just trying to understand) depends on how we 
set the level of requirements, and what is understood by ?correct?.
There is even an official position arguing that representing an "?" 
with an "oe" string is correct, and that using the correct "?" is 
not required. 

> The abbreviation in the postcard, rendered in
> plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion

In English, ?Mr? for ?Mister? is correct, because English does not use 
superscript here, according to my knowledge. Ordinal indicators are 
considered different, and require superscript in correct representation.
Thus being trained on English, one cannot easily evaluate what is 
correct and what is required for correctness in a neighbor locale.

> just
> fuels the recurring demands for every Latin letter (and eventually those
> in other scripts) to be duplicated in subscript and superscript, ? la
> L2/18-206.

That is a generic request, unrelated to any locale, based only on a kind 
of criticism of poor rendering systems. The ?fake super-/subscripts? are 
already fixed if only OpenType is supported and fonts are complete.

> 
> Back into my hole now.

No worries. Stay tuned :-) Informed discussion brings advancement.

Best regards,

Marcel


From unicode at unicode.org  Mon Oct 29 20:47:25 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 30 Oct 2018 02:47:25 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
Message-ID: <CAGa7JC06yF7t4NJL=zfW1Nzeht=O_CsznWuO7PjFajpy9dSu8Q@mail.gmail.com>

For the case of "Mister" vs. "Magister", the (double) underlining is not
just a stylistic option but conveys semantics as an explicit abbreviation
mark !
We are here at the line between what is pure visual encoding (e.g. using
superscript letters), and logical encoding (as done eveywhere else in
unicode with combining sequences; the most well known exceptions being for
Thai script which uses the visual model).
Obviously the Latin script should not use any kind of visual encoding, and
even the superscript letters (initially introduced for something else,
notably as distinct symbols for IPA) was not the correct path (it also has
limitation because the superscript letters are quite limited; the same can
be saif about the visual encoding of Mathematic symbols as stylistic
variants transformed as plain characters, which will always be incomplete,
while it could as well be represented logically).
So Unicode does not have a consistent policy (and this inconsistence was
not just introduced due to legacy roundtrip compatibibility, like the
Numero abbreviation or the encoding of the Thai script).


Le lun. 29 oct. 2018 ? 12:44, Asmus Freytag via Unicode <unicode at unicode.org>
a ?crit :

> On 10/28/2018 11:50 PM, Martin J. D?rst via Unicode wrote:
>
> On 2018/10/29 05:42, Michael Everson via Unicode wrote:
>
> This is no different the Irish name McCoy which can be written M?Coy where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative.
>
> I would encode this as M? if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=? or anything else like that, because the ?r? is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr?, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.)
>
>
> I think this may depend on actual writing practice. In German at least,
> it is customary to have dots (periods) at the end of abbreviations, and
> using any other symbol, or not using the dot, would be considered an error.
>
> The question of how to encode that dot is fortunately an easy one, but
> even if it were not, German-writing people would find a sentence such as
> "The dot or ... has no meaning at all." extremely weird. The dot is
> there (and in German, has to be there) because it's an abbreviation.
>
> Swedes employ ":" for abbreviations but often (always?) for eliding
> several word-interior letters. Definitely also a case of a non-optional
> convention.
>
> The use of superscript is tricky, because it can be optional in some
> contexts; if I write "3rd" in English, it will definitely be understood no
> different from "3rd". Likewise with the several marks below superscripts.
> Whether "numero" has an underline or not appears to be a matter of font
> design, with some regional preferences (which also affect the style of the
> N).
>
> I'm very much with James that questions of what is spelling vs. what is
> style (decoration) can be a matter of opinion - or better perhaps, a matter
> of convention and associated expectations. And that there may not always be
> unanimity in the outcome.
>
> In TeX the two transition fluidly. If I was going to transcribe such texts
> in TeX, I would construct a macro for the construct of the entire
> abbreviation and would name it. That macro would raise the "r", and then -
> depending on the desired fidelity of the style of the document, might
> include secondary elements, such as underlining, or a squiggle.
>
> In the standard rich text model of plaintext "back bone" combined with
> font selection (and other styling), the named macro would correspond to
> encoding the semantic of an Mr abbreviation in the "superscript r"
> convention and the details would be handled in the font design.
>
> That system is perhaps not well suited to exact transcriptions because
> unlike Tex, it separates the two aspects, and removes the aspect of
> detailed glyph design from the control of the author, unless the latter is
> also a font-designer.
>
> Nevertheless, I think the use of devices like combining underlines and
> superscript letters in plain text are best avoided.
>
> A./
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181030/da9712f4/attachment.html>

From unicode at unicode.org  Mon Oct 29 22:06:57 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 30 Oct 2018 03:06:57 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
Message-ID: <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>


Asmus Freytag wrote,

 > Nevertheless, I think the use of devices like combining underlines
 > and superscript letters in plain text are best avoided.

That's probably true according to the spirit of the underlying encoding 
principles.? But hasn't that genie already left the bottle?

People write their names as they please.? With the entire repertoire of 
Unicode from which to choose, people are coming up with some amazingly 
unorthodox ways to "spell" their screen names.? Here's six screen names 
copy/pasted from an atypical Twitter account's comments sections:

Jo? ????ic???

I?MAGI?NER?

????

IXOYE444 (?This one included character U+200F, I removed it.)

Q?y ? eT ? Dog ? VOTES?

 ??? ?? ??K?????z ?? ??? ??? (?Note the decorative emoji.)

People are mixing scripts and so forth in order to create distinctive 
screen names.? Those screen names are out there in the wild and are part 
of our stored data which future historians are welcome to scratch their 
heads over.

IIRC, around the time that the math alphanumerics were added to Plane 
One, Michael Everson noted that once characters are encoded people will 
use them as they see fit.? In this present thread, Michael Everson wrote:

 > And I would not encode it as Mr?, firstly because it
 > would never render properly and you might as well
 > encode it as Mr. or M:r, and second because in the
 > IPA at least that character indicates an alveolar
 > realization in disordered speech. (Of course it
 > could be used for anything.)

Yes, it could be used for anything requiring combining-two-lines-below.? 
At some point, if enough people were doing it, it would morph from a 
kludge of hacking alveolar whatevers into an accepted convention.? (Not 
that I am pushing this approach, it's only one suggestion out of many 
possibilities.? I'm in favor of direct encoding.)? I would not encode 
the abbreviation as either "Mr." or "M:r" because neither of those text 
strings appear in the original manuscript.

FAICT, "????" is pronounced just like "Tom", but it ain't spelled the 
same.? Likewise for "McCoy" and "M=?Coy".

It strikes me as perverse if "????" can spell his name as he pleases 
using the UCS but "M=?Coy" mustn't.? Especially since names like 
"M=?Coy" and abbreviations such as "M=?" could be typed on old-style 
mechanical typewriters.? Quintessential plain-text, that.


From unicode at unicode.org  Mon Oct 29 23:46:50 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Mon, 29 Oct 2018 21:46:50 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
Message-ID: <28edcd88-2294-741c-e65f-eb52891459ae@att.net>


On 10/29/2018 8:06 PM, James Kass via Unicode wrote:
> could be typed on old-style mechanical typewriters.? Quintessential 
> plain-text, that.

Nope. Typewriters were regularly used for underscoring and for 
strikethrough, both of which are *styling* of text, and not plain text. 
The mere fact that some visual aspect of graphic representation on a 
page of paper can be implemented via a mechanical typewriter does not, 
ipso facto, mean that particular feature is plain text. The fact that I 
could also implement superscripting and subscripting on a mechanical 
typewriter via turning the platen up and down half a line, also does not 
make *those* aspects of text styling plain text. either.

The same reasoning applies to handwriting, only more so.

--Ken


From unicode at unicode.org  Tue Oct 30 03:42:29 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 30 Oct 2018 08:42:29 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
Message-ID: <20181030084229.0f67ce4d@JRWUBU2>

On Mon, 29 Oct 2018 12:20:49 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:

> Richard Wordingham wrote:

> > I think this is one of the few cases where Multicode may have
> > advantages over Unicode. In a mathematical contest, a? would be
> > interpreted as _a_ applied _n_ times. As to "f?", ambiguity may be
> > avoided by the superscript being inappropriate for an exponent. What
> > is redundant in one context may be significant in another.  
>  
> Are you referring to the encoding described in the 1997 paper by
> Mudawwar, which "address[es] Unicode's principal drawbacks" by
> switching between language-specific character sets? Kind of like ISO
> 2022, but less extensible?

More precisely to the principle.  What is an irrelevant, optional
feature in one writing system may be significant in another.  I'm
currently trying to work out the rules for writing Pali in the Sinhala
script - I have to worry about the difference between touching letters
and conjuncts.  A simple ISCII-like encoding for Sinhala Pali would
delegate such matters to the font.

Richard.


From unicode at unicode.org  Tue Oct 30 04:02:53 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 30 Oct 2018 09:02:53 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <28edcd88-2294-741c-e65f-eb52891459ae@att.net>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
 <28edcd88-2294-741c-e65f-eb52891459ae@att.net>
Message-ID: <ff088d2c-00e8-d466-1585-6e48dc81290f@gmail.com>


Ken Whistler replied,

 >> could be typed on old-style mechanical
 >> typewriters.? Quintessential plain-text, that.
 >
 > Nope. Typewriters were regularly used for
 > underscoring and for strikethrough, both of which
 > are *styling* of text, and not plain text. The
 > mere fact that some visual aspect of graphic
 > representation on a page of paper can be
 > implemented via a mechanical typewriter does not,
 > ipso facto, mean that particular feature is plain
 > text. The fact that I could also implement
 > superscripting and subscripting on a mechanical
 > typewriter via turning the platen up and down half
 > a line, also does not make *those* aspects of text
 > styling plain text. either.

Sorry if we disagree.

I've never used a typewriter for producing anything other than text.? 
Just plain old unadorned text.? Plain text.? Colloquially speaking 
rather than speaking technically.? Text existed before the computer age.

A typewriter puts text on paper.? Pressing the "M" key while holding the 
"Shift" key puts "M" on the sheet.? Rolling the platen appropriately and 
striking "r" puts a superscript "r" on the sheet. Hitting the backspace 
key, rolling the platen a bit in the other direction and typing the 
"equals" key finishes this abbreviation in the text on the page.? Then 
the user rolls the platen to its earlier position and resumes typing.? 
(It's way easier to do than to describe.)

If the typist didn't intend to put a superscript "r" on that page with a 
double underline, the typist wouldn't have bothered with all that jive.

It's about the importance one places on respecting authorial intent.

Anything reasonable done on a mechanical typewriter can be replicated in 
an electronic data display.? If necessary I'd use a kludge before I'd 
hold my breath waiting for direct encoding when the desired result is 
for the displayed text on the screen to match the handwritten text in 
the source as closely as possible.? (I've used lots of kludges while 
awaiting the real M=?Coy.)

Sure, underscoring was used for s?t?r?e?s?s?, but it wasn't used *as* a 
stylistic difference as much as it was used *in lieu* of the ability to 
make a stylistic difference, such as bolding or italicizing.? It's the 
"plain text" convention of that time, predating the asterisks or slashes 
used in the modern convention. Underscoring might be stripped without 
messing with the legibility, but so could tatweels and lots of other 
stuff.? If nothing should mung the asterisks and slashes used in the 
modern convention, then the earlier convention's underscoring is every 
bit as worthy of being preserved.? (If I'm not mistaken, there was also 
some kind of underscoring convention for titles which was used instead 
of placing titles in quotes.)

Strikethrough isn't stylistic if it's done to type a character which 
isn't present on one of the keys.? For example, letters with strokes 
used for minority languages, like "?".? I don't see strikethrough as 
"style" if the typist didn't want to waste White Out on a draft, either.

Perhaps I should have referred to typewritten text as seminal plain text 
rather than quintessential plain text, but quintessential scans better.

Speaking of text, computer age or otherwise, the O.E.D. definition of 
text as related to computers appears outdated and/or incomplete:
https://en.oxforddictionaries.com/definition/text
(definition 1.3)


From unicode at unicode.org  Tue Oct 30 06:43:14 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 30 Oct 2018 11:43:14 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <ff088d2c-00e8-d466-1585-6e48dc81290f@gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
 <28edcd88-2294-741c-e65f-eb52891459ae@att.net>
 <ff088d2c-00e8-d466-1585-6e48dc81290f@gmail.com>
Message-ID: <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com>


(Still responding to Ken Whistler's post)

 > The fact that I could also implement superscripting and subscripting on a
 > mechanical typewriter via turning the platen up and down half a line, 
also
 > does not make *those* aspects of text styling plain text. either.

Do you know the difference between H?SO? and H2SO4?? One of them is a 
chemical formula, the other one is a license plate number. T?h?a?t? is 
not a stylistic difference /in my book/.? (Emphasis added.)

But suppose both those strings were *intended* to represent the chemical 
formula?? Then one of them would be optimally correct; the other one... meh.

Now what if we were future historians given the task of encoding both of 
those strings, from two different sources, and had no idea what those 
two strings were supposed to represent?? Wouldn't it be best to preserve 
both strings intact, as they were originally written?


From unicode at unicode.org  Tue Oct 30 08:13:01 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Tue, 30 Oct 2018 13:13:01 +0000 (GMT)
Subject: A sign/abbreviation for "magister"
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
 <28edcd88-2294-741c-e65f-eb52891459ae@att.net>
 <ff088d2c-00e8-d466-1585-6e48dc81290f@gmail.com>
 <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com>
Message-ID: <slrnptgm6s.2hd.jcb@home.stevens-bradfield.com>

On 2018-10-30, James Kass via Unicode <unicode at unicode.org> wrote:
> (Still responding to Ken Whistler's post)
....
> Do you know the difference between H?SO? and H2SO4?? One of them is a 
> chemical formula, the other one is a license plate number. T?h?a?t? is 
> not a stylistic difference /in my book/.? (Emphasis added.)

Yes. In chemical notation, sub/superscripting is semantically
significant.
That's not the case for abbreviations: the choice of Mr or any of its
superscripted and decorated variations is not semantically
significant.
The English abbreviation Mr was also frequently superscripted in the
15th-17th centuries, and that didn't mean anything special either - it
was just part of a general convention of superscripting the final
segment of abbreviations, probably inherited from manuscript practice.

> But suppose both those strings were *intended* to represent the chemical 
> formula?? Then one of them would be optimally correct; the other one... meh.
>
> Now what if we were future historians given the task of encoding both of 
> those strings, from two different sources, and had no idea what those 
> two strings were supposed to represent?? Wouldn't it be best to preserve 
> both strings intact, as they were originally written?

Indeed - and that means an image, not any textual representation. The
typeface might be significant too.

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From unicode at unicode.org  Tue Oct 30 10:52:47 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 30 Oct 2018 16:52:47 +0100 (CET)
Subject: A sign/abbreviation for "magister"
Message-ID: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>

Rather than a dozen individual e-mails, I?m sending this omnibus reply 
for the record, because even if here and in CLDR (SurveyTool forum and 
Trac) everything has already been discussed and fixed, there is still 
a need to stay acknowledging, so as not to fail following up, with 
respect to the oncoming surveys, next of which is to start in 30 days.

First here: On 29/10/2018 at 12:43, Dr Freytag via Unicode wrote:

[?]
> The use of superscript is tricky, because it can be optional in some
> contexts; if I write "3rd" in English, it will definitely be
> understood no different from "3rd". 

[Note that this second instance was actually intended to read "3??", 
but it was formatted using a higher-level protocol.]

[?]
> In TeX the two transition fluidly. If I was going to transcribe such
> texts in TeX, I would construct a macro [?]
[?]
> Nevertheless, I think the use of devices like combining underlines
> and superscript letters in plain text are best avoided.

While most other scripts from Arabic to Duployan are generously granted 
all and everything they need for accurate representation, starting with 
preformatted superscripts and ending with superscripting or subscripting 
format controls, Latin script is often quite deliberately pulled down 
in order to make it unusable outside high-end DTP software, from 
TeX to Adobe InDesign, with the notable exception of sparsely and 
parsimoniously encoded preformatted characters for phoneticists and 
medievalists. E.g. in Arabic script, superscript is considered worth 
encoding and using without any caveat, whereas when Latin script is on, 
superscripts are thrown into the same cauldron as underscoring.

Obviously Unicode don?t apply to Latin script the same principle they 
do to all other scripts, i.e. to free preformatted letters as suitable 
if they are part of a standard representation and in some cases are 
needed to ensure unambiguity. Mediterranean locales had preformatted 
ordinal indicators even in the Latin-1-only era, despite "1a" and "2o" 
may be understood no different from "1?" and 2?". The degree sign, that 
is on French keyboards, is systematically hijacked to represent the 
"n?" abbreviation, unless a string is limited to ASCII-only. Several 
Latin-script-using locales have standard representations and strong 
user demands for superscripts, which instead of being satisfied on 
Unicode level as would be done for any other of the world?s scripts, 
are obstinately rebuffed when not intended for phonetics, or in 
some cases, for palaeography.

I wasn?t digging down to find out about those UTC members who on a 
regular basis are aggressively contradicting ballot comments about 
encoding palaeographic Latin letters, while proving unable to sustain 
any open and honest discussion on this List or elsewhere. Referring to 
what Dr Everson via Unicode wrote on 28/10/2018 at 21:49:

> I like palaeographic renderings of text very much indeed, and in fact
> remain in conflict with members of the UTC (who still, alas, do NOT
> communicate directly about such matters, but only in duelling ballot
> comments) about some actually salient representations required for
> medievalist use.


That said: On 29/10/2018 at 09:09, James Kass via Unicode wrote:
[?]
> If I were entering plain text data from an old post card, I'd try
> to keep the data as close to the source as possible. Because that
> would be my purpose. Others might have different purposes. 
> As you state, it depends on the intention. But, if there were an
> existing plain text convention I'd be inclined to use it. 
> Conventions allow for the possibility of interchange, direct
> encoding would ensure it.

The goal of discouraging Latin superscripts is obviously to ensure 
that reliable document interchange is limited to the PDF. 

If Unicode were allowed to emit an official recommendation to use 
preformatted superscripts in Latin script, too, then font designers 
would implement comprehensive support of combining diacritics, and 
any plain text including superscripted abbreviations could use the 
preformatted characters, in order to gather the interoperability 
that Unicode was designed for. Referring to what Dr Verdy via Unicode 
wrote on 28/10/2018 at 19:01:

[?]
> However it is still not very elegant if we stil need to use only
> the limited set of superscript letters (this still reduces the
> number of abbreviations, such as those commonly used in French
> that needs a superscript "?")

The use of combining diacritics with preformatted superscripts is 
also the reason why Unicode is limiting encoding support to base 
letters, even for preformatted superscript letters. The rule that 
no *new* precomposed letters with acute accent are encoded anymore 
applies to superscripts too. A Unicode-conformant way to represent 
such abbreviations would IMO use U+1D49 followed by U+0301: ,??,.
Other representations may require OpenType support, which in Latin 
script is often turned off, supposedly in order to shift to higher 
level protocols what Unicode makes available in plain text.
Referring to what Dr Kass wrote on 29/10/2018 at 01:05:

[?]
> "Mr?" for display purposes may look as daft as "/italics/", but
> it captures the elements of the text of the original manuscript. 
> And it would allow preservation of abbreviations such as for 
> "constitutionalit?" ? "Ct???".

Using superscripts plus combining diacritics might be a way to 
address the limitations Dr Verdy mentioned on 30/10/2018 at 02:56:

[?]
> Obviously the Latin script should not use any kind of visual
> encoding, and even the superscript letters (initially introduced
> for something else, notably as distinct symbols for IPA) was not
> the correct path (it also has limitation because the superscript
> letters are quite limited; [?]

But for font designers to implement combining diacritics for use 
with preformatted superscripts, Unicode needs to explicitly allow 
or recommend the use of preformatted superscripts in abbreviations.

This use case is different from the use case that led to submit 
the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:

[?]
> The abbreviation in the postcard, rendered in plain text, is "Mr".
> Bringing U+02B3 or U+036C into the discussion just fuels the
> recurring demands for every Latin letter (and eventually those
> in other scripts) to be duplicated in subscript and superscript,
> ? la L2/18-206.

IMO this proposal implodes when considering that the preformatted 
characters are supposed to be inserted by the application rather 
than directly out of keyboard drivers. 

The document L2/18-206 seems to originate from the observation 
of poor fonts and rendering engines in low-end document editing 
software. As previously mentioned, the fix is already available 
using high-end DTP software. That is sustainable as long as no 
locales are impacted. What this thread is about is a digitally 
interoperable representation of actual languages. E.g. small caps 
is out of scope, given the postcard writer did not write the names 
in small caps, that in Latin script are merely a stylistic 
convention intended for scientific publication and so on ? while 
Cyrillic script currently uses ?small caps? to write in lowercase.

Cyrillic also uses the ? sign, that is mapped to the second level 
on key E03 ("3" key) on the Russian and other Cyrillic keyboards.
Russian keyboard layout:
https://docs.microsoft.com/en-us/globalization/keyboards/kbdru.html
Bulgaran (phonetic traditional) keyboard layout:
https://docs.microsoft.com/en-us/globalization/keyboards/kbdbgph1.html

Perhaps the Numero sign is used in Cyrillic after it had been encoded 
for East Asian as Dr Wallace via Unicode hinted on 28/10/2018 at 21:20:

[?]
> AIUI, ? was encoded as a compatibility character because it appears
> in some East Asian character sets

Still ? is also encoded in ISO/IEC 8859-5, at 0xf0.

Further, Dr Whistler via Unicode stated on 30/10/2018 at 05:54:

[?]
> The mere fact that some visual aspect of graphic representation on a 
> page of paper can be implemented via a mechanical typewriter does not, 
> ipso facto, mean that particular feature is plain text. The fact that I 
> could also implement superscripting and subscripting on a mechanical 
> typewriter via turning the platen up and down half a line, also does not 
> make *those* aspects of text styling plain text. either.

The reverse is true, too: The fact that some language representation was 
performed by tweaking the typewriter didn?t tag that representation as not 
plain text. E.g. the LATIN CAPITAL LETTER C WITH CEDILLA couldn?t be typed 
by holding Shift and hitting "?"?key E09, the "9" key?on a French keyboard. 
Nevertheless it is required for legibility when "?" occurs at the start of 
a sentence or in all-caps. 
The workaround was to type a COMMA over LATIN CAPITAL LETTER C. 

Likewise, SUPERSCRIPT TWO was available on French (France) typewriters, 
and Belgian French ones had SUPERSCRIPT THREE, too. Also, again, the now 
MODIFIER LETTER SMALL O was and still is emulated using the DEGREE SIGN 
(on level 2 of key E11). The fact that other superscript letters needed 
turning the platen does not make them belong to rich text, today.

It?s as Dr Kass via Unicode put it on 30/10/2018 at 10:09 when replying 
to Dr Whistler via Unicode (above):

[?]
> If the typist didn't intend to put a superscript "r" on that page with a 
> double underline, the typist wouldn't have bothered with all that jive.
>
> It's about the importance one places on respecting authorial intent.
>
[?]
> [?] Underscoring might be stripped without messing with the legibility,
> but so could tatweels and lots of other stuff. [?]

If the intent of Unicode is to discriminate Arabic script vs Latin script, 
that would be worth mentioning in the Standard. 

Making claims about interoperability and about unambiguous representation 
of all of the world?s scripts, Unicode is expected to do so for Latin, too.

Dr Bie? via Unicode wrote on 29/10/2018 at 06:40:

> > [?] It's a matter of opinion, and opinions often differ.
> 
> Well said, but I make the claim stronger; it depends on the purpose of
> the encoding and intended applications.

Dr Everson via Unicode replied to Dr Karocki on 28/10/2018 at 22:55:
> 
> I think that it is the _superscription_ that indicates the fact that
> it is an abbreviation.

Hence Unicode is expected to fully support the use of plain text 
superscript for those locales using superscript as an abbreviation 
indicator, in the same role as other locales may use colon or period, 
a usage that Dr D?rst via Unicode mentioned on 29/10/2018 at 08:04 
responding to Dr Everson?s 05:42 (same day) e-mail:

[?]
> I think this may depend on actual writing practice. In German at least, 
> it is customary to have dots (periods) at the end of abbreviations, and 
> using any other symbol, or not using the dot, would be considered an error.

So should be, in some locales among which French, not using superscript. 
It?s just that the perception of a superscript-less abbreviation that 
normally uses superscript, is biased by the computer keyboard layouts 
actually still in use (but hopefully soon to be enhanced by more complete 
layouts).

Now is Unicode inspired by typewriting practice when designing the encoding 
of Latin script, unlike what is done for potentially all other scripts?

Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn?t 
know when replying to Dr Ewell on 29/10/2018 at 21:27:

[?]
> The English abbreviation Mr was also frequently superscripted in the
> 15th-17th centuries, and that didn't mean anything special either - it
> was just part of a general convention of superscripting the final
> segment of abbreviations, probably inherited from manuscript practice.

So English dropped the superscript requirement for common abbreviations 
in the 17?? or 18?? century to keep it only for ordinals. Should Unicode 
now take example on English to pull down the representation of French?
Fortunately it does not, as the French ordinal indicators are now a part 
of CLDR, consistently with what the French national body intended when 
setting up again a design process of a locale-conformant keyboard.

The rest of superscript abbreviation letters should follow in CLDR 
when browsers will be using correct fonts for displaying the data.

We remember that The Unicode Standard explicitly specifies that the 
glyphs of all superscript or modifier letters of a script shall be equalized.
No ransom note effect is allowed in Unicode-conformant fonts (except for 
the purpose of artwork, as in Apple?s former San Francisco typeface).


Best regards,

Marcel


From unicode at unicode.org  Tue Oct 30 11:35:09 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 30 Oct 2018 17:35:09 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>
References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>
Message-ID: <1654688647.7700.1540917309959.JavaMail.www@wwinf2209>

On 30/10/18 17:01 I wrote:
> A Unicode-conformant way to represent 
> such abbreviations would IMO use U+1D49 followed by U+0301: ,??,.

Works actually fine in my browser. 
My apologies to font designers and foundries, already supporting the 
combining diacritics with superscript Latin letters. Only in my text editor 
it didn?t work, hence the commas instead of quotes bracketing the literal.

> We remember that The Unicode Standard explicitly specifies that the 
> glyphs of all superscript or modifier letters of a script shall be equalized.

There is too much interpretation in that statement. TUS actually specifies 
that no difference of usage is intended by a difference in naming schemes, 
i.e. MODIFIER LETTERs shall not be discriminated from those letters 
having SUPERSCRIPT in their name.

> No ransom note effect is allowed in Unicode-conformant fonts

It may not be explicitely prohibited, though it is not Unicode conformant.

Best regards,

Marcel


From unicode at unicode.org  Tue Oct 30 12:51:22 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 30 Oct 2018 10:51:22 -0700
Subject: A sign/abbreviation for "magister"
Message-ID: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>

Marcel Schneider wrote:

> This use case is different from the use case that led to submit
> the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:

I guess this is intended as a compliment. While many of the people you
quoted do have doctoral degrees, many others of us do not.

> E.g. small caps is out of scope, given the postcard writer did not
> write the names in small caps, that in Latin script are merely a
> stylistic convention intended for scientific publication and so on ?
> while Cyrillic script currently uses ?small caps? to write in
> lowercase.

You're joking, right?

?? ?? ?? ??

This undermines a lot of what you are claiming to know about writing
systems, and about the difference between case distinctions and styling.

 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Tue Oct 30 13:25:37 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 30 Oct 2018 18:25:37 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <320cc6c3-b698-1359-baee-d73d70075215@gmail.com>
 <28edcd88-2294-741c-e65f-eb52891459ae@att.net>
 <ff088d2c-00e8-d466-1585-6e48dc81290f@gmail.com>
 <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com>
Message-ID: <20181030182537.77eb9c26@JRWUBU2>

On Tue, 30 Oct 2018 11:43:14 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> Now what if we were future historians given the task of encoding both
> of those strings, from two different sources, and had no idea what
> those two strings were supposed to represent?? Wouldn't it be best to
> preserve both strings intact, as they were originally written?

In general, it is not possible to encode text in Unicodeif one has no
knowledge of what the text itself represents.  Some English typewriters
did not distinguish digit ?0? from capital letter ?O? or digit ?1? from
small letter ?l?.

Richard.


From unicode at unicode.org  Tue Oct 30 13:51:06 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 30 Oct 2018 19:51:06 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
Message-ID: <1918728727.9415.1540925466807.JavaMail.www@wwinf2209>

On 30/10/2018 at 18:59, Doug Ewell via Unicode wrote:
>
> Marcel Schneider wrote:
> 
> > This use case is different from the use case that led to submit
> > the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:
> 
> I guess this is intended as a compliment.

Right.

> While many of the people you
> quoted do have doctoral degrees, many others of us do not.

Making a safe distinction is beyond my knowledge, safest is not to discriminate.

> 
> > E.g. small caps is out of scope, given the postcard writer did not
> > write the names in small caps, that in Latin script are merely a
> > stylistic convention intended for scientific publication and so on ?
> > while Cyrillic script currently uses ?small caps? to write in
> > lowercase.
> 
> You're joking, right?

No, I wasn?t, nowhere.

> 
> ?? ?? ?? ??
> 
> This undermines a lot of what you are claiming to know about writing
> systems, and about the difference between case distinctions and styling.

Unfortunately, yes. My apologies to all Cyrillic scriptors hurted while I 
assumed that every Cyrillic capital letter is a big version of its lowercase.
It?s ironic, given I worked hard to revise the French nameslist, including 
the Cyrillic block, where I propose to make more subdivisions, the actual 
heading scheme seems to me as not being respectful enough.

Sorry.

Marcel


From unicode at unicode.org  Tue Oct 30 14:01:09 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 30 Oct 2018 19:01:09 +0000
Subject: Logical Order (was: A sign/abbreviation for "magister")
In-Reply-To: <CAGa7JC06yF7t4NJL=zfW1Nzeht=O_CsznWuO7PjFajpy9dSu8Q@mail.gmail.com>
References: <86tvl7tzkz.fsf@mimuw.edu.pl>
 <f4270c3b-c572-53cb-970b-d288586f7247@ix.netcom.com>
 <20181028081326.264dc079@JRWUBU2>
 <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net>
 <CAGa7JC1Zta1b-B6Gy=04VzywCNhaEv_oBJJZnX+9W+8ZF5OcRw@mail.gmail.com>
 <86in1mgevb.fsf@mimuw.edu.pl>
 <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com>
 <bb2273e1-7373-b674-b478-bc52336254db@it.aoyama.ac.jp>
 <e88791d0-4fe9-da67-c303-c581bc78b5da@ix.netcom.com>
 <CAGa7JC06yF7t4NJL=zfW1Nzeht=O_CsznWuO7PjFajpy9dSu8Q@mail.gmail.com>
Message-ID: <20181030190109.15458137@JRWUBU2>

On Tue, 30 Oct 2018 02:47:25 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> We are here at the line between what is pure visual encoding (e.g.
> using superscript letters), and logical encoding (as done eveywhere
> else in unicode with combining sequences; the most well known
> exceptions being for Thai script which uses the visual model).

For your information, Thai uses the logical encoding, almost by
definition.  The logical order is the order used in the backing store
(See Section 2.2, Unicode Design Principles
<http://www.unicode.org/versions/latest/ch02.pdf#G128>).  In the Thai
?combining sequences? you have in mind, the vowel symbols you have in
mind are classified as letters, so we do not have combining sequences!
There were ill-defined preposed logically following combining marks (in
the charts, but not the tables) in Unicode 1.0, but the problems with
implementing them in the Thai monosyllable ???? were so great that I
wonder if any one succeeded at the time  - <U+0E1E THAI LETTER PHO PHAN,
U+0E3A THAI VOWEL SIGN PHINTHU, U+0E25 THAI LETTER LO LING, U+0E70
THAI PHONETIC ORDER VOWEL SIGN SARA E, U+0E32 THAI VOWEL SIGN SARA
AA> with invisible PHINTHU, as opposed to   <U+0E2B THAI LETTER HO HIP,
U+0E3A THAI VOWEL SIGN PHINTHU, U+0E25 THAI LETTER LO LING, U+0E32 THAI
VOWEL SIGN SARA AA> with visible PHINTHU!

The official disinformation source,  http://www.unicode.org/glossary,
misdefines logical order
<http://www.unicode.org/glossary/#logical_order> to be ?the order in
which text is typed on a keyboard?.  So much for suggestions that one
should design keyboard interfaces to convert visual order to storage
order!

A striking example is New Tai Lue, whose standard ordering was changed
from phonetic order to visual order because it was found that the
logical order, even using the Unicode *character* encoding, was visual
order rather than phonetic order.

Richard.


From unicode at unicode.org  Tue Oct 30 15:26:22 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Tue, 30 Oct 2018 22:26:22 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>
References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>
Message-ID: <20181030202622.GA16380@macbook.localdomain>

On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote:
>               E.g. in Arabic script, superscript is considered worth 
> encoding and using without any caveat, whereas when Latin script is on, 
> superscripts are thrown into the same cauldron as underscoring.

Curious, what Arabic superscripts are encoded in Unicode?

Regards,
Khaled

From unicode at unicode.org  Tue Oct 30 15:26:42 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Tue, 30 Oct 2018 20:26:42 +0000 (GMT)
Subject: A sign/abbreviation for "magister"
References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209>
Message-ID: <slrnpthfk1.oh1.jcb@home.stevens-bradfield.com>

On 2018-10-30, Marcel Schneider via Unicode <unicode at unicode.org> wrote:
> Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn?t 
> know when replying to Dr Ewell on 29/10/2018 at 21:27:

>> The English abbreviation Mr was also frequently superscripted in the
>> 15th-17th centuries, and that didn't mean anything special either - it
>> was just part of a general convention of superscripting the final
>> segment of abbreviations, probably inherited from manuscript practice.
>
> So English dropped the superscript requirement for common abbreviations 

Who said anything about requirement? I didn't.
The practice of using superscripts to end abbreviations is alive and
well in manuscript - I do it myself in writting notes for myself. For
example, "condition" I will often write as "cond<sup>n</sup>", and
"equation" as "eq<sup>n</sup>".

> in the 17?? or 18?? century to keep it only for ordinals. Should Unicode 

What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
is not now (when superscripting looks very old-fashioned) and never
has been any requirement to superscript them, as far as I know -
though since the OED doesn't have an entry for "1st", I can't easily
check.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From unicode at unicode.org  Tue Oct 30 15:38:18 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 30 Oct 2018 13:38:18 -0700
Subject: A sign/abbreviation for "magister"
Message-ID: <20181030133818.665a7a7059d7ee80bb4d670165c8327d.4cbd4f03b4.wbe@email03.godaddy.com>

Julian Bradfield wrote:
 
>> in the 17?? or 18?? century to keep it only for ordinals. Should
>> Unicode
>
> What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
> is not now (when superscripting looks very old-fashioned) and never
> has been any requirement to superscript them, as far as I know -
> though since the OED doesn't have an entry for "1st", I can't easily
> check.
 
The English Wikipedia article "Ordinal number (linguistics)" does not
show numbers such as 1st, 2nd, etc. with superscripts, though as a
rich-text Web page, it could easily.
 
The article "English numerals" does include a bullet point: "The
suffixes -th, -st, -nd and -rd are occasionally written superscript
above the number itself." Note the word "occasionally."
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Tue Oct 30 16:02:43 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 30 Oct 2018 22:02:43 +0100 (CET)
Subject: A sign/abbreviation for "magister"
Message-ID: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209>

On 30/10/2018? at 21:34, Khaled Hosny via Unicode wrote:
>?
> On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote:
> > E.g. in Arabic script, superscript is considered worth?
> > encoding and using without any caveat, whereas when Latin script is on,?
> > superscripts are thrown into the same cauldron as underscoring.
>?
> Curious, what Arabic superscripts are encoded in Unicode?
?
First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671.
But it is a vowel sign. Many letters put above are called superscript?
when explaining in English.
?
There is the range U+FC5E..U+FC63 (presentation forms).
?
Best regards,
?
Marcel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181030/41fe02e7/attachment.html>

From unicode at unicode.org  Tue Oct 30 16:23:34 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 30 Oct 2018 14:23:34 -0700
Subject: [getting OT] Re: A sign/abbreviation for "magister"
Message-ID: <20181030142334.665a7a7059d7ee80bb4d670165c8327d.50dbbbe7bb.wbe@email03.godaddy.com>

Marcel Schneider replied to Khaled Hosny:

>>> E.g. in Arabic script, superscript is considered worth encoding and
>>> using without any caveat, [...]
>>
>> Curious, what Arabic superscripts are encoded in Unicode?
>
> [...] There is the range U+FC5E..U+FC63 (presentation forms).

Arabic presentation forms are never an example of anything, and their
use is full of caveats.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Tue Oct 30 16:32:57 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 30 Oct 2018 21:32:57 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
Message-ID: <cf090e2f-f581-628d-4845-1d85e79d0107@gmail.com>


Doug Ewell responded to Marcel Schneider,

 >> while Cyrillic script currently uses ?small caps? to write in
 >> lowercase.
 >
 > You're joking, right?
 >
 > ?? ?? ?? ??
 >
 > This undermines a lot of what you are claiming to know
 > about writing systems, and about the difference between
 > case distinctions and styling.

That seems unduly harsh.? None of us are perfect; we all make mistakes.? 
The lowercase part of Cyrillic casing pairs do resemble small caps for 
most letters.? One casual mistake given in an aside does not negate the 
rest of Marcel Schneider's points.? One error about a related script 
(Cyrillic) does not undermine his thoughtful expectations for the Latin 
script as a French language member of the Latin script user community.

As an aside, calling a mister a doctor isn't insulting but calling a 
doctor a mister might be.? I suppose we could all call each other 
magister here, just to be safe, but we can't seem to agree on how to 
encode its abbreviation.


From unicode at unicode.org  Tue Oct 30 16:50:27 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Tue, 30 Oct 2018 14:50:27 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <cf090e2f-f581-628d-4845-1d85e79d0107@gmail.com>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
 <cf090e2f-f581-628d-4845-1d85e79d0107@gmail.com>
Message-ID: <e79fbbf6-f7bf-5df4-84b3-5990b94e7389@att.net>


On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
> but we can't seem to agree on how to encode its abbreviation. 

For what it's worth, "mgr" seems to be the usual abbreviation in Polish 
for it.

--Ken


From unicode at unicode.org  Tue Oct 30 16:52:45 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Tue, 30 Oct 2018 23:52:45 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209>
References: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209>
Message-ID: <20181030215245.GB16380@macbook.localdomain>

On Tue, Oct 30, 2018 at 10:02:43PM +0100, Marcel Schneider wrote:
> On 30/10/2018? at 21:34, Khaled Hosny via Unicode wrote:
> >?
> > On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote:
> > > E.g. in Arabic script, superscript is considered worth?
> > > encoding and using without any caveat, whereas when Latin script is on,?
> > > superscripts are thrown into the same cauldron as underscoring.
> >?
> > Curious, what Arabic superscripts are encoded in Unicode?
> ?
> First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671.
> But it is a vowel sign. Many letters put above are called superscript?
> when explaining in English.

As you say, this is a vowel sign not a superscript letter, so the name
is a misnomer at best. It should have been called COMBINING ARABIC
LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic
it is called small or dagger alef.

> There is the range U+FC5E..U+FC63 (presentation forms).

That is a backward compatiplity block no one is supposed to use, there
are many such backward comatipility presentation forms even of Latin
script (U+FB00..U+FB4F).

So I don?t see what makes you think, based on this, that Unicode is
favouring Arabic or other scripts over Latin.

Regards,
Khaled

From unicode at unicode.org  Tue Oct 30 17:41:06 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 30 Oct 2018 23:41:06 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <e79fbbf6-f7bf-5df4-84b3-5990b94e7389@att.net>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
 <cf090e2f-f581-628d-4845-1d85e79d0107@gmail.com>
 <e79fbbf6-f7bf-5df4-84b3-5990b94e7389@att.net>
Message-ID: <F17F3B71-ACBB-4F93-B6DD-C0071F8139A5@telia.com>


> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode <unicode at unicode.org> wrote:
> 
> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>> but we can't seem to agree on how to encode its abbreviation. 
> 
> For what it's worth, "mgr" seems to be the usual abbreviation in Polish for it.

That seems to be the contemporary usage, but the postcard is from 1917, cf. the OP. Also, the transcription in the followup post suggests that the Polish script at the time, or at least of the author, differed from the commonly taught D'Nealian cursive [1], cf. the "z". A variation of the latter has ended up as the Unicode MATHEMATICAL SCRIPT letters, which is closer to the Swedish cursive [2] for some letters.

1. https://en.wikipedia.org/wiki/D'Nealian
2. https://sv.wikipedia.org/wiki/Skrivstil


From unicode at unicode.org  Wed Oct 31 00:45:13 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 31 Oct 2018 06:45:13 +0100
Subject: second attempt (was: A sign/abbreviation for "magister")
In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 (Doug Ewell via Unicode's message of "Mon, 29 Oct 2018 12:20:49
 -0700")
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
Message-ID: <86k1lypt3q.fsf@mimuw.edu.pl>


My previous attempt to send this mail was rejected by the list as
spam. If this one will not appear on the list, would you be so kind to
forward it to the list and the listmaster?

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]

> The abbreviation in the postcard, rendered in
> plain text, is "Mr".

The relevant fragment of the postcard in a loose translation is

    Use the following address: <Abbreviation1> <Abbreviation2> <name>...
    <Abbreviation1> is the abbreviation of magister.

I don't think your rendering

   Mr is the abbreviation of magister.

has the same meaning.

Please note that I didn't asked *whether* to encode the abbreviation. I
asked *how* to do it.

If you think it is impossible to encode it in Unicode (without using
PUA), just say this explicitely.

BTW, I find it strange that nobody refers to an old thread

https://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0117.html

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Wed Oct 31 02:27:47 2018
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Wed, 31 Oct 2018 07:27:47 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <1918728727.9415.1540925466807.JavaMail.www@wwinf2209>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
 <1918728727.9415.1540925466807.JavaMail.www@wwinf2209>
Message-ID: <02fe068b-b6f5-0bbd-9af2-338f70756806@it.aoyama.ac.jp>

On 2018/10/31 03:51, Marcel Schneider via Unicode wrote:
> On 30/10/2018 at 18:59, Doug Ewell via Unicode wrote:
>>
>> Marcel Schneider wrote:
>>
>>> This use case is different from the use case that led to submit
>>> the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29:
>>
>> I guess this is intended as a compliment.
> 
> Right.
> 
>> While many of the people you
>> quoted do have doctoral degrees, many others of us do not.

And even those who have such degrees don't expect them to be used on a 
mailing list.

> Making a safe distinction is beyond my knowledge, safest is not to discriminate.

Yes. The easiest way to not discriminate is to not use titles in mailing 
list discussions. That's what everybody else does, and what I highly 
recommend.

Regards,    Martin.


From unicode at unicode.org  Wed Oct 31 04:38:25 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Wed, 31 Oct 2018 09:38:25 +0000 (GMT)
Subject: second attempt (was: A sign/abbreviation for "magister")
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
Message-ID: <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>

On 2018-10-31, Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode <unicode at unicode.org> wrote:
> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[ as did I in private mail ]

>> The abbreviation in the postcard, rendered in
>> plain text, is "Mr".
>
> The relevant fragment of the postcard in a loose translation is
>
>     Use the following address: <Abbreviation1> <Abbreviation2> <name>...
>     <Abbreviation1> is the abbreviation of magister.
>
> I don't think your rendering
>
>    Mr is the abbreviation of magister.
>
> has the same meaning.

I do, for the reasons stated by many.

If the topic were a study of the ways in which people indicate
abbreviations by typographic or manuscript styling, then it would be
important to know the exact form of the marks; but that is not plain
text. One cannot expect to discuss detailed technical questions using only
plain text, other than by using language to describe the details.

> Please note that I didn't asked *whether* to encode the abbreviation. I
> asked *how* to do it.

Doug and I have argued that the encoding is "Mr". Further detail can be
given in natural language as a note. You could use the various hacks
you've discussed, with modifier letters; but that is not "encoding",
that is "abusing Unicode to do markup". At least, that's the view I
take!

Perhaps a more challenging case is that at one time in English, it was
common to write and print "the" as "y<sup>e</sup>" (from older
"?<sup>e</sup>"). Here, there is actually a potential contrast between
the forms "y<sup>e</sup>" ("the") and "ye" (2nd plural pronoun), and
the contrast could be realized: "the/ye idle braggarts are a curse
upon England". Is the encoding of "y<sup>e</sup>" to be "ye" or "the"?
A hard-line plain-texter such as myself would probably argue for
"the".


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From unicode at unicode.org  Wed Oct 31 05:12:16 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 03:12:16 -0700
Subject: second attempt
In-Reply-To: <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
Message-ID: <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/e0ab090d/attachment.html>

From unicode at unicode.org  Wed Oct 31 06:53:22 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 31 Oct 2018 11:53:22 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
Message-ID: <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com>


Responding to Julian Bradfield,

U+1D49 MODIFIER LETTER SMALL E
General Category:? Letter, Modifier
Decomposition Type <super>
Mapping:? U+0065

It's a spacing superscript Latin lower case "E".? It's a letter. People 
spell with letters.

"One of the goals of the Consortium is to preserve humanity's common 
linguistic heritage and provide universal access for the world's 
languages?past, present, and future."

Superscripts and subscripts are part of the Latin writing system. If the 
source says "y?" or "??", that's what I would enter into the database.? 
Otherwise it's just transcription, IMHO.? If the goal is to preserve the 
past by transcribing it, we could've done? that with ASCII.

Having "y?" or "??" in the database makes the database more 
human-readable than having mark-up such as "y<sup>e</sup>" and takes 
fewer bytes.

DUCET allows for desired collation results.? Searching for "y?" or "??" 
could get only those files which included the specific string and not 
all the files which include strings "ye", "?e", or "the".

The superscript lower case Latin "E" also has "grapheme base" listed as 
one of its binary properties, so it might be OK to add a line or two 
under one, if that's what's desired.

If the superscript lower case Latin letter "E", ("?"), cannot be used in 
this instance because it is supposed to *modify* the preceding 
character, then is its usage in this question a "hack"? It isn't 
modifying that ASCII quote at all.

Providing mark-up solutions isn't universal, but computer plain-text is.

For the OP's question, PUA for perfect display and no guarantee of 
interoperability, "Mr" for transcription, or (what Michael said 
initially) "M?".? I think it would be OK to add something like a 
combining equals sign below to Michael's suggested string and make it 
"M??", but it wouldn't display well unless a font's OpenType tables 
provided for it.


From unicode at unicode.org  Wed Oct 31 07:34:53 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 13:34:53 +0100 (CET)
Subject: A sign/abbreviation for "magister"
Message-ID: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>

Thank you for your feedback.
?
On 30/10/2018 at 22:52, Khaled Hosny wrote:
?
> > First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671.
> > But it is a vowel sign. Many letters put above are called superscript?
> > when explaining in English.
>?
> As you say, this is a vowel sign not a superscript letter, so the name
> is a misnomer at best. It should have been called COMBINING ARABIC
> LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic
> it is called small or dagger alef.
?
Thank you for this information. Indeed the current French translation?
named it:
0670 DIACRITIQUE VOYELLE ARABE ALIF EN CHEF
* l'appellation anglaise de ce caract?re est erron?e
http://hapax.qc.ca/ListeNoms-10.0.0.txt
Translation:
0670 COMBINING ARABIC VOWEL ALEF ABOVE
* the English designation of this character is mistaken
?
Sorry for mistyping its code point, and for forgetting these facts.
What?s surprising, then, may be the facility it was named using SUPERSCRIPT,?
while superscripts seemed to be disliked in the Standard.
?
I note, now, that it should be called COMBINING ARABIC LETTER ALEF ABOVE,
as you indicate. (Translating to French as DIACRITIQUE LETTRE ARABE ALIF EN CHEF).
?
>?
> > There is the range U+FC5E..U+FC63 (presentation forms).
>?
> That is a backward compatiplity block no one is supposed to use, there
> are many such backward comatipility presentation forms even of Latin
> script (U+FB00..U+FB4F).
>?
> So I don?t see what makes you think, based on this, that Unicode is
> favouring Arabic or other scripts over Latin.
?
Indeed it doesn?t. Sorry about my assumption, but I mainly cited Arabic?
first because its name starts with an A, and I remembered it uses a?
?SUPERSCRIPT? in running text.
?
Other scripts have:
10FC MODIFIER LETTER GEORGIAN NAR
# <super> 10DC
2D6F TIFINAGH MODIFIER LETTER LABIALIZATION MARK
# <super> 2D61
A69C MODIFIER LETTER CYRILLIC HARD SIGN
#  044A
A69D MODIFIER LETTER CYRILLIC SOFT SIGN
#  044C
[but the latter two are for dialectology]
These are in the Duployan block:
1BCA2 SHORTHAND FORMAT DOWN STEP
1BCA3 SHORTHAND FORMAT UP STEP
because vertical alignment is significant in stenography.
So it is in Latin script when superscript us used as an?
abbreviation indicator.
However I see that the subjoiners and subjoined letters?
are obeying to another scheme than what led to super- or?
subscript.
?
On 31/07/2018 at 08:27, Martin J. D?rst wrote:
>
> > Making a safe distinction is beyond my knowledge, safest is not to discriminate.
>
> Yes. The easiest way to not discriminate is to not use titles in mailing?
> list discussions. That's what everybody else does, and what I highly?
> recommend.
?
OK. That is sound practice, which I observed a long time, until I felt best using Dr.?
Thanks for clearing it up.
?
On 30/10/2018 at 21:34, Julian Bradfield via Unicode wrote:
?
> The practice of using superscripts to end abbreviations is alive and
> well in manuscript - I do it myself in writting notes for myself. For
> example, "condition" I will often write as "cond<sup>n</sup>", and
> "equation" as "eq<sup>n</sup>".
?
That tends to prove that legibility is suboptimal without superscripts,?
even in note/draft style, and consequently, in machine processed plain text?
?only more so? (quoting an expression from Ken Whistler?s reply to?
James Kass on 30/10/2018 05:54).
?
> > in the 17?? or 18?? century to keep it only for ordinals. Should Unicode?
>?
> What do you mean, for ordinals? If you mean 1st, 2nd etc., then there
> is not now (when superscripting looks very old-fashioned) and never
> has been any requirement to superscript them, as far as I know -
> though since the OED doesn't have an entry for "1st", I can't easily
> check.
?
Then French, Italian, Portuguese and Spanish seem to be the only locales having?
superscript ordinal indicator requirements, or preferences if you prefer.?
?
The following forum has a comprehensive explanation for English, and for Romance?
languages except French:
https://english.stackexchange.com/questions/111265/should-ordinal-indicators-be-inline
Especially it explains where the American English lining ordinal indicators came from.
?
English Wikipedia?s Ordinal indicator article?
https://en.wikipedia.org/wiki/Ordinal_indicator
states that ordinal indicators and superscript letters don?t share the same glyph,?
which would explain why there was an intent to project a proposal for encoding French?
ordinal indicators. (But I advised that that would be a waste of time, as Unicode?s?
preformatted superscripts are working out of the box.)?
?
Preformatted Unicode superscript small letters are meeting the French superscript?
requirement, that is found in:
http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
(in French). This brief article focuses on the spelling of the indicators,?
without questioning the fact that they are superscript.
?
On 31/08/2018 at 06:54, Janusz S. Bie? via Unicode wrote:
[?]
> BTW, I find it strange that nobody refers to an old thread
>?
> https://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0117.html
?
I thought at linking to some of my previous e-mails and would probably have picked?
this one. Thanks for remembering, and for reminding.
?
Best regards,
?
?
Marcel
?


From unicode at unicode.org  Wed Oct 31 09:57:20 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 15:57:20 +0100 (CET)
Subject: A sign/abbreviation for "magister" (was: Re: second attempt)
In-Reply-To: <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
Message-ID: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>

On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
> > You could use the various hacks
> > you've discussed, with modifier letters; but that is not "encoding",
> > that is "abusing Unicode to do markup". At least, that's the view I
> > take!
>
> +1

There seems to be a widespread confusion about what is plain text, and what 
Unicode is for. From an US-QWERTY point of view, a current mental representation 
of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute.
Unicode is granting to every language its plain text representation. If superscript
acts as abbreviation indicator in a given language, this is part of the plain text 
representation of that language. 

So far, so good. The core problem is now to determine whether superscript is 
mandatory, and baseline is fallback, or superscript is optional and decorative, 
and baseline is correct. That may be a matter of opinion, as has been suggested. 
However we know now a list of languages where superscript is mandatory, and 
baseline is fallback. Leaving English alone, these languages on themselves need 
the use of preformatted superscript letters being granted to them by the UTC.

Still in the beginning, when early Unicode set up the Standard, superscript
was ruled out of plain text, except when there was sort of a strong lobbying, 
like when Vietnamese precomposed letters were added. Phoneticists have a strong 
lobby, so they got some ranges of preformatted letters. To make sure nobody 
dare use them in running text elsewhere, all *new* superscript letters got names 
on a MODIFIER LETTER basis, while subscript letters got straightforward names 
having SUBSCRIPT in them. Additionally, strong caveats were published in TUS.

And the trick worked, as most of the time, one is now referring to the superscript 
letters using the ?modifier letter? label that Unicode have decked them out with.

That is why, today, any discussion is at risk of being subject to strong biases 
when its result should allow some languages to use their traditional abbreviation 
indicators, in an already encoded and implemented form. Fortunately the front has 
begun to move, as CLDR TC have granted ordinal indicators to the French locale 
per v34. 

Ordinal indicators are one category of abbreviation indicators. Consistently, the
already-ISO/IEC-8859-1-and-now-Unicode ordinal indicators are used also in titles
like "S?", "N? S?", as found in the navigation pane of:
http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea

I?m not quite sure whether some people would still argue that that string isn?t 
understood differently from "Na Sa".

> In general, I have a certain sympathy for the position that there is no universal
> answer for the dividing line between plain and styled text; there are some texts
> where the conventional division of plain test and styling means that the plain
> text alone will become somewhat ambiguous.

That is why phonetics need preformatted super- and subscripts, and so do languages
relying on superscript as an abbreviation indicator.

> We know that for mathematics, a different dividing line meant that it is possible
> to create an (almost) plain text version of many (if not most) mathematical
> texts; the conventions of that field are widely shared -- supporting a case for
> allowing a standard encoding to support it.

Referring to Murray Sargent?s UnicodeMath, a Nearly Plain Text Encoding of Mathematics, 
https://www.unicode.org/notes/tn28/
is always a good point in this discussion. UnicodeMath uses the full range of 
superscript digits, because the range is full. It does not use superscript letters, 
because their range is not full. Hence if superscript digits had stopped at the 
legacy range "???", only measurement units like the metric equivalents of sq ft and 
cb ft could be written with superscripts, and that is already allowed according to
TUS. I?m ignoring why superscript 1 was added to ISO/IEC 8859-1, though. Anyway, 
since phonetics need a full range of superscript and subscript digits, these were 
added to Unicode, and therefore are used in UnicodeMath.

Likewise, phonetics need a nearly-full range of superscript letters, so these were 
added to Unicode, and therefore are used in the digital representation of natural 
languages.

> However, it stops short of 100% support for edge cases, as does the ordinary
> plain text when used for "normal" texts. I think, on balance, that is OK.

That is not clear as long as ?ordinary plain text? is not defined for the purpose 
of this discussion. Since I have superscript small letters on live keys, and the 
superscript "?" even doubled on the same level as the digits (that it is used to 
transform into ordinals for most of them), my French keyboard layout driver allows 
the OS to output ordinary plain text consisting of various signs including 
superscript small Latin letters. 

Now is Unicode making a difference between ?plain text? and ?ordinary plain text??
There are various ways to ?clean up? the UCS, first removing presentation forms, 
then historic letters, then mathematical symbols, then why not emoji, and somewhere 
in-between, phonetic letters, among which superscripts. The result would then be 
?ordinary plain text? ? but to what purpose? Possibly so that all documents must be 
written up using TeX. Following that logic to its end would mean that composed 
letters should be removed, too, given they are accurately represented using escape 
sequences like "e\'" for "?".

> If there were another important notational convention, widely shared, 
> reasonably consistent and so on, then I see no principled objection to considering
> whether it should be supported (minus some edge cases) in its own form of
> plain text (with appropriate additional elements encoded).

I?m pleased to read that. Given the use of superscript in French is important, 
widely shared, and reasonably consistent, we need to know what it should be else. 
Certainly: supported by the local keyboard layout. Hopefully it will be, soon.

> The current case, transcribing a post-card to make the text searchable, for
> example, would fit the use case for ordinary plain text, with the warning against
> simulated effects of markup.

Triggering such a warning would need to first sort out whether a given representation 
is best encoded using plain text or using markup. If it?s plain text, then that is 
not simulating anything. The reverse is true: Markup simulates accurate plain text.
Searchability is ensured by equivalence classes. Google Search has most comprehensive 
equivalence classes, indexing even all mathematical preformatted Latin letters like 
plain ASCII.

> All other uses are better served by markup, whether
> SGML / XML style to capture identified features, or final-form rich text like PDF
> just preserving the appearance.

Agreed.

Best regards,

Marcel


From unicode at unicode.org  Wed Oct 31 10:45:21 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Wed, 31 Oct 2018 15:45:21 +0000 (GMT)
Subject: A sign/abbreviation for "magister" (was: Re: second attempt)
In-Reply-To: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
Message-ID: <9272010.33324.1541000721948.JavaMail.defaultUser@defaultHost>

There was a proposal, in the Bytext Report by Bernard Miller many years ago to introduce arrow parentheses characters, eight of them.

They were stateful, one character to mean that effectively everything following is superscript until told otherwise, and one for everything following is no longer superscript until told otherwise.

There were also pairs for subscript, for the upper limit of an integral and for the the lower limit of an integral and those two latter pairs could also be used with the capital sigma sign used to express the summation of a mathematical series.

Now, I appreciate that the statefulness of those suggested characters may still rule them out for implementation in plain text yet maybe an arrow parenthesis or something like it could be encoded that is like a combining accent character but has the effect of making the one character that it follows be a superscript character, and another similar character for subscripts. That would mean that any Unicode character could be used as a superscript or a subscript in plain text. Maybe another two, or maybe another four, such characters could be added so as to allow the limits of integrals and summations to be expressed in plain text using such a method.

These new characters could have a visible glyph as a fallback display yet not be displayed at all if, as a result of glyph substitution for the two character sequence, a superscript or subscript version of the first character of the two character sequence were displayed. 

William Overington

Wednesday 31 October 2018


----Original message----
>From : unicode at unicode.org
Date : 2018/10/31 - 14:57 (GMTST)
To : unicode at unicode.org
Subject : Re: A sign/abbreviation for &quot;magister&quot; (was: Re: second attempt)

On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
> > You could use the various hacks
> > you've discussed, with modifier letters; but that is not "encoding",
> > that is "abusing Unicode to do markup". At least, that's the view I
> > take!
>
> +1

There seems to be a widespread confusion about what is plain text, and what 
Unicode is for. From an US-QWERTY point of view, a current mental representation 
of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute.
Unicode is granting to every language its plain text representation. If superscript
acts as abbreviation indicator in a given language, this is part of the plain text 
representation of that language. 

So far, so good. The core problem is now to determine whether superscript is 
mandatory, and baseline is fallback, or superscript is optional and decorative, 
and baseline is correct. That may be a matter of opinion, as has been suggested. 
However we know now a list of languages where superscript is mandatory, and 
baseline is fallback. Leaving English alone, these languages on themselves need 
the use of preformatted superscript letters being granted to them by the UTC.

Still in the beginning, when early Unicode set up the Standard, superscript
was ruled out of plain text, except when there was sort of a strong lobbying, 
like when Vietnamese precomposed letters were added. Phoneticists have a strong 
lobby, so they got some ranges of preformatted letters. To make sure nobody 
dare use them in running text elsewhere, all *new* superscript letters got names 
on a MODIFIER LETTER basis, while subscript letters got straightforward names 
having SUBSCRIPT in them. Additionally, strong caveats were published in TUS.

And the trick worked, as most of the time, one is now referring to the superscript 
letters using the ?modifier letter? label that Unicode have decked them out with.

That is why, today, any discussion is at risk of being subject to strong biases 
when its result should allow some languages to use their traditional abbreviation 
indicators, in an already encoded and implemented form. Fortunately the front has 
begun to move, as CLDR TC have granted ordinal indicators to the French locale 
per v34. 

Ordinal indicators are one category of abbreviation indicators. Consistently, the
already-ISO/IEC-8859-1-and-now-Unicode ordinal indicators are used also in titles
like "S?", "N? S?", as found in the navigation pane of:
http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea

I?m not quite sure whether some people would still argue that that string isn?t 
understood differently from "Na Sa".

> In general, I have a certain sympathy for the position that there is no universal
> answer for the dividing line between plain and styled text; there are some texts
> where the conventional division of plain test and styling means that the plain
> text alone will become somewhat ambiguous.

That is why phonetics need preformatted super- and subscripts, and so do languages
relying on superscript as an abbreviation indicator.

> We know that for mathematics, a different dividing line meant that it is possible
> to create an (almost) plain text version of many (if not most) mathematical
> texts; the conventions of that field are widely shared -- supporting a case for
> allowing a standard encoding to support it.

Referring to Murray Sargent?s UnicodeMath, a Nearly Plain Text Encoding of Mathematics, 
https://www.unicode.org/notes/tn28/
is always a good point in this discussion. UnicodeMath uses the full range of 
superscript digits, because the range is full. It does not use superscript letters, 
because their range is not full. Hence if superscript digits had stopped at the 
legacy range "???", only measurement units like the metric equivalents of sq ft and 
cb ft could be written with superscripts, and that is already allowed according to
TUS. I?m ignoring why superscript 1 was added to ISO/IEC 8859-1, though. Anyway, 
since phonetics need a full range of superscript and subscript digits, these were 
added to Unicode, and therefore are used in UnicodeMath.

Likewise, phonetics need a nearly-full range of superscript letters, so these were 
added to Unicode, and therefore are used in the digital representation of natural 
languages.

> However, it stops short of 100% support for edge cases, as does the ordinary
> plain text when used for "normal" texts. I think, on balance, that is OK.

That is not clear as long as ?ordinary plain text? is not defined for the purpose 
of this discussion. Since I have superscript small letters on live keys, and the 
superscript "?" even doubled on the same level as the digits (that it is used to 
transform into ordinals for most of them), my French keyboard layout driver allows 
the OS to output ordinary plain text consisting of various signs including 
superscript small Latin letters. 

Now is Unicode making a difference between ?plain text? and ?ordinary plain text??
There are various ways to ?clean up? the UCS, first removing presentation forms, 
then historic letters, then mathematical symbols, then why not emoji, and somewhere 
in-between, phonetic letters, among which superscripts. The result would then be 
?ordinary plain text? ? but to what purpose? Possibly so that all documents must be 
written up using TeX. Following that logic to its end would mean that composed 
letters should be removed, too, given they are accurately represented using escape 
sequences like "e\'" for "?".

> If there were another important notational convention, widely shared, 
> reasonably consistent and so on, then I see no principled objection to considering
> whether it should be supported (minus some edge cases) in its own form of
> plain text (with appropriate additional elements encoded).

I?m pleased to read that. Given the use of superscript in French is important, 
widely shared, and reasonably consistent, we need to know what it should be else. 
Certainly: supported by the local keyboard layout. Hopefully it will be, soon.

> The current case, transcribing a post-card to make the text searchable, for
> example, would fit the use case for ordinary plain text, with the warning against
> simulated effects of markup.

Triggering such a warning would need to first sort out whether a given representation 
is best encoded using plain text or using markup. If it?s plain text, then that is 
not simulating anything. The reverse is true: Markup simulates accurate plain text.
Searchability is ensured by equivalence classes. Google Search has most comprehensive 
equivalence classes, indexing even all mathematical preformatted Latin letters like 
plain ASCII.

> All other uses are better served by markup, whether
> SGML / XML style to capture identified features, or final-form rich text like PDF
> just preserving the appearance.

Agreed.

Best regards,

Marcel


From unicode at unicode.org  Wed Oct 31 11:03:18 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Wed, 31 Oct 2018 18:03:18 +0200
Subject: A sign/abbreviation for "magister" (was: Re: second attempt)
In-Reply-To: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
Message-ID: <20181031160318.GD16380@macbook.localdomain>

On Wed, Oct 31, 2018 at 03:57:20PM +0100, Marcel Schneider via Unicode wrote:
> > We know that for mathematics, a different dividing line meant that it is possible
> > to create an (almost) plain text version of many (if not most) mathematical
> > texts; the conventions of that field are widely shared -- supporting a case for
> > allowing a standard encoding to support it.
> 
> Referring to Murray Sargent?s UnicodeMath, a Nearly Plain Text Encoding of Mathematics, 
> https://www.unicode.org/notes/tn28/
> is always a good point in this discussion. UnicodeMath uses the full range of 
> superscript digits, because the range is full. It does not use superscript letters, 
> because their range is not full. Hence if superscript digits had stopped at the 
> legacy range "???", only measurement units like the metric equivalents of sq ft and 
> cb ft could be written with superscripts, and that is already allowed according to
> TUS. I?m ignoring why superscript 1 was added to ISO/IEC 8859-1, though. Anyway, 
> since phonetics need a full range of superscript and subscript digits, these were 
> added to Unicode, and therefore are used in UnicodeMath.

A while I was localizing some application to Arabic and the developer
?helpfully? used m? for square meter, but that does not work for Arabic
because there is no superscript ? in Unicode, so I had to contact the
developer and ask for markup to be used for the superscript so that O
can use it as well. That nicely shows one of the problems with encoding
superscript symbols for arbitrary text styling in Unicode, you can?t
stop before duplicating the whole character repertoire or else you will be
discriminating against some writing system or uncommon usage.

Regards,
Khaled

From unicode at unicode.org  Wed Oct 31 11:20:47 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Wed, 31 Oct 2018 16:20:47 +0000 (GMT)
Subject: A sign/abbreviation for "magister"
References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>
Message-ID: <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>

On 2018-10-31, Marcel Schneider via Unicode <unicode at unicode.org> wrote:

> Preformatted Unicode superscript small letters are meeting the French superscript?
> requirement, that is found in:
> http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
> (in French). This brief article focuses on the spelling of the indicators,?
> without questioning the fact that they are superscript.

When one does question the Acad?mie about the fact, this is their
reply:

 Le fait de placer en exposant ces mentions est de convention
 typographique ; il convient donc de le faire. Les seules exceptions
 sont pour Mme et Mlle.

which, if my understanding of "convient" is correct, carefully does
quite say that it is *wrong* not to superscript, but that one should
superscript when one can because that is the convention in typography.

My original question was:

 Dans les imprim?s ou dans le manuscrit on ?crit "1<sup>er</sup>, 45<sup>e</sup>"
 etc. (J'utilise l'indication HTML pour les lettres sup?rieures.)

 La question est: est-ce que les lettres sup?rieures sont
 *obligatoires*, ou sont-ils simplement une question de style? C'est ?
 dire, si on ?crit "1er, 45e" etc., est-ce une erreur, ou un style
 simple mais correct? 

I did not think that their Dictionary desk would understand the
concept of plain text, so I didn't ask explicitly for their opinions
on encoding :)

Which takes us back to when typography is plain text...

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From unicode at unicode.org  Wed Oct 31 12:18:46 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 18:18:46 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181031160318.GD16380@macbook.localdomain>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
Message-ID: <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36>

On 31/10/2018 at 17:03, Khaled Hosny wrote:
>
> A while I was localizing some application to Arabic and the developer
> ?helpfully? used m? for square meter, but that does not work for Arabic
> because there is no superscript ? in Unicode, so I had to contact the
> developer and ask for markup to be used for the superscript so that O
> can use it as well. That nicely shows one of the problems with encoding
> superscript symbols for arbitrary text styling in Unicode, you can?t
> stop before duplicating the whole character repertoire or else you will be
> discriminating against some writing system or uncommon usage.

It seems to me that Arabic is lacking two characters when using Eastern 
Arabic digits, not Western Arabic. Unicode allowing the m? and m? unit 
notations, these should be implemented in any script using the same 
notation. Not the whole UCS, just these two, like Arabic per cent. Or do 
you have use cases in Arabic where superscript is used as an 
abbreviation indicator?

I don?t share the view according to which superscript is arbitrary in Latin.
There is a medieval tradition of superscripting. If it is in Arabic, then it 
would be limited to these two missing digits. Many many symbols were 
encoded for Arabic, notably mirrored arrows, so adding these two is quite
straightforward.

Sad that Arabic ? and ? are still missing.

Best regards,

Marcel


From unicode at unicode.org  Wed Oct 31 12:32:54 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 31 Oct 2018 18:32:54 +0100
Subject: second attempt
In-Reply-To: <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com> (Julian Bradfield
 via Unicode's message of "Wed, 31 Oct 2018 09:38:25 +0000 (GMT)")
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
Message-ID: <86in1im37d.fsf@mimuw.edu.pl>

On Wed, Oct 31 2018 at  9:38 GMT, Julian Bradfield via Unicode wrote:
> On 2018-10-31, Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode <unicode at unicode.org> wrote:
>> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:
>
> [ as did I in private mail ]
>
>>> The abbreviation in the postcard, rendered in
>>> plain text, is "Mr".
>>
>> The relevant fragment of the postcard in a loose translation is
>>
>>     Use the following address: <Abbreviation1> <Abbreviation2> <name>...
>>     <Abbreviation1> is the abbreviation of magister.
>>
>> I don't think your rendering
>>
>>    Mr is the abbreviation of magister.
>>
>> has the same meaning.
>
> I do, for the reasons stated by many.

How many?

I'm aware only of you and Doug Ewell.

>
> If the topic were a study of the ways in which people indicate
> abbreviations by typographic or manuscript styling, then it would be
> important to know the exact form of the marks; but that is not plain
> text.

Let me remind what plain text is according to the Unicode glossary:

    Computer-encoded text that consists only of a sequence of code
    points from a given standard, with no other formatting or structural
    information.

If you try to use this definition to decide what is and what is not a
character, you get vicious circle.

As mentioned already by others, there is no other generally accepted
definition of plain text.

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Wed Oct 31 12:37:56 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Wed, 31 Oct 2018 18:37:56 +0100
Subject: use vs mention (was: second attempt)
In-Reply-To: <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com> (Julian Bradfield
 via Unicode's message of "Wed, 31 Oct 2018 09:38:25 +0000 (GMT)")
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
Message-ID: <86efc6m2yz.fsf_-_@mimuw.edu.pl>

On Wed, Oct 31 2018 at  9:38 GMT, Julian Bradfield via Unicode wrote:
> On 2018-10-31, Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode <unicode at unicode.org> wrote:

[...]

>> The relevant fragment of the postcard in a loose translation is
>>
>>     Use the following address: <Abbreviation1> <Abbreviation2> <name>...
>>     <Abbreviation1> is the abbreviation of magister.
>>
>> I don't think your rendering
>>
>>    Mr is the abbreviation of magister.
>>
>> has the same meaning.
>
> I do

The author of the postcard definitely *referred* to the abbreviation in
the form *used* in the postcard.

We don't know whether the abbreviation "Mr", spelled exactly this way,
already existed in that time and in that geographical area.

You still don't see the difference in the meaning?

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien

From unicode at unicode.org  Wed Oct 31 13:10:16 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 19:10:16 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>
References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>
 <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>
Message-ID: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>

On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:
> 
> On 2018-10-31, Marcel Schneider via Unicode  wrote:
> 
> > Preformatted Unicode superscript small letters are meeting the French superscript 
> > requirement, that is found in:
> > http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux
> > (in French). This brief article focuses on the spelling of the indicators, 
> > without questioning the fact that they are superscript.
> 
> When one does question the Acad?mie about the fact, this is their
> reply:
> 
> Le fait de placer en exposant ces mentions est de convention
> typographique ; il convient donc de le faire. Les seules exceptions
> sont pour Mme et Mlle.
Translation: 
?Superscripting these mentions is typographical convention; 
consequently it is convenient to do so. The only exceptions are 
for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].?
> 
> which, if my understanding of "convient" is correct, carefully does
> quite say that it is *wrong* not to superscript, but that one should
> superscript when one can because that is the convention in typography.

Draft style may differ from mail style, and this, from typography, only 
due to the limitations imposed by input interfaces. These limitations are 
artificial and mainly the consequence of insufficient development of said 
interfaces. If the computer is anything good for, then that should also 
include the transition from typewriter fallbacks to the true digital 
representation of all natural languages. Latin not excluded.

> 
> My original question was:
> 
> Dans les imprim?s ou dans le manuscrit on ?crit "1er, 45e"
> etc. (J'utilise l'indication HTML pour les lettres sup?rieures.)
> 
> La question est: est-ce que les lettres sup?rieures sont
> *obligatoires*, ou sont-ils simplement une question de style? C'est ?
> dire, si on ?crit "1er, 45e" etc., est-ce une erreur, ou un style
> simple mais correct? 
Translation: 
?In print or handwriting one spells "1<sup>er</sup>, 45<sup>e</sup>", 
and so on. (I?m using HTML tags for the superscript letters.)

The question is: Are the superscript letters *mandatory*, 
or are they simply a matter of style? I.e. when writing "1er, 45e", 
is that a mistake, or a simple but correct style??
> 
> I did not think that their Dictionary desk would understand the
> concept of plain text, so I didn't ask explicitly for their opinions
> on encoding :)

If you don?t think that they would understand character encoding 
and the concept of plain text as described in the Unicode Standard, 
you may wish to explain it to them in detail prior to asking for 
their opinion on the subject.

Thank you anyway for letting us know.

> 
> Which takes us back to when typography is plain text...

When the typographc rendering is congruent with the underlying 
plain text, that means that there is no formatting; but that is quite
impossible given the minimal default settings include a font and 
a font-size. If the plain text is an interoperable representation of 
a natural language, and that language uses superscript as an 
abbreviation indicator, that superscript must be visible when the 
text string is displayed as-is. Else the string referred to as ?plain 
text? is at risk of not being a legible representation of the intended
content. If despite that risk it is, then you are lucky.

Best regards,

Marcel


From unicode at unicode.org  Wed Oct 31 13:27:00 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 11:27:00 -0700
Subject: second attempt
In-Reply-To: <86in1im37d.fsf@mimuw.edu.pl>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <86in1im37d.fsf@mimuw.edu.pl>
Message-ID: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/69c4d801/attachment.html>

From unicode at unicode.org  Wed Oct 31 13:35:19 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 11:35:19 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>
References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>
 <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>
 <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>
Message-ID: <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/43815eef/attachment.html>

From unicode at unicode.org  Wed Oct 31 14:14:36 2018
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Wed, 31 Oct 2018 12:14:36 -0700
Subject: second attempt
In-Reply-To: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
Message-ID: <ee24227c-9088-1585-1c5b-a6e024fd5e7d@att.net>


On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote:
> but we don't have an agreement that reproducing all variations in 
> manuscripts is in scope.

In fact, I would say that in the UTC, at least, we have an agreement 
that that clearly is out of scope!

Trying to represent all aspects of text in manuscripts, including 
handwriting conventions, as plain text is hopeless. There is no 
principled line to draw there before you get into arbitrary calligraphic 
conventions.

And while this list is happily deep-ending on handwritten lines under 
superscript Latin letters in Polish abbreviations, keep in mind that 
*Han* characters alone constitute over 64% of the encoded characters in 
Unicode -- and the handwriting, style, and calligraphic conventions for 
Han make Latin look simple. Here:

Japanese Postcard NY Greeting

That is a New Year's greeting snipped from a 1906 Japanese postcard. Oh, 
snap! What are we going to do to represent the *leaves* (or are they 
feathers?) being used for handwritten strokes in that text???

--Ken


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/0e8f7670/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: JapaneseNY.PNG
Type: image/png
Size: 56011 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/0e8f7670/attachment.png>

From unicode at unicode.org  Wed Oct 31 16:57:37 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 14:57:37 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
 <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36>
Message-ID: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/4b99815a/attachment.html>

From unicode at unicode.org  Wed Oct 31 17:30:27 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 31 Oct 2018 22:30:27 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com>
Message-ID: <0abcbf67-bd82-a761-f21a-eb8780223209@gmail.com>


In my last post I used the word "transcription".? It should have been 
"transliteration".? Sorry for the mistake.? Three times.

FWIW, here's a corrected re-post.

---

Responding to Julian Bradfield,

U+1D49 MODIFIER LETTER SMALL E
General Category:? Letter, Modifier
Decomposition Type <super>
Mapping:? U+0065

It's a spacing superscript Latin lower case "E".? It's a letter. People 
spell with letters.

"One of the goals of the Consortium is to preserve humanity's common 
linguistic heritage and provide universal access for the world's 
languages?past, present, and future."

Superscripts and subscripts are part of the Latin writing system. If the 
source says "y?" or "??", that's what I would enter into the database.? 
Otherwise it's just transliteration, IMHO.? If the goal is to preserve 
the past by transliterating it, we could've done that with ASCII.

Having "y?" or "??" in the database makes the database more 
human-readable than having mark-up such as "y<sup>e</sup>" and takes 
fewer bytes.

DUCET allows for desired collation results.? Searching for "y?" or "??" 
could get only those files which included the specific string and not 
all the files which include strings "ye", "?e", or "the".

The superscript lower case Latin "E" also has "grapheme base" listed as 
one of its binary properties, so it might be OK to add a line or two 
under one, if that's what's desired.

If the superscript lower case Latin letter "E", ("?"), cannot be used in 
this instance because it is supposed to *modify* the preceding 
character, then is its usage in this question a "hack"? It isn't 
modifying that ASCII quote at all.

Providing mark-up solutions isn't universal, but computer plain-text is.

For the OP's question, PUA for perfect display and no guarantee of 
interoperability, "Mr" for transliteration, or (what Michael said 
initially) "M?".? I think it would be OK to add something like a 
combining equals sign below to Michael's suggested string and make it 
"M??", but it wouldn't display well unless a font's OpenType tables 
provided for it.


From unicode at unicode.org  Wed Oct 31 17:32:09 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 15:32:09 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181031160318.GD16380@macbook.localdomain>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
Message-ID: <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/bb5f2ba1/attachment.html>

From unicode at unicode.org  Wed Oct 31 17:34:33 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 23:34:33 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
 <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36>
 <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com>
Message-ID: <1788983878.9257.1541025273955.JavaMail.www@wwinf2209>

On 31/10/18 at 23:05, Asmus Freytag via Unicode wrote:
[?]
> > Sad that Arabic ? and ? are still missing.
>
> How about all the other sets of native digits?

The missing ones are hopefully already on the roadmap.
Or do you refer to the missing ? and ? in all other native digits?
Obviously they need to be encoded if there is a demand like 
for Arabic.

Thanks for the call.

Best regards,

Marcel


From unicode at unicode.org  Wed Oct 31 17:37:13 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 31 Oct 2018 23:37:13 +0100 (CET)
Subject: A sign/abbreviation for "magister"
Message-ID: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209>

On 31/10/2018 19:42, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 11:10 AM, Marcel Schneider via Unicode wrote:
> > 
> > > which, if my understanding of "convient" is correct, carefully does
> > > [not] quite say that it is *wrong* not to superscript, but that one should
> > > superscript when one can because that is the convention in typography.
> >
> > Draft style may differ from mail style, and this, from typography, only 
> > due to the limitations imposed by input interfaces. These limitations are 
> > artificial and mainly the consequence of insufficient development of said 
> > interfaces. If the computer is anything good for, then that should also 
> > include the transition from typewriter fallbacks to the true digital 
> > representation of all natural languages. Latin not excluded.
> 
> It is a fallacy that all text output on a computer should match the convention 
> of "fine typography".
> 
> Much that is written on computers represents an (unedited) first draft. Giving 
> such texts the appearance of texts, which in the day of hot metal typography, 
> was reserved for texts that were fully edited and in many cases intended for 
> posterity is doing a disservice to the reader.
> 

The disconnect is in many people believing the user should be disabled to write 
his or her language without disfiguring it by lack of decent keyboarding, and 
that such input should be considered standard for user input. Making such text 
usable for publishing needs extra work, that today many users cannot afford, 
while the mass of publishing has increased exponentially over the past decades. 
The result is garbage, following the rule of ?garbage in, garbage out.? The real 
disservice to the reader is not to enable the inputting user to write his or her 
language correctly. A draft whose backbone is a string usable as-is for publishing
is not a disservice, but a service to the reader, paying the reader due respect. 
Such a draft is also a service to the user, enabling him or her to streamline the 
workflow. Such streamlining brings monetary and reputational benefit to the user.

That disconnect seems to originate from the time where the computer became a tool 
empowering the user to write in all of the world?s languages thanks to Unicode. 
The concept of ?fine typography? was then used to draw a borderline between what 
the user is supposed to input, and what he or she needs to get for publication. 
In the same move, that concept was extended in a way that it should include the 
quality of the string, additionally to what _fine typography_ really is: fine 
tuning of the page layout, such as vertical justification, slight variations in 
the width of non-breakable spaces, and of course, discretionary ligatures.

Producing a plain text string usable for publishing was then put out of reach 
of most common mortals, by using the lever of deficient keyboarding, but also 
supposedly by an ?encoding error? (scare quotes) in the line break property of 
U+2008 PUNCTUATION SPACE, that should be non-breakable like its siblings 
U+2007 FIGURE SPACE (still?as per UAX #14?recommended for use in numbers) and 
U+2012 FIGURE DASH to gain the narrow non-breaking space needed to space the 
triads in numbers using space as a group separator, and to space big punctuation 
in a Latin script using locale, where JTC1/SC2/WG2 had some meetings for the UCS:
French.

For everybody having beneath his or her hands a keyboard whose layout driver is 
programmed in a fully usable way, the disconnect implodes. At encoding and input 
levels (the only ones that are really on-topic in this thread) the sorcery called 
fine typography sums then up to nothing else than having the keyboard inserting 
fully diacriticized letters, right punctuation, accurate space characters, and 
superscript letters as ordinal indicators and abbreviation endings, depending 
on the requirements.

Now was I talking about ?all text output on a computer?? No, I wasn?t. 

The computer is able to accept input of publishing-ready strings, since we have 
Unicode. Precluding the user from using the needed characters by setting up 
caveats and prohibitions in the Unicode Standard seems to me nothing else than 
an outdated operating mode. U+202F NARROW NO-BREAK SPACE, encoded in 1999 for 
Mongolian [1][2], has been readily ripped off by the French graphic industry. 
In 2014, TUS started mentioning its use in French [3]; in 2018, it put it on 
top [4]. 
That seems to me a striking example of how things encoded for other purposes 
are reused (or following a certain usage, ?abused?, ?hacked?, ?hijacked?) in 
locales like French. If it wasn?t an insult to minority languages, that 
language could be called, too, ?digitally disfavored? in a certain sense.

> On the other hand, I'm a firm believer in applying certain styling attributes 
> to things like e-mail or discussion papers. Well-placed emphasis can make such 
> texts more readable (without requiring that they pay attention to all other 
> facets of "fine typography".)

The parenthesized sidenote (that is probably the intended main content?) makes 
this paragraph wrong. I?d buy it if either the parenthesis is removed or if it 
comes after the following.

With due respect, I need to add that the disconnect in that is visible only to 
French readers. Without NNBSP, punctuation ? la fran?aise in e-mails is messed 
up because even NBSP is ignored (I don?t know what exactly happens at backend; 
anyway at frontend it?s like a normal space in at least one e-mail client and 
in several if not all browsers, and if pasted in plain text from MS Word, it?s 
truly replaced with SP. All that makes e-mails harder to read. Correct spacing 
with punctuation in French is often considered ?fine-tuning?, but only if that 
punctuation spacing is not supported by the keyboard driver, and that?s still 
almost always the case, except on the updated version 1.1 of the b?po layout 
(and some personal prototypes not yet released).

Not using angle quotation marks doesn?t fix it, given four other punctuation 
marks still need spacing (and are almost forcibly spaced with SP by lack of 
anything better), and given not using angle quotation marks makes any French 
text harder to read when there is no means to distinguish citation quotes 
????? and scare quotes ??? following a scheme that may not be well known yet. 
See already [5] (with the reader comments) for an overview of the problem.

Thank you for your attention.

Best regards,

Marcel

[1] TUS version 3, chapter 6, page 150, table:
https://www.unicode.org/versions/Unicode3.0.0/ch06.pdf#%5B%7B%22num%22%3A4%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2Cnull%2C
214%2Cnull%5D

[2] TUS version 10 (the last one having detailed bookmarks), ch. 13, p. 534:
https://www.unicode.org/versions/Unicode10.0.0/ch13.pdf#I1.27802

[3] TUS version 7, chapter 6, page 265:
https://www.unicode.org/versions/Unicode7.0.0/ch06.pdf#G17097

[4] TUS version 11, chapter 6, page 265 (no direct link):
https://www.unicode.org/versions/Unicode11.0.0/ch06.pdf#G1834

[5] ??Les antiguillemets comme symboles de la postv?rit???, /Le Devoir/, 2016-12-30 (in French):
https://www.ledevoir.com/societe/actualites-en-societe/488139/mises-aux-points-les-antiguillemets-comme-symboles-de-la-postverite


From unicode at unicode.org  Wed Oct 31 17:58:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 31 Oct 2018 22:58:12 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <ee24227c-9088-1585-1c5b-a6e024fd5e7d@att.net>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <86in1im37d.fsf@mimuw.edu.pl>
 <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com>
 <ee24227c-9088-1585-1c5b-a6e024fd5e7d@att.net>
Message-ID: <51ead4ad-27c1-9e12-e5d8-f2c84da0b1c8@gmail.com>


Ken Whistler wrote,

 > Trying to represent all aspects of text in manuscripts,
 > including handwriting conventions, as plain text is
 > hopeless. There is no principled line to draw there
 > before you get into arbitrary calligraphic conventions.

Very much agree.? The post card in question is in cursive, for one 
thing, and the "t" in the spelled out word "Magister" isn't crossed.

It's all about where we draw the line.? I'd draw it on the "t" in this 
case, and enter the word into the data accordingly.


From unicode at unicode.org  Wed Oct 31 17:35:06 2018
From: unicode at unicode.org (Piotr Karocki via Unicode)
Date: Wed, 31 Oct 2018 23:35:06 +0100
Subject: use vs mention (was: second attempt)
Message-ID: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com>

>We don't know whether the abbreviation "Mr", spelled exactly this way,
>already existed in that time and in that geographical area.
>
>You still don't see the difference in the meaning?

 Maybe another example, from chemistry:

<sup>14</sup>C = isotope of carbon (carbon 14)
14C = 14 units of carbon (mole, atoms, molecule)
C<sub>14</sub> = 14 atoms of carbon
C<sup>I</sup> = carbon on first oxidation
CI = molecule of carbon and iodine
C<sup>V</sup> = carbon on fifth oxidation
CV = molecule of carbon and vanadium
CV<sup>V</sup> = molecule of carbon and vanadium, with vanadium on fifth
oxidation
C<sup>V</sup>V molecule of carbon and vanadium, with vanadium on fifth
oxidation, with carbon on fifth oxidation

Ca<sup>2+</sup> = plus sign means cation (of calcium with electrical charge
2)
Ca<sub>2</sub>+ = plus sign means adding something to molecule of two atoms
of calcium

etc.
So, what means 'plaintext' 14C? Which of two possible meanings?
So, what means 'plaintext' CVV? What means "Ca2+"?

Letter, digit, etc., placed as <sub> has different meanings than <sup>, and
different than no-sup and no-sub.

These are only examples of changes in meaning with <sup> or <sub>, not all
of these examples can really exist - but, then, another question: can we
know what author means? And as carbon and iodine cannot exist, then of
course CI should be interpreted as carbon on first oxidation? But maybe
author is student, taking exam, and he/she thinks about molecule of carbon
and iodine?


---8<---
Piotr Karocki

From unicode at unicode.org  Wed Oct 31 18:11:39 2018
From: unicode at unicode.org (Khaled Hosny via Unicode)
Date: Thu, 1 Nov 2018 01:11:39 +0200
Subject: A sign/abbreviation for "magister"
In-Reply-To: <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl>
 <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
 <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com>
Message-ID: <20181031231055.GJ16380@macbook.localdomain>

On Wed, Oct 31, 2018 at 03:32:09PM -0700, Asmus Freytag via Unicode wrote:
> On 10/31/2018 9:03 AM, Khaled Hosny via Unicode wrote:
> 
>     A while I was localizing some application to Arabic and the developer
>     ?helpfully? used m? for square meter, but that does not work for Arabic
>     because there is no superscript ? in Unicode, so I had to contact the
>     developer and ask for markup to be used for the superscript so that O
>     can use it as well.
> 
> This just pushes the issue down one level.
> 
> Because it assumes that the presence/absence of markup is locale-independent.
> 
> For translation of general text I know this is not true. There are instances
> where some words in certain languages are customarily italicized in a way that
> is not lexical, therefore not something where the source language would ever
> supply markup.

That was a while ago, but IIRC, the markup was enabled for that
particular widget unconditionally. The localizer is now free to use the
markup or not use it, the string was translatable as whole with the
embedded markup. It should be possible to enable markup for any widget,
it is just an option to tick off in the UI designer, but may experience
is that markup is seldom needed in computer UIs, but I may be biased
with the kind of UIs and locales I?m most familiar with.

Regards,
Khaled

From unicode at unicode.org  Wed Oct 31 18:33:20 2018
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Wed, 31 Oct 2018 16:33:20 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <20181031231055.GJ16380@macbook.localdomain>
References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com>
 <86k1lypt3q.fsf@mimuw.edu.pl> <slrnptiu0g.2c4.jcb@home.stevens-bradfield.com>
 <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com>
 <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36>
 <20181031160318.GD16380@macbook.localdomain>
 <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com>
 <20181031231055.GJ16380@macbook.localdomain>
Message-ID: <3221dbb9-86fb-75c4-abb7-d8bb292ec553@ix.netcom.com>

On 10/31/2018 4:11 PM, Khaled Hosny wrote:
> On Wed, Oct 31, 2018 at 03:32:09PM -0700, Asmus Freytag via Unicode wrote:
>> On 10/31/2018 9:03 AM, Khaled Hosny via Unicode wrote:
>>
>>      A while I was localizing some application to Arabic and the developer
>>      ?helpfully? used m? for square meter, but that does not work for Arabic
>>      because there is no superscript ? in Unicode, so I had to contact the
>>      developer and ask for markup to be used for the superscript so that O
>>      can use it as well.
>>
>> This just pushes the issue down one level.
>>
>> Because it assumes that the presence/absence of markup is locale-independent.
>>
>> For translation of general text I know this is not true. There are instances
>> where some words in certain languages are customarily italicized in a way that
>> is not lexical, therefore not something where the source language would ever
>> supply markup.
> That was a while ago, but IIRC, the markup was enabled for that
> particular widget unconditionally. The localizer is now free to use the
> markup or not use it, the string was translatable as whole with the
> embedded markup. It should be possible to enable markup for any widget,
> it is just an option to tick off in the UI designer, but may experience
> is that markup is seldom needed in computer UIs, but I may be biased
> with the kind of UIs and locales I?m most familiar with.

All makes sense now.

A./

>
> Regards,
> Khaled
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/3f2c7de6/attachment.html>

From unicode at unicode.org  Wed Oct 31 18:35:14 2018
From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode)
Date: Wed, 31 Oct 2018 23:35:14 +0000
Subject: A sign/abbreviation for "magister"
In-Reply-To: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>
References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>
 <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>
 <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>
Message-ID: <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp>

On 2018/11/01 03:10, Marcel Schneider via Unicode wrote:
> On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:

>> When one does question the Acad?mie about the fact, this is their
>> reply:
>>
>> Le fait de placer en exposant ces mentions est de convention
>> typographique ; il convient donc de le faire. Les seules exceptions
>> sont pour Mme et Mlle.
> Translation:
> ?Superscripting these mentions is typographical convention;
> consequently it is convenient to do so. The only exceptions are
> for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].?
>>
>> which, if my understanding of "convient" is correct, carefully does
>> quite say that it is *wrong* not to superscript, but that one should
>> superscript when one can because that is the convention in typography.

As for translation of "il convient", I think Julian is closer to the 
intended meaning. The verb "convenir" has several meanings (see e.g. 
https://www.collinsdictionary.com/dictionary/french-english/convenir), 
but especially in this impersonal usage, the meaning "it is advisable, 
it is right to, it is proper to" seems to be most appropriate in this 
context.

It may not at all be convenient (=practical) to use the superscripts, 
e.g. if they are not easily available on a keyboard.

Regards,   Martin.

(French isn't my native language, and nor is English)


From unicode at unicode.org  Wed Oct 31 19:21:08 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 17:21:08 -0700
Subject: A sign/abbreviation for "magister"
In-Reply-To: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209>
References: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209>
Message-ID: <c730a54f-71ef-f024-56b2-1e76f2e52635@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/32e3b3b7/attachment.html>

From unicode at unicode.org  Wed Oct 31 19:24:26 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Thu, 1 Nov 2018 01:24:26 +0100 (CET)
Subject: A sign/abbreviation for "magister"
In-Reply-To: <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp>
References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36>
 <slrnptjliv.fn4.jcb@home.stevens-bradfield.com>
 <23350023.8867.1541009416477.JavaMail.www@wwinf1d36>
 <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp>
Message-ID: <1579143918.9351.1541031866725.JavaMail.www@wwinf2209>

On 01/11/2018 at 00:41, Martin J. D?rst wrote:
> 
> On 2018/11/01 03:10, Marcel Schneider via Unicode wrote:
> > On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote:
> 
> >> When one does question the Acad?mie about the fact, this is their
> >> reply:
> >>
> >> Le fait de placer en exposant ces mentions est de convention
> >> typographique ; il convient donc de le faire. Les seules exceptions
> >> sont pour Mme et Mlle.
> > Translation:
> > ?Superscripting these mentions is typographical convention;
> > consequently it is convenient to do so. The only exceptions are
> > for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].?
> >>
> >> which, if my understanding of "convient" is correct, carefully does
> >> quite say that it is *wrong* not to superscript, but that one should
> >> superscript when one can because that is the convention in typography.
> 
> As for translation of "il convient", I think Julian is closer to the 
> intended meaning. The verb "convenir" has several meanings (see e.g. 
> https://www.collinsdictionary.com/dictionary/french-english/convenir), 
> but especially in this impersonal usage, the meaning "it is advisable, 
> it is right to, it is proper to" seems to be most appropriate in this 
> context.
> 
> It may not at all be convenient (=practical) to use the superscripts, 
> e.g. if they are not easily available on a keyboard.

Very good, thank you. I forgot about the meaning of ?convenient?, and
didn?t think at ?advisable? nor at ?right to, proper to?.

The point about keyboarding is essential. As long as superscripts are 
considered exotic or at least very special and need to be grabbed off 
a character picker, there is no point in bothering users with inputting 
them. But since that is going to change, it would be fine that Unicode 
be ready to back the corresponding keyboard layouts so that they 
won?t get challenged by the sort of considerations prevailing among 
hardliners. Partly, i.e. for fr(-FR) ordinal indicators, Unicode is ready.

Best regards,

Marcel
> 
> (French isn't my native language, and nor is English)
(Neither is mine either, but I?m based in France since a long time.)


From unicode at unicode.org  Wed Oct 31 20:01:51 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 31 Oct 2018 18:01:51 -0700
Subject: use vs mention (was: second attempt)
In-Reply-To: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com>
References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com>
Message-ID: <9a9790f7-39ca-5ddb-58c0-50dfb8cca6b8@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181031/4101e20b/attachment.html>

From unicode at unicode.org  Wed Oct 31 21:51:24 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 1 Nov 2018 03:51:24 +0100
Subject: A sign/abbreviation for "magister"
In-Reply-To: <e79fbbf6-f7bf-5df4-84b3-5990b94e7389@att.net>
References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com>
 <cf090e2f-f581-628d-4845-1d85e79d0107@gmail.com>
 <e79fbbf6-f7bf-5df4-84b3-5990b94e7389@att.net>
Message-ID: <CAGa7JC14F3iFhw_=C09XVH1TCBgO5x9aokzDjaR8PAo2Ay+=aQ@mail.gmail.com>

As is "M<sup>gr</sup>" for Monseigneur in French ("Mgr" without
superscripts makes little sense, and if "Mr" is sometimes found as an
abbreviation for "Monsieur", its standard abbreviation is "M.", and its
plural "Messieurs" is noted "MM" without any abbreviation dot or
superscript, but normally never as "Mrs" or "M<sup>rs</sup>"). If someone
finds "Mgr" without the superscript, it could think it is an English
abbreviation for "Manager" (a term now frequently used in the modern
"Frenglish" language used in French business)...

Le mar. 30 oct. 2018 ? 22:58, Ken Whistler via Unicode <unicode at unicode.org>
a ?crit :

>
> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
> > but we can't seem to agree on how to encode its abbreviation.
>
> For what it's worth, "mgr" seems to be the usual abbreviation in Polish
> for it.
>
> --Ken
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181101/aa69edee/attachment.html>