From unicode at unicode.org Mon Oct 1 05:23:47 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 1 Oct 2018 11:23:47 +0100 (BST) Subject: Teletext graphics characters Message-ID: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost> In the minutes of the recent meeting of the Unicode Technical Committee, document http://www.unicode.org/L2/L2018/18272.htm there is the following. quote E.2 Proposal to add characters from legacy computers and teletext to the UCS [Ewell, et al, L2/18-275R] On phone: Doug Ewell. Discussion. UTC took no action at this time. end quote Could someone possibly say please why the teletext graphics characters have still not been encoded as the change requested to their proposed encoding by the Unicode Technical Committee had been made and a revised document submitted before at least two of the previous meetings of the Unicode Technical Committee took place? Do the teletext graphics characters need to be resubmitted in a proposal document on their own for them to become encoded? As teletext is a great United Kingdom invention, does it need the United Kingdom National Body to propose their inclusion directly to the International Standards Organization? William Overington Monday 1 October 2018 From unicode at unicode.org Mon Oct 1 10:49:51 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Mon, 1 Oct 2018 08:49:51 -0700 Subject: Teletext graphics characters In-Reply-To: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost> References: <1332500.16009.1538389427263.JavaMail.defaultUser@defaultHost> Message-ID: There hasn't even been a response yet from the UTC members regarding the evidence they requested for encoding FOUR-BY-FOUR CHECKER BOARD as a distinct character from MEDIUM SHADE. They are most likely busy with other Unicode business and/or their personal lives. These things take time. Be patient. -- Rebecca Bettencourt On Mon, Oct 1, 2018 at 8:02 AM William_J_G Overington via Unicode < unicode at unicode.org> wrote: > In the minutes of the recent meeting of the Unicode Technical Committee, > document http://www.unicode.org/L2/L2018/18272.htm there is the following. > > quote > > E.2 Proposal to add characters from legacy computers and teletext to the > UCS [Ewell, et al, L2/18-275R] > > On phone: Doug Ewell. > > Discussion. UTC took no action at this time. > > end quote > > Could someone possibly say please why the teletext graphics characters > have still not been encoded as the change requested to their proposed > encoding by the Unicode Technical Committee had been made and a revised > document submitted before at least two of the previous meetings of the > Unicode Technical Committee took place? > > Do the teletext graphics characters need to be resubmitted in a proposal > document on their own for them to become encoded? > > As teletext is a great United Kingdom invention, does it need the United > Kingdom National Body to propose their inclusion directly to the > International Standards Organization? > > William Overington > > Monday 1 October 2018 > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Oct 2 02:45:31 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 2 Oct 2018 16:45:31 +0900 Subject: Dealing with Georgian capitalization in programming languages Message-ID: Since the last discussion on Georgian (Mtavruli) on this mailing list, I have been looking into how to implement it in the Programming language Ruby. Ruby has four case-conversion operations for its class String: upcase: convert all characters to upper case downcase: convert all characters to lower case swapcase: switch upper to lower and lower to upper case capitalize: uppercase (or title-case) the first character of the string, lowercase the rest 'upcase' and 'downcase' don't pose problems. 'swapcase' doesn't cause problems assuming the input doesn't have any problems. The only operation that can cause problems is 'capitalize'. When I say "cause problems", I mean producing mixed-case output. I originally thought that 'capitalize' would be fine. It is fine for lowercase input: I stays lowercase because Unicode Data indicates that titlecase for lowercase Georgian letters is the letter itself. But it will produce the apparently undesirable Mixed Case for ALL UPPERCASE input. My questions here are: - Has this been considered when Georgian Mtavruli was discussed in the UTC? - How have any other implementers (ICU,...) addressed this, in particular the operation that's called 'capitalize' in Ruby? Many thanks in advance for your input, Regards, Martin. From unicode at unicode.org Tue Oct 2 07:03:22 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Oct 2018 14:03:22 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Thanks to all for comments. Just revised the text in https://goo.gl/neguxb. Mark On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ?? wrote: > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 07:03:25 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Oct 2018 14:03:25 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Thanks, added a quote from you on that; see if it looks ok. Mark On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote: > This paper makes the default assumption that the internal storage of a > string is a featureless array. If this assumption is abandoned, it is > possible to get O(1) indexes with fairly low space overhead. The Scheme > language has recently adopted immutable strings called "texts" as a > supplement to its pre-existing mutable strings, and the sample > implementation for this feature uses a vector of either native strings or > bytevectors (char[] vectors in C/Java terms). I would urge anyone > interested in the question of storing and accessing mutable strings to read > the following parts of SRFI 135 at < > https://srfi.schemers.org/srfi-135/srfi-135.html>: Abstract, Rationale, > Specification / Basic concepts, and Implementation. In addition, the > design notes at , > though not up to date (in particular, UTF-16 internals are now allowed as > an alternative to UTF-8), are of interest: unfortunately, the link to the > span API has rotted. > > On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ?? 
via Unicore <
> unicore at unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
> -------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Tue Oct  2 07:03:38 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 2 Oct 2018 14:03:38 +0200
Subject: Unicode String Models
In-Reply-To: <20180909085929.2d4ff0d2@JRWUBU2>
References: <20180909085929.2d4ff0d2@JRWUBU2>
Message-ID: 

Mark

On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible. For example, the cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string. The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index. For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
>
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.
>

Thanks, added a quote from you on that; see if it looks ok.

> You keep muttering that a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units. Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?
>

I didn't neglect that, had a [TBD] for it. While invalid unpaired UTF-16
surrogates don't complicate processing much if they are treated as
unassigned characters, allowing invalid UTF-8 sequences is more
troublesome. See, for example, the convolutions needed in ICU methods that
allow ill-formed UTF-8.

> You seem to keep forgetting that grapheme clusters are not how some
> people work. Does the English word 'café' contain the letter
> 'e'? Yes or no? I maintain that it does. I can't help thinking that
> one might want to look for the letter '?' in Vietnamese and find it
> whatever the associated tone mark is.
>

I'm pretty familiar with the situation, thanks for asking. Often you want
to find out more about the components of grapheme clusters, so you always
need to be able to iterate through the code points they contain.

One might think that iterating by grapheme cluster is hiding features of
the text. For example, with *fox́* (fox\u{301}) it is easy to find that the
text contains an *x* by iterating through code points. But code points
often don't reveal their components: does the word *también* contain the
letter *e*? A reasonable question, but iterating by code point rather than
grapheme cluster doesn't help, since the é is typically encoded as a single
U+00E9. And even decomposing to NFD doesn't always help, as with cases
like *rødgrød*.

> You didn't discuss substrings.

I did.
But if you mean a definition of substring that lets you access internal components of substrings, I'm afraid that is quite a specialized usage. One could do it, but it would burden down the general use case. > I'm interested in how subsequences of > strings are defined, as the concept of 'substring' isn't really Unicode > compliant. Again, expressing '?' as a subsequence of the Vietnamese > word 'n?ng' ought to be possible, whether one is using NFD (easier) or > NFC. (And there are alternative normalisations that are compatible > with canonical equivalence.) I'm most interested in subsequences X of a > word W where W is the same as AXB for some strings A and B. > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 07:03:48 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Oct 2018 14:03:48 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Mark On Sun, Sep 9, 2018 at 3:42 PM Daniel B?nzli wrote: > Hello, > > I find your notion of "model" and presentation a bit confusing since it > conflates what I would call the internal representation and the API. > > The internal representation defines how the Unicode text is stored and > should not really matter to the end user of the string data structure. The > API defines how the Unicode text is accessed, expressed by what is the > result of an indexing operation on the string. The latter is really what > matters for the end-user and what I would call the "model". > Because of performance and storage consideration, you need to consider the possible internal data structures when you are looking at something as low-level as strings. But most of the 'model's in the document are only really distinguished by API, only the "Code Point model" discussions are segmented by internal storage, as with "Code Point Model: UTF-32" > I think the presentation would benefit from making a clear distinction > between the internal representation and the API; you could then easily > summarize them in a table which would make a nice summary of the design > space. > That's an interesting suggestion, I'll mull it over. > > I also think you are missing one API which is the one with ECG I would > favour: indexing returns Unicode scalar values, internally be it whatever > you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended > by the "Code Point Model: Internal 8/16/32" but that's not what it says, > the distinction between code point and scalar value is an important one and > I think it would be good to insist on it to clarify the minds in such > documents. > In reality, most APIs are not even going to be in terms of code points: they will return int32's. So not only are they not scalar values, 99.97% are not even code points. Of course, values above 10FFFF or below 0 shouldn't ever be stored in strings, but in practice treating non-scalar-value-code-points as "permanently unassigned" characters doesn't really cause problems in processing. > Best, > > Daniel > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Oct 2 07:04:09 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Oct 2018 14:04:09 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Mark On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode < unicode at unicode.org> wrote: > On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ?? via Unicode > wrote: > > > > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > * The Grapheme Cluster Model seems to have a couple of disadvantages > that are not mentioned: > 1) The subunit of string is also a string (a short string conforming > to particular constraints). There's a need for *another* more atomic > mechanism for examining the internals of the grapheme cluster string. > I did mention this. > 2) The way an arbitrary string is divided into units when iterating > over it changes when the program is executed on a newer version of the > language runtime that is aware of newly-assigned codepoints from a > newer version of Unicode. > Good point. I did mention the EGC definitions changing, but should point out that if you have a string with unassigned characters in it, they may be clustered on future versions. Will add. > * The Python 3.3 model mentions the disadvantages of memory usage > cliffs but doesn't mention the associated perfomance cliffs. It would > be good to also mention that when a string manipulation causes the > storage to expand or contract, there's a performance impact that's not > apparent from the nature of the operation if the programmer's > intuition works on the assumption that the programmer is dealing with > UTF-32. > The focus was on immutable string models, but I didn't make that clear. Added some text. > > * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM > text node storage in Gecko, (I believe but am not 100% sure) V8 and, > optionally, HotSpot > ( > https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A > ). > That is, text has UTF-16 semantics, but if the high half of every code > unit in a string is zero, only the lower half is stored. This has > properties analogous to the Python 3.3 model, except non-BMP doesn't > expand to UTF-32 but uses UTF-16 surrogate pairs. > Thanks, will add. > > * I think the fact that systems that chose UTF-16 or UTF-32 have > implemented models that try to save storage by omitting leading zeros > and gaining complexity and performance cliffs as a result is a strong > indication that UTF-8 should be recommended for newly-designed systems > that don't suffer from a forceful legacy need to expose UTF-16 or > UTF-32 semantics. > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting byte-oriented > data. Byte buffers and text buffers are type-wise ambiguous. Only > iterating over byte data by code point gives the data the UTF-8 > interpretation. Unless the data is cleaned up as a side effect of such > iteration, malformed sequences in input survive into output. 
> > 2) UTF-8 without full trust in ability to retain validity (the model > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > common UTF-8 model for C and C++, but I don't have evidence to back > this up): When data is ingested with text semantics, it is converted > to UTF-8. For data that's supposed to already be in UTF-8, this means > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > data is valid UTF-8 right after input. However, iteration by code > point doesn't trust ability of other code to retain UTF-8 validity > perfectly and has "else" branches in order not to blow up if invalid > UTF-8 creeps into the system. > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > have a different type in the type system than byte buffers. To go from > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > has been tagged as valid UTF-8, the validity is trusted completely so > that iteration by code point does not have "else" branches for > malformed sequences. If data that the type system indicates to be > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > language has a default "safe" side and an opt-in "unsafe" side. The > unsafe side is for performing low-level operations in a way where the > responsibility of upholding invariants is moved from the compiler to > the programmer. It's impossible to violate the UTF-8 validity > invariant using the safe part of the language. > Added a quote based on this; please check if it is ok. > > * After working with different string models, I'd recommend the Rust > model for newly-designed programming languages. (Not because I work > for Mozilla but because I believe Rust's way of dealing with Unicode > is the best I've seen.) Rust's standard library provides Unicode > version-independent iterations over strings: by code unit and by code > point. Iteration by extended grapheme cluster is provided by a library > that's easy to include due to the nature of Rust package management > (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8 > buffer as a read-only byte buffer has zero run-time cost and allows > for maximally fast guaranteed-valid-UTF-8 output. > > -- > Henri Sivonen > hsivonen at hsivonen.fi > https://hsivonen.fi/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 07:04:40 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Oct 2018 14:04:40 +0200 Subject: Unicode String Models In-Reply-To: <868t4b3v80.fsf@mimuw.edu.pl> References: <868t4b3v80.fsf@mimuw.edu.pl> Message-ID: Whether or not it is well suited, that's probably water under the bridge at this point. Think of it as a jargon at this point; after all, there are lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly a hit. Mark On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bie? wrote: > On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ?? via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > It's a good opportunity to propose a better term for "extended grapheme > cluster", which usually are neither extended nor clusters, it's also not > obvious that they are always graphemes. 
> > Cf.the earlier threads > > https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html > https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html > > Best regards > > Janusz > > -- > , > Janusz S. Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 13:31:02 2018 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Tue, 2 Oct 2018 20:31:02 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: On 2 October 2018 at 14:03:48, Mark Davis ?? via Unicode (unicode at unicode.org) wrote: > Because of performance and storage consideration, you need to consider the > possible internal data structures when you are looking at something as > low-level as strings. But most of the 'model's in the document are only > really distinguished by API, only the "Code Point model" discussions are > segmented by internal storage, as with "Code Point Model: UTF-32" I guess my gripe with the presentation of that document is that it perpetuates the problem of confusing "unicode characters" (or integers, or scalar values) and their *encoding* (how to represent these integers as byte sequences) which a source of endless confusion among programmers.? This confusion is easy lifted once you explain that there exists certain integers, the scalar values, which are your actual characters and then you have different ways of encoding your characters; one can then explain that a surrogate is not a character per se, it's a hack and there's no point in indexing them except if you want trouble. This may also suggest another taxonomy of classification for the APIs, those in which you work directly with the character data (the scalar values) and those in which you work with an encoding of the actual character data (e.g. a JavaScript string). > In reality, most APIs are not even going to be in terms of code points: > they will return int32's.? That reality depends on your programming language. If the latter supports type abstraction you can define an abstract type for scalar values (whose implementation may simply be an integer). If you always go through the constructor to create these "integers" you can maintain the invariant that a value of this type is an integer in the ranges [0x0000;0xD7FF] and [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you feed your "character" data to other processes like UTF-X encoders: it guarantees the correctness of their outputs regardless of what the programmer does. Best,? Daniel From unicode at unicode.org Tue Oct 2 15:12:36 2018 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Tue, 2 Oct 2018 13:12:36 -0700 Subject: Dealing with Georgian capitalization in programming languages In-Reply-To: References: Message-ID: On Tue, Oct 2, 2018 at 12:50 AM Martin J. D?rst via Unicode < unicode at unicode.org> wrote: > ... The only > operation that can cause problems is 'capitalize'. > > When I say "cause problems", I mean producing mixed-case output. I > originally thought that 'capitalize' would be fine. It is fine for > lowercase input: I stays lowercase because Unicode Data indicates that > titlecase for lowercase Georgian letters is the letter itself. But it > will produce the apparently undesirable Mixed Case for ALL UPPERCASE input. > > My questions here are: > - Has this been considered when Georgian Mtavruli was discussed in the > UTC? 
> - How have any other implementers (ICU,...) addressed this, in > particular the operation that's called 'capitalize' in Ruby? > By default, ICU toTitle() functions titlecase at word boundaries (with adjustment) and lowercase all else. That is, we implement Unicode chapter 3.13 Default Case Conversions R3 toTitlecase(x), except that we modified the default boundary adjustment. You can customize the boundaries (e.g., only the start of the string). We have options for whether and how to adjust the boundaries (e.g., adjust to the next cased letter) and for copying, not lowercasing, the other characters. See C++ and Java class CaseMap and the relevant options. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 16:07:56 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 2 Oct 2018 23:07:56 +0200 Subject: Dealing with Georgian capitalization in programming languages In-Reply-To: References: Message-ID: I see no easy way to convert ALL UPPERCASE text with consistant casing as there's no rule, except by using dictionnary lookups. In reality data should be input using default casing (as in dictionnary entries), independantly of their position in sentences, paragraphs or titles, and the contextual conversion of some or all characters to uppercase being done algorithmically (this is safe for conversion to ALL UPPERCASE, and quite reliable for conversion to Tile Case, with just a few dictionnary lookups for a small set of knows words per language. Note that title casing works differently in English (which is most often abusing by putting capitales on every word), while most other languages capitalize only selected words, or just the first selected word in French (in addition to the possible first letter of non-selected words such as definite and indefinite articles at start of the sentence). Capitalization of initials on every word is wrong in German which uses capitalisation even more strictly than French or Italian: when in doubts, do not perform any titlecasing, and allow data to provide the actual capitalization of titles directly (it is OK and even recommanded in German to have section headings, or even book titles, written as if they were in the middle of sentences, and you capitalize only titles and headings that are full sentences grammatically, but not simple nominal groups. So title casing should not even be promoted by the UCD standard (where it is in fact using only very basic, simplistic rules) and applicable only in some applications for some languages and in specific technical or rendering contexts. Le mar. 2 oct. 2018 ? 22:21, Markus Scherer via Unicode a ?crit : > On Tue, Oct 2, 2018 at 12:50 AM Martin J. D?rst via Unicode < > unicode at unicode.org> wrote: > >> ... The only >> operation that can cause problems is 'capitalize'. >> >> When I say "cause problems", I mean producing mixed-case output. I >> originally thought that 'capitalize' would be fine. It is fine for >> lowercase input: I stays lowercase because Unicode Data indicates that >> titlecase for lowercase Georgian letters is the letter itself. But it >> will produce the apparently undesirable Mixed Case for ALL UPPERCASE >> input. >> >> My questions here are: >> - Has this been considered when Georgian Mtavruli was discussed in the >> UTC? >> - How have any other implementers (ICU,...) addressed this, in >> particular the operation that's called 'capitalize' in Ruby? 
>> > > By default, ICU toTitle() functions titlecase at word boundaries (with > adjustment) and lowercase all else. > That is, we implement Unicode chapter 3.13 Default Case Conversions R3 > toTitlecase(x), except that we modified the default boundary adjustment. > > You can customize the boundaries (e.g., only the start of the string). > We have options for whether and how to adjust the boundaries (e.g., adjust > to the next cased letter) and for copying, not lowercasing, the other > characters. > See C++ and Java class CaseMap and the relevant options. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 2 16:43:27 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 2 Oct 2018 14:43:27 -0700 Subject: Dealing with Georgian capitalization in programming languages In-Reply-To: References: Message-ID: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net> On 10/2/2018 12:45 AM, Martin J. D?rst via Unicode wrote: > capitalize: uppercase (or title-case) the first character of the > string, lowercase the rest > > > When I say "cause problems", I mean producing mixed-case output. I > originally thought that 'capitalize' would be fine. It is fine for > lowercase input: I stays lowercase because Unicode Data indicates that > titlecase for lowercase Georgian letters is the letter itself. But it > will produce the apparently undesirable Mixed Case for ALL UPPERCASE > input. > > My questions here are: > - Has this been considered when Georgian Mtavruli was discussed in the > ? UTC? > Not explicitly, that I recall. The whole issue of titlecasing came up very late in the preparation of case mapping tables for Mtavruli and Mkhedruli for 11.0. But it seems to me that the problem you are citing can be avoided if you simply rethink what your "capitalize" means. It really should be conceived of as first lowercasing the *entire* string, and then titlecasing the *eligible* letters -- i.e., usually the first letter. (Note that this allows for the concept that titlecasing might then be localized on a per-writing-system basis -- the issue would devolve to determining what the rules are for "eligible" letters.) But the simple default would just be to titlecase the initial letter of each "word" segment of a string. Note that conceived this way, for the Georgian mappings, where the titlecase mapping for Mkhedruli is simply the letter itself, this approach ends up with: capitalize(mkhedrulistring) --> mkhedrulistring capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> mkhedrulistring Thus avoiding any mixed case. --Ken From unicode at unicode.org Wed Oct 3 02:17:10 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 3 Oct 2018 09:17:10 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Mark On Tue, Oct 2, 2018 at 8:31 PM Daniel B?nzli wrote: > On 2 October 2018 at 14:03:48, Mark Davis ?? via Unicode ( > unicode at unicode.org) wrote: > > > Because of performance and storage consideration, you need to consider > the > > possible internal data structures when you are looking at something as > > low-level as strings. 
But most of the 'model's in the document are only > > really distinguished by API, only the "Code Point model" discussions are > > segmented by internal storage, as with "Code Point Model: UTF-32" > > I guess my gripe with the presentation of that document is that it > perpetuates the problem of confusing "unicode characters" (or integers, or > scalar values) and their *encoding* (how to represent these integers as > byte sequences) which a source of endless confusion among programmers. > > This confusion is easy lifted once you explain that there exists certain > integers, the scalar values, which are your actual characters and then you > have different ways of encoding your characters; one can then explain that > a surrogate is not a character per se, it's a hack and there's no point in > indexing them except if you want trouble. > > This may also suggest another taxonomy of classification for the APIs, > those in which you work directly with the character data (the scalar > values) and those in which you work with an encoding of the actual > character data (e.g. a JavaScript string). > Thanks for the feedback. It is worth adding a discussion of the issues, perhaps something like: A code-point-based API takes and returns int32's, although only a small subset of the values are valid code points, namely 0x0..0x10FFFF. (In practice some APIs may support returning -1 to signal an error or termination, such as before or after the end of a string.) A surrogate code point is one in U+D800..U+DFFF; these reflect a range of special code units used in pairs in UTF-16 for representing code points above U+FFFF. A scalar value is a code point that is not a surrogate. A scalar-value API for immutable strings requires that no surrogate code points are ever returned. In practice, the main advantage of that API is that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked surrogate code point is relatively harmless: Unicode properties are devised so that clients can essentially treat them as (permanently) unassigned characters. Warning: an iterator should *never* avoid returning surrogate code points by skipping them; that can cause security problems; see https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences and https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters. There are two main choices for a scalar-value API: 1. Guarantee that the storage never contains surrogates. This is the simplest model. 2. Substitute U+FFFD for surrogates when the API returns code points. This can be done where #1 is not feasible, such as where the API is a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units that are not guaranteed to be UTF-16. The cost is extra tests on every code point access. > > In reality, most APIs are not even going to be in terms of code points: > > they will return int32's. > > That reality depends on your programming language. If the latter supports > type abstraction you can define an abstract type for scalar values (whose > implementation may simply be an integer). If you always go through the > constructor to create these "integers" you can maintain the invariant that > a value of this type is an integer in the ranges [0x0000;0xD7FF] and > [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you > feed your "character" data to other processes like UTF-X encoders: it > guarantees the correctness of their outputs regardless of what the > programmer does. 
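To make the invariant just described concrete, here is a minimal Python sketch of a checked scalar-value type. This is illustrative only, not code from the thread; the class name and API are invented for the example. The constructor performs the range check once, so later consumers such as UTF-8/16 encoders can rely on it, and boxing the value in a class is precisely the performance/storage cost discussed in the reply that follows.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ScalarValue:
        """A Unicode scalar value: an int in [0x0000..0xD7FF] or [0xE000..0x10FFFF]."""
        value: int

        def __post_init__(self):
            v = self.value
            if not (0 <= v <= 0xD7FF or 0xE000 <= v <= 0x10FFFF):
                raise ValueError(f"not a Unicode scalar value: {v:#x}")

        def to_int(self) -> int:
            # Going back to a plain integer is trivial (the identity on the payload).
            return self.value

    ok = ScalarValue(0x1F47D)     # U+1F47D, accepted
    try:
        ScalarValue(0xD83D)       # a lone surrogate code point: rejected at construction
    except ValueError as e:
        print(e)                  # not a Unicode scalar value: 0xd83d
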
> If the programming language provides for such a primitive datatype, that is possible. That would mean at a minimum that casting/converting to that datatype from other numerical datatypes would require bounds-checking and throwing an exception for values outside of [0x0000..0xD7FF 0xE000..0x10FFFF]. Most common-use programming languages that I know of don't support that for primitives; the API would have to use a class, which would be so very painful for performance/storage. If you (or others) know of languages that do have such a cheap primitive datatype, that would be worth mentioning! > Best, > > Daniel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 3 08:01:15 2018 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Wed, 3 Oct 2018 15:01:15 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: On 3 October 2018 at 09:17:10, Mark Davis ?? via Unicode (unicode at unicode.org) wrote: > There are two main choices for a scalar-value API: > > 1. Guarantee that the storage never contains surrogates. This is the > simplest model. > 2. Substitute U+FFFD for surrogates when the API returns code > points. This can be done where #1 is not feasible, such as where the API is > a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code units > that are not guaranteed to be UTF-16. The cost is extra tests on every code > point access. I'm not sure 2. really makes sense in pratice: it would mean you can't access scalar values? which needs surrogates to be encoded.? Also regarding 1. you can always defines an API that has this property regardless of the actual storage, it's only that your indexing operations might be costly as they do not directly map to the underlying storage array. That being said I don't think direct indexing/iterating for Unicode text is such an interesting operation due of course to the normalization/segmentation issues. Basically if your API provides them I only see these indexes as useful ways to define substrings. APIs that identify/iterate boundaries (and thus substrings) are more interesting due to the nature of Unicode text. > If the programming language provides for such a primitive datatype, that is > possible. That would mean at a minimum that casting/converting to that > datatype from other numerical datatypes would require bounds-checking and > throwing an exception for values outside of [0x0000..0xD7FF > 0xE000..0x10FFFF].? Yes. But note that in practice if you are in 1. above you usually perform this only at the point of decoding where you are already performing a lot of other checks. Once done you no longer need to check anything as long as the operations you perform on the values preserve the invariant.?Also converting back to an integer if you need one is a no-op: it's the identity function.? The OCaml Uchar module does this. This is the interface:? ??https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli which defines the type t as abstract and here is the implementation:? ??https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml which defines the implementation of type t = int which means values of this type are an *unboxed* OCaml integer (and will be stored as such in say an OCaml array). However since the module system enforces type abstraction the only way of creating such values is to use the constants or the constructors (e.g. 
of_int) which all maintain the scalar value invariant (if you disregard the unsafe_* functions).? Note that it would perfectly be possible to adopt a similar approach in C via a typedef though given C's rather loose type system a little bit more discipline would be required from the programmer (always go through the constructor functions to create values of the type). Best,? Daniel From unicode at unicode.org Wed Oct 3 08:41:42 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 3 Oct 2018 15:41:42 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Mark On Wed, Oct 3, 2018 at 3:01 PM Daniel B?nzli wrote: > On 3 October 2018 at 09:17:10, Mark Davis ?? via Unicode ( > unicode at unicode.org) wrote: > > > There are two main choices for a scalar-value API: > > > > 1. Guarantee that the storage never contains surrogates. This is the > > simplest model. > > 2. Substitute U+FFFD for surrogates when the API returns code > > points. This can be done where #1 is not feasible, such as where the API > is > > a shim on top of a (perhaps large) IO buffer of buffer of 16-bit code > units > > that are not guaranteed to be UTF-16. The cost is extra tests on every > code > > point access. > > I'm not sure 2. really makes sense in pratice: it would mean you can't > access scalar values > which needs surrogates to be encoded. > Let me clear that up; I meant that "the underlying storage never contains something that would need to be represented as a surrogate code point." Of course, UTF-16 does need surrogate code units. What #1 would be excluding in the case of UTF-16 would be unpaired surrogates. That is, suppose the underlying storage is UTF-16 code units that don't satisfy #1. 0061 D83D DC7D 0061 D83D A code point API would return for those a sequence of 4 values, the last of which would be a surrogate code point. 00000061, 0001F47D, 00000061, 0000D83D A scalar value API would return for those also 4 values, but since we aren't in #1, it would need to remap. 00000061, 0001F47D, 00000061, 0000FFFD > > Also regarding 1. you can always defines an API that has this property > regardless of the actual storage, it's only that your indexing operations > might be costly as they do not directly map to the underlying storage array. > That being said I don't think direct indexing/iterating for Unicode text > is such an interesting operation due of course to the > normalization/segmentation issues. Basically if your API provides them I > only see these indexes as useful ways to define substrings. APIs that > identify/iterate boundaries (and thus substrings) are more interesting due > to the nature of Unicode text. > I agree that iteration is a very common case. But quite often implementations need to have at least opaque indexes (as discussed). > > > If the programming language provides for such a primitive datatype, that > is > > possible. That would mean at a minimum that casting/converting to that > > datatype from other numerical datatypes would require bounds-checking and > > throwing an exception for values outside of [0x0000..0xD7FF > > 0xE000..0x10FFFF]. > > Yes. But note that in practice if you are in 1. above you usually perform > this only at the point of decoding where you are already performing a lot > of other checks. Once done you no longer need to check anything as long as > the operations you perform on the values preserve the invariant. Also > converting back to an integer if you need one is a no-op: it's the identity > function. 
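For the second choice mentioned above (a scalar-value view over a buffer of 16-bit code units that is not guaranteed to be well-formed UTF-16), the following small Python sketch shows the U+FFFD substitution. It is illustrative only, not code from the thread, and it reproduces the 0061 D83D DC7D 0061 D83D example given earlier in this message.

    def scalar_values(units):
        """Iterate scalar values over 16-bit code units that may not be
        well-formed UTF-16; unpaired surrogates are substituted with U+FFFD,
        never silently skipped (see the TR36 links cited earlier in the thread)."""
        i, n = 0, len(units)
        while i < n:
            u = units[i]
            if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # Well-formed surrogate pair: combine into a supplementary code point.
                yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                i += 2
            elif 0xD800 <= u <= 0xDFFF:
                yield 0xFFFD          # unpaired surrogate
                i += 1
            else:
                yield u
                i += 1

    units = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]
    print(", ".join(f"{v:08X}" for v in scalar_values(units)))
    # 00000061, 0001F47D, 00000061, 0000FFFD
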
> If it is a real datatype, with strong guarantees that it *never* contains values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion from number will require checking. And in my experience, without a strong guarantee the datatype is in practice pretty useless. > > The OCaml Uchar module does this. This is the interface: > > https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli > > which defines the type t as abstract and here is the implementation: > > https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml > > which defines the implementation of type t = int which means values of > this type are an *unboxed* OCaml integer (and will be stored as such in say > an OCaml array). However since the module system enforces type abstraction > the only way of creating such values is to use the constants or the > constructors (e.g. of_int) which all maintain the scalar value invariant > (if you disregard the unsafe_* functions). > > Note that it would perfectly be possible to adopt a similar approach in C > via a typedef though given C's rather loose type system a little bit more > discipline would be required from the programmer (always go through the > constructor functions to create values of the type). That's the C motto: "requiring a 'bit more' discipline from programmers" > > Best, > > Daniel > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 3 09:15:55 2018 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Wed, 3 Oct 2018 16:15:55 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: On 3 October 2018 at 15:41:42, Mark Davis ?? via Unicode (unicode at unicode.org) wrote: ? > Let me clear that up; I meant that "the underlying storage never contains > something that would need to be represented as a surrogate code point." Of > course, UTF-16 does need surrogate code units. What #1 would be excluding > in the case of UTF-16 would be unpaired surrogates. That is, suppose the > underlying storage is UTF-16 code units that don't satisfy #1. > > 0061 D83D DC7D 0061 D83D > > A code point API would return for those a sequence of 4 values, the last of > which would be a surrogate code point. > > 00000061, 0001F47D, 00000061, 0000D83D > > A scalar value API would return for those also 4 values, but since we > aren't in #1, it would need to remap. > > 00000061, 0001F47D, 00000061, 0000FFFD Ok understood. But I think that if you go to the length of providing a scalar-value API you would also prevent the construction of strings that have such anomalities in the first place (e.g. by erroring in the constructor if you provide it with malformed UTF-X data), i.e. maintain 1. From a programmer's perspective I really don't get anything from 2. except confusion. > If it is a real datatype, with strong guarantees that it *never* contains > values outside of [0x0000..0xD7FF 0xE000..0x10FFFF], then every conversion > from number will require checking. And in my experience, without a strong > guarantee the datatype is in practice pretty useless. Sure. My point was that the places where you perform this check are few in practice. Namely mainly at the IO boundary of your program where you actually need to deal with encodings and, additionally, whenever you define scalar value constants (a check that could actually be performed by your compiler if your language provides a literal notation for values of this type). Best,? 
Daniel From unicode at unicode.org Thu Oct 4 04:37:25 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Thu, 4 Oct 2018 18:37:25 +0900 Subject: Dealing with Georgian capitalization in programming languages In-Reply-To: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net> References: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net> Message-ID: Ken, Markus, Many thanks for your ideas, which I noted at https://bugs.ruby-lang.org/issues/14839. Regards, Martin. On 2018/10/03 06:43, Ken Whistler wrote: > > On 10/2/2018 12:45 AM, Martin J. D?rst via Unicode wrote: >> My questions here are: >> - Has this been considered when Georgian Mtavruli was discussed in the >> ? UTC? >> > Not explicitly, that I recall. The whole issue of titlecasing came up > very late in the preparation of case mapping tables for Mtavruli and > Mkhedruli for 11.0. > > But it seems to me that the problem you are citing can be avoided if you > simply rethink what your "capitalize" means. It really should be > conceived of as first lowercasing the *entire* string, and then > titlecasing the *eligible* letters -- i.e., usually the first letter. > (Note that this allows for the concept that titlecasing might then be > localized on a per-writing-system basis -- the issue would devolve to > determining what the rules are for "eligible" letters.) But the simple > default would just be to titlecase the initial letter of each "word" > segment of a string. > > Note that conceived this way, for the Georgian mappings, where the > titlecase mapping for Mkhedruli is simply the letter itself, this > approach ends up with: > > capitalize(mkhedrulistring) --> mkhedrulistring > > capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) --> > mkhedrulistring > > Thus avoiding any mixed case. From unicode at unicode.org Thu Oct 4 10:40:16 2018 From: unicode at unicode.org (Rick McGowan via Unicode) Date: Thu, 04 Oct 2018 08:40:16 -0700 Subject: Unicode CLDR 34 beta available for testing Message-ID: <5BB63460.3040102@unicode.org> The *beta* version of Unicode CLDR 34 is available for testing. The final release is expected on October 12. CLDR 34 provides an update to the key building blocks for software supporting the world?s languages. This data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for such common software tasks. CLDR 34 included a full Survey Tool data collection phase. Other enhancements include several changes to prepare for the new Japanese calendar era starting 2019-05-01; updated emoji names, annotations, collation and grouping; and other specific fixes. The draft release page at http://cldr.unicode.org/index/downloads/cldr-34 lists the major features, and has pointers to the newest data and charts. It will be fleshed out over the coming weeks with more details, migration issues, known problems, and so on. Particularly useful for review are: * Delta Charts - the data that changed during the release * By-Type Charts - a side-by-side comparison of data from different locales * Annotation Charts - new emoji names and keywords Please report any problems that you find using a CLDR ticket . We?d also appreciate it if programmatic users of CLDR data download the xml files and do a trial integration to see if any problems arise. -------------- next part -------------- An HTML attachment was scrubbed... 
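As an illustration of the capitalize-as-titlecase(lowercase(x)) reading quoted above, here is a minimal Python sketch. It is illustrative only, not code from the thread; whether Mtavruli actually lowercases to Mkhedruli here depends on the Python build carrying Unicode 11.0 case data, so the Georgian behaviour noted in the comments is an assumption about the environment rather than a guarantee.

    def capitalize(s: str) -> str:
        """Lowercase the entire string, then titlecase its first character,
        i.e. titlecase(lowercase(x)) as suggested above."""
        if not s:
            return s
        lowered = s.lower()
        # str.title() on a single character applies that character's Unicode
        # titlecase mapping. (A fuller version would titlecase the first
        # *cased* letter, per the "eligible letters" note above.)
        return lowered[0].title() + lowered[1:]

    print(capitalize("hello WORLD"))   # Hello world
    # With Unicode 11.0 case data, Georgian comes out all-Mkhedruli either way,
    # since the titlecase of a Mkhedruli letter is the letter itself:
    #   capitalize(mkhedruli_string)  -> mkhedruli_string
    #   capitalize(MTAVRULI_STRING)   -> mkhedruli_string
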
URL: 

From unicode at unicode.org  Tue Oct  9 02:47:14 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Tue, 9 Oct 2018 16:47:14 +0900
Subject: Dealing with Georgian capitalization in programming languages
In-Reply-To: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
References: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
Message-ID: 

Hello Ken, others,

On 2018/10/03 06:43, Ken Whistler wrote:

> But it seems to me that the problem you are citing can be avoided if you
> simply rethink what your "capitalize" means. It really should be
> conceived of as first lowercasing the *entire* string, and then
> titlecasing the *eligible* letters -- i.e., usually the first letter.
> (Note that this allows for the concept that titlecasing might then be
> localized on a per-writing-system basis -- the issue would devolve to
> determining what the rules are for "eligible" letters.) But the simple
> default would just be to titlecase the initial letter of each "word"
> segment of a string.
>
> Note that conceived this way, for the Georgian mappings, where the
> titlecase mapping for Mkhedruli is simply the letter itself, this
> approach ends up with:
>
> capitalize(mkhedrulistring) --> mkhedrulistring
>
> capitalize(MTAVRULISTRING) ==> titlecase(lowercase(MTAVRULISTRING)) -->
> mkhedrulistring
>
> Thus avoiding any mixed case.

I have been thinking through this. It seems quite appealing.

But I'm concerned there may be some edge cases. I have been able to come
up with two so far:

- Applying this to a string starting with upper-case SZ (U+1E9E).
This may change SZ → ß → Ss.

- Using the 'capitalize' method to (try to) get the titlecase
property of a MTAVRULI character. (There's no other way
currently in Ruby to get the titlecase property.)

There may be others. If you have some ideas, I'd appreciate to know
about them.

This lets me wonder why the UTC didn't simply declare the titlecase
property of MTAVRULI to be mkhedruli. Was this considered or not? The
way things are currently set up, there seems to be no benefit of
MTAVRULI being its own titlecase, because in actual use, that requires
additional processing.

Regards,   Martin.

From unicode at unicode.org  Tue Oct  9 03:22:25 2018
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Tue, 9 Oct 2018 10:22:25 +0200
Subject: Aw: Re: Dealing with Georgian capitalization in programming languages
In-Reply-To: 
References: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net>
Message-ID: 

The capital ẞ (U+1E9E) has been officially approved by the Council for
the German Language since July 2018. However, there is no word starting
with ß, which means the character is only relevant for fully capitalized
words. It may only stand alone in spaced type, when there is no available
italic font style.

In the Ruby bug tracker there is also an issue with Dutch ij → IJ. The
dedicated ligatures Ĳ (U+0132) and ĳ (U+0133) are not recommended and thus
never used, but leading ij must always be capitalized to IJ, as in
IJSBERG → ijsberg → IJsberg.

The actual problem is that the current capitalization algorithm is based
on a regular grammar (type 3). It has to be adjusted for a
context-sensitive (type 1) grammar.

Regards,

Marius

On 2018/10/09 09:47, Martin J. Dürst wrote:
> I have been thinking through this. It seems quite appealing.
>
> But I'm concerned there may be some edge cases. I have been able to come
> up with two so far:
>
> - Applying this to a string starting with upper-case SZ (U+1E9E).
> This may change SZ → ß → Ss.
> - Using the 'capitalize' method to (try to) get the titlecase > property of a MTAVRULI character. (There's no other way > currently in Ruby to get the titlecase property.) > > There may be others. If you have some ideas, I'd appreciate to know > about them. > > This lets me wonder why the UTC didn't simply declare the titlecase > property of MTAVRULI to be mkhedruli. Was this considered or not? The > way things are currently set up, there seems to be no benefit of > MTAVRULI being its own titlecase, because in actual use, that requires > additional processing. > > Regards, Martin. From unicode at unicode.org Tue Oct 9 14:49:09 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 9 Oct 2018 12:49:09 -0700 Subject: Dealing with Georgian capitalization in programming languages In-Reply-To: References: <3bc9a840-9518-0fad-46ad-45ac70a5ba3a@att.net> Message-ID: Martin, On 10/9/2018 12:47 AM, Martin J. D?rst via Unicode wrote: > - Using the 'capitalize' method to (try to) get the titlecase > ? property of a MTAVRULI character. (There's no other way > ? currently in Ruby to get the titlecase property.) > > There may be others. If you have some ideas, I'd appreciate to know > about them. > > This lets me wonder why the UTC didn't simply declare the titlecase > property of MTAVRULI to be mkhedruli. Was this considered or not? The > way things are currently set up, there seems to be no benefit of > MTAVRULI being its own titlecase, because in actual use, that requires > additional processing. Titlecasing for Georgian was not completely thought through before Mtavruli was added. As I noted in my earlier comment on this thread, the titlecase mapping values for Mkhredruli were added late in the process, when it became clear that not doing so would result in inappropriate outcomes for existing Mkhredruli text. I don't think there is a fully-worked out position on this, but adding a Simple_Titlecase mapping for Mtavruli to Mkhedruli would, I suspect, just further muddy waters for implementers, because it would be in effect saying that an uppercase letter titlecases by shifting to its lowercase mapping. A headscratcher, at the very least. Note that with the current mappings as they are, Changes_When_Titlecased is False for all Mkhedruli and for all Mtavruli characters, which I think is the desired state of affairs. A titlecasing string operation of Mtavruli that does something other than just leave the string alone should, IMO, be documented as doing something extra and *should* have to do additional processing. --Ken From unicode at unicode.org Wed Oct 10 03:14:12 2018 From: unicode at unicode.org (arno.schmitt via Unicode) Date: Wed, 10 Oct 2018 09:14:12 +0100 Subject: Unicode Arabic Mark Rendering UTR #53 Now Published In-Reply-To: <5BBD0263.2040006@unicode.org> References: <5BBD0263.2040006@unicode.org> Message-ID: <63e9abb7-394b-ed3c-84d6-a39969359f34@gmx.net> The paper adopted treats the word shown (fa-?ul??ika) writing with an unkown letter + kasra below + hamza below. I thought, in Unicode I should use 'ARABIC LETTER YEH WITH HAMZA ABOVE' (U+0626) or its phonological equivilant 'ARABIC LETTER YEH WITH HAMZA BELOW' (U+0826) or the basic letter 'ARABIC LETTER YEH WITH HAMZA' (U+0825). My error or an inconsistency in Unicode? Am 09.10.2018 um 21:32 schrieb announcements at unicode.org: > exampleThe combining classes of Arabic combining characters in Unicode > are different than combining classes in most other scripts. 
They are a > mixture of special classes for specific marks plus two more generalized > classes for all the other marks. This has resulted in inconsistent > and/or incorrect rendering for sequences with multiple combining marks > since Unicode 2.0. > > > The Arabic Mark Transient Reordering Algorithm (AMTRA) described in UTR > #53 is the recommended solution > to achieving correct and consistent rendering of Arabic combining mark > sequences. This algorithm provides results that match user expectations > and assures that canonically equivalent sequences are rendered > identically, independent of the order of the combining marks. > > > The concepts in this algorithm were first proposed four years ago by > Roozbeh Pournader. We are pleased it has now been published as an > official Technical Report. > From unicode at unicode.org Fri Oct 12 05:54:57 2018 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Fri, 12 Oct 2018 10:54:57 +0000 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? Message-ID: Hi Unicode Experts, Suppose base64 encoding is applied to m to yield base64 text t. Next, suppose base64 encoding is applied to m' to yield base64 text t'. If m is not equal to m', then t will not equal t'. In other words, given different inputs, base64 encoding always yields different base64 texts. True or false? How about the opposite direction: If m is base64 encoded to yield t and then t is base64 decoded to yield n, will it always be the case that m equals n? /Roger From unicode at unicode.org Fri Oct 12 06:08:40 2018 From: unicode at unicode.org (J Decker via Unicode) Date: Fri, 12 Oct 2018 04:08:40 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: Message-ID: On Fri, Oct 12, 2018 at 3:57 AM Costello, Roger L. via Unicode < unicode at unicode.org> wrote: > Hi Unicode Experts, > > Suppose base64 encoding is applied to m to yield base64 text t. > > Next, suppose base64 encoding is applied to m' to yield base64 text t'. > > If m is not equal to m', then t will not equal t'. > > In other words, given different inputs, base64 encoding always yields > different base64 texts. > > True or false? > true. base64 to and from is always the same thing. > > How about the opposite direction: If m is base64 encoded to yield t and > then t is base64 decoded to yield n, will it always be the case that m > equals n? > False. Canonical translation may occur which the different base64 may be the same sort of string... https://en.wikipedia.org/wiki/Unicode_equivalence https://en.wikipedia.org/wiki/Canonical_form > /Roger > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 12 11:17:59 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 12 Oct 2018 09:17:59 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or =?UTF-8?Q?false=3F?= Message-ID: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> J Decker wrote: >> How about the opposite direction: If m is base64 encoded to yield t >> and then t is base64 decoded to yield n, will it always be the case >> that m equals n? > > False. > Canonical translation may occur which the different base64 may be the > same sort of string... Base64 is a binary-to-text encoding. 
Neither encoding nor decoding should presume any special knowledge of the meaning of the binary data, or do anything extra based on that presumption. Converting Unicode text to and from base64 should not perform any sort of Unicode normalization, convert between UTFs, insert or remove BOMs, etc. This is like saying that converting a JPEG image to and from base64 should not resize or rescale the image, change its color depth, convert it to another graphic format, etc. So I'd say "true" to Roger's question. I touched on this a little bit in UTN #14, from the standpoint of trying to improve compression by normalizing the Unicode text first. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Oct 12 11:29:29 2018 From: unicode at unicode.org (J Decker via Unicode) Date: Fri, 12 Oct 2018 09:29:29 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> Message-ID: On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode wrote: > J Decker wrote: > > >> How about the opposite direction: If m is base64 encoded to yield t > >> and then t is base64 decoded to yield n, will it always be the case > >> that m equals n? > > > > False. > > Canonical translation may occur which the different base64 may be the > > same sort of string... > > Base64 is a binary-to-text encoding. Neither encoding nor decoding > should presume any special knowledge of the meaning of the binary data, > or do anything extra based on that presumption. > > Converting Unicode text to and from base64 should not perform any sort > of Unicode normalization, convert between UTFs, insert or remove BOMs, > etc. This is like saying that converting a JPEG image to and from base64 > should not resize or rescale the image, change its color depth, convert > it to another graphic format, etc. > > So I'd say "true" to Roger's question. > On the first side (X to base64) definitely true. But there is potential that text resulting from some decoded buffer is translated, resulting in a 'congruent' string that's not exactly the same... and the base64 will be different. Comparing some base64 string with some other base64 string shows a binary difference, but may be still the 'same' string. > > I touched on this a little bit in UTN #14, from the standpoint of trying > to improve compression by normalizing the Unicode text first. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 12 14:26:45 2018 From: unicode at unicode.org (Tex via Unicode) Date: Fri, 12 Oct 2018 12:26:45 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> Message-ID: <007601d46261$8454d990$8cfe8cb0$@xencraft.com> I agree with Doug. Base64 maps each byte of the source string to unique bytes in the destination string. Decoding is also a unique mapping. If the encoded string is ?translated? in some way by additional processes, canonical or otherwise, then all bets are off. If you disagree, please offer an example or additional details of how 2 base64 strings might be equivalent. 
Tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of J Decker via Unicode Sent: Friday, October 12, 2018 9:29 AM To: doug at ewellic.org Cc: Unicode Discussion Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? On Fri, Oct 12, 2018 at 9:23 AM Doug Ewell via Unicode wrote: J Decker wrote: >> How about the opposite direction: If m is base64 encoded to yield t >> and then t is base64 decoded to yield n, will it always be the case >> that m equals n? > > False. > Canonical translation may occur which the different base64 may be the > same sort of string... Base64 is a binary-to-text encoding. Neither encoding nor decoding should presume any special knowledge of the meaning of the binary data, or do anything extra based on that presumption. Converting Unicode text to and from base64 should not perform any sort of Unicode normalization, convert between UTFs, insert or remove BOMs, etc. This is like saying that converting a JPEG image to and from base64 should not resize or rescale the image, change its color depth, convert it to another graphic format, etc. So I'd say "true" to Roger's question. On the first side (X to base64) definitely true. But there is potential that text resulting from some decoded buffer is translated, resulting in a 'congruent' string that's not exactly the same... and the base64 will be different. Comparing some base64 string with some other base64 string shows a binary difference, but may be still the 'same' string. I touched on this a little bit in UTN #14, from the standpoint of trying to improve compression by normalizing the Unicode text first. -- Doug Ewell | Thornton, CO, US | ewellic.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Oct 12 20:12:40 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 13 Oct 2018 03:12:40 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> References: <20181012091759.665a7a7059d7ee80bb4d670165c8327d.2cf01d0deb.wbe@email03.godaddy.com> Message-ID: I also think the reverse is also true ! Decoding a Base64 entity does not warranty it will return valid text in any known encoding. So Unicode normalization of the output cannot apply. Even if it represents text, nothing indicates that the result will be encoded with some Unicode encoding form (unless this is tagged separately, like in MIME). If you use Base64 for decoding MIME contents (e.g. for emails), the Base-64 decoding itself will not transform the encoding, but then the email parser will have to ensure that the text encoding is valid, at which time it will have to transform it (possibly replace some invalid sequences or truncate it), and then only it may apply normalization to help render that text. But these transforms are part of the MIME application and independant of whever you used Base-64 or any another binary encoding or transport syntax. In other words: "If m is not equal to m', then t will not equal t'" is reversible, but nothing indicates that m or m' Base64-decoded are texts, they are just opaque binary objects which are still equal in value like their t or t' Base64-encodings. 
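To make the "opaque binary object" point concrete, here is a small sketch using Python's standard base64 and unicodedata modules; the example strings are made up for illustration, and nothing below is prescribed by any of the specifications mentioned in this thread. Two canonically equivalent spellings of the same text are different octet sequences, so they yield different Base64 texts, and each Base64 text decodes back to exactly the octets that were encoded, with no normalization anywhere in the Base64 step.

    import base64
    import unicodedata

    # The same word in two canonically equivalent spellings.
    nfc = unicodedata.normalize("NFC", "cafe\u0301")   # "caf" + U+00E9
    nfd = unicodedata.normalize("NFD", "cafe\u0301")   # "cafe" + U+0301

    m1 = nfc.encode("utf-8")          # b'caf\xc3\xa9'
    m2 = nfd.encode("utf-8")          # b'cafe\xcc\x81'

    t1 = base64.b64encode(m1)         # b'Y2Fmw6k='
    t2 = base64.b64encode(m2)         # b'Y2FmZcyB'

    # Different octet streams give different Base64 texts ...
    assert m1 != m2 and t1 != t2
    # ... and decoding returns exactly the original octets, unnormalized.
    assert base64.b64decode(t1) == m1
    assert base64.b64decode(t2) == m2

Whether the two decoded byte sequences are later treated as "the same" text is a question for a normalization step layered on top, which is exactly the distinction being drawn here.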
Note: some Base64 envelope formats (like MIME) allow multiple representations t and t' from the same message m, by adding paddings or transport syntaxes like line-splitting (with varaible length). Base64 alone does not allow that variation (it normally uses a static alphabet), but there are variants that accept decoding extended alphabets as binary equivalent. So you may have two MIME-encoded texts that have different encodings (with Base64 or Quopted-Printable, with variable line lengths) but that represent the same source binary object, and decoding these different encoded messages will yeld the same binary object: this does not depend on Base64 but on the permissivity/flexibility of decoders for these envelope formats (using **extensions** of Base64 specific to the envelope format). Le ven. 12 oct. 2018 ? 18:27, Doug Ewell via Unicode a ?crit : > J Decker wrote: > > >> How about the opposite direction: If m is base64 encoded to yield t > >> and then t is base64 decoded to yield n, will it always be the case > >> that m equals n? > > > > False. > > Canonical translation may occur which the different base64 may be the > > same sort of string... > > Base64 is a binary-to-text encoding. Neither encoding nor decoding > should presume any special knowledge of the meaning of the binary data, > or do anything extra based on that presumption. > > Converting Unicode text to and from base64 should not perform any sort > of Unicode normalization, convert between UTFs, insert or remove BOMs, > etc. This is like saying that converting a JPEG image to and from base64 > should not resize or rescale the image, change its color depth, convert > it to another graphic format, etc. > > So I'd say "true" to Roger's question. > > I touched on this a little bit in UTN #14, from the standpoint of trying > to improve compression by normalizing the Unicode text first. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 13 09:16:59 2018 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Sat, 13 Oct 2018 14:16:59 +0000 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: Message-ID: Hi Folks, Thank you for your outstanding responses! Below is a summary of what I learned. Are there any errors in the summary? Is there anything you would add? Please let me know of anything that is not clear. /Roger 1. While base64 encoding is usually applied to binary, it is also sometimes applied to text, such as Unicode text. Note: Since base64 encoding may be applied to both binary and text, in the following bullets I use the more generic term "data". For example, "Data d is base64-encoded to yield ..." 2. Neither base64 encoding nor decoding should presume any special knowledge of the meaning of the data or do anything extra based on that presumption. For example, converting Unicode text to and from base64 should not perform any sort of Unicode normalization, convert between UTFs, insert or remove BOMs, etc. This is like saying that converting a JPEG image to and from base64 should not resize or rescale the image, change its color depth, convert it to another graphic format, etc. If you use base64 for encoding MIME content (e.g. emails), the base64 decoding will not transform the content. 
The email parser must ensure that the content is valid, so the parser might have to transform the content (possibly replacing some invalid sequences or truncating), and then apply Unicode normalization to render the text. These transforms are part of the MIME application and are independent of whether you use base64 or any another encoding or transport syntax. 3. If data d is different than d', then the base64 text resulting from encoding d is different than the base64 text resulting from encoding d'. 4. If base64 text t is different than t', then the data resulting from decoding t is different than the data resulting from decoding t'. 5. For every data d there is exactly one base64 encoding t. 6. Every base64 text t is an encoding of exactly one data d. 7. For all data d, Base64_Decode[Base64_Encode[d]] = d From unicode at unicode.org Sat Oct 13 09:45:10 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 13 Oct 2018 16:45:10 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: Message-ID: You forget that Base64 (as used in MIME) does not follow these rules as it allows multiple different encodings for the same source binary. MIME actually splits a binary object into multiple fragments at random positions, and then encodes these fragments separately. Also MIME uses an extension of Base64 where it allows some variations in the encoding alphabet (so even the same fragment of the same length may have two disting encodings). Base64 in MIME is different from standard Base64 (which never splits the binary object before encoding it, and uses a strict alphabet of 64 ASCII characters, allowing no variation). So MIME requires special handling: the assumpton that a binary message is encoded the same is wrong, but MIME still requires that this non unique Base64 encoding will be decoded back to the same initial (unsplitted) binary object (independantly of its size and independantly of the splitting boundaries used in the transport, which may change during the transport). This also applies to the Base64 encoding used in HTTP transport syntax, and notably in the HTTP/1.1 streaming feature where fragment sizes are also variable. Le sam. 13 oct. 2018 ? 16:27, Costello, Roger L. via Unicode < unicode at unicode.org> a ?crit : > Hi Folks, > > Thank you for your outstanding responses! > > Below is a summary of what I learned. Are there any errors in the summary? > Is there anything you would add? Please let me know of anything that is not > clear. /Roger > > 1. While base64 encoding is usually applied to binary, it is also > sometimes applied to text, such as Unicode text. > > Note: Since base64 encoding may be applied to both binary and text, in the > following bullets I use the more generic term "data". For example, "Data d > is base64-encoded to yield ..." > > 2. Neither base64 encoding nor decoding should presume any special > knowledge of the meaning of the data or do anything extra based on that > presumption. > > For example, converting Unicode text to and from base64 should not perform > any sort of Unicode normalization, convert between UTFs, insert or remove > BOMs, etc. This is like saying that converting a JPEG image to and from > base64 should not resize or rescale the image, change its color depth, > convert it to another graphic format, etc. > > If you use base64 for encoding MIME content (e.g. emails), the base64 > decoding will not transform the content. 
The email parser must ensure that > the content is valid, so the parser might have to transform the content > (possibly replacing some invalid sequences or truncating), and then apply > Unicode normalization to render the text. These transforms are part of the > MIME application and are independent of whether you use base64 or any > another encoding or transport syntax. > > 3. If data d is different than d', then the base64 text resulting from > encoding d is different than the base64 text resulting from encoding d'. > > 4. If base64 text t is different than t', then the data resulting from > decoding t is different than the data resulting from decoding t'. > > 5. For every data d there is exactly one base64 encoding t. > > 6. Every base64 text t is an encoding of exactly one data d. > > 7. For all data d, Base64_Decode[Base64_Encode[d]] = d > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 13 09:51:50 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 13 Oct 2018 16:51:50 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: Message-ID: In summary, two disating implementations are allowed to return different values t and t' of Base64_Encode(d) from the same message d, but both Base64_Decode(t') and Base64_Decode(t) will be equal and will MUST return d exactly. There's an allowed choice of implementation for Base64_Encode() but Base64_Decode() must then be updated to be permissive/flexible and ensure that in all cases, Base64_Decode[Base64_Encode[d]] = d, for every value of d. The reverse is not true because of this flexibility (needed for various transport protocols that have different requirements, notably on the allowed set of characters, and on their maximum line lengths): Base64_Encode[Base64_Decode[t]] = t may be false. Le sam. 13 oct. 2018 ? 16:45, Philippe Verdy a ?crit : > You forget that Base64 (as used in MIME) does not follow these rules as it > allows multiple different encodings for the same source binary. MIME > actually splits a binary object into multiple fragments at random > positions, and then encodes these fragments separately. Also MIME uses an > extension of Base64 where it allows some variations in the encoding > alphabet (so even the same fragment of the same length may have two disting > encodings). > > Base64 in MIME is different from standard Base64 (which never splits the > binary object before encoding it, and uses a strict alphabet of 64 ASCII > characters, allowing no variation). So MIME requires special handling: the > assumpton that a binary message is encoded the same is wrong, but MIME > still requires that this non unique Base64 encoding will be decoded back to > the same initial (unsplitted) binary object (independantly of its size and > independantly of the splitting boundaries used in the transport, which may > change during the transport). > > This also applies to the Base64 encoding used in HTTP transport syntax, > and notably in the HTTP/1.1 streaming feature where fragment sizes are also > variable. > > > Le sam. 13 oct. 2018 ? 16:27, Costello, Roger L. via Unicode < > unicode at unicode.org> a ?crit : > >> Hi Folks, >> >> Thank you for your outstanding responses! >> >> Below is a summary of what I learned. Are there any errors in the >> summary? Is there anything you would add? Please let me know of anything >> that is not clear. /Roger >> >> 1. 
While base64 encoding is usually applied to binary, it is also >> sometimes applied to text, such as Unicode text. >> >> Note: Since base64 encoding may be applied to both binary and text, in >> the following bullets I use the more generic term "data". For example, >> "Data d is base64-encoded to yield ..." >> >> 2. Neither base64 encoding nor decoding should presume any special >> knowledge of the meaning of the data or do anything extra based on that >> presumption. >> >> For example, converting Unicode text to and from base64 should not >> perform any sort of Unicode normalization, convert between UTFs, insert or >> remove BOMs, etc. This is like saying that converting a JPEG image to and >> from base64 should not resize or rescale the image, change its color depth, >> convert it to another graphic format, etc. >> >> If you use base64 for encoding MIME content (e.g. emails), the base64 >> decoding will not transform the content. The email parser must ensure that >> the content is valid, so the parser might have to transform the content >> (possibly replacing some invalid sequences or truncating), and then apply >> Unicode normalization to render the text. These transforms are part of the >> MIME application and are independent of whether you use base64 or any >> another encoding or transport syntax. >> >> 3. If data d is different than d', then the base64 text resulting from >> encoding d is different than the base64 text resulting from encoding d'. >> >> 4. If base64 text t is different than t', then the data resulting from >> decoding t is different than the data resulting from decoding t'. >> >> 5. For every data d there is exactly one base64 encoding t. >> >> 6. Every base64 text t is an encoding of exactly one data d. >> >> 7. For all data d, Base64_Decode[Base64_Encode[d]] = d >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 13 11:50:19 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Sat, 13 Oct 2018 18:50:19 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: Message-ID: <20181013165019.sxGzV%steffen@sdaoden.eu> Philippe Verdy via Unicode wrote in : |You forget that Base64 (as used in MIME) does not follow these rules \ |as it allows multiple different encodings for the same source binary. \ |MIME actually |splits a binary object into multiple fragments at random positions, \ |and then encodes these fragments separately. Also MIME uses an extension \ |of Base64 |where it allows some variations in the encoding alphabet (so even the \ |same fragment of the same length may have two disting encodings). | |Base64 in MIME is different from standard Base64 (which never splits \ |the binary object before encoding it, and uses a strict alphabet of \ |64 ASCII |characters, allowing no variation). So MIME requires special handling: \ |the assumpton that a binary message is encoded the same is wrong, but \ |MIME still |requires that this non unique Base64 encoding will be decoded back \ |to the same initial (unsplitted) binary object (independantly of its \ |size and |independantly of the splitting boundaries used in the transport, which \ |may change during the transport). Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies). 
It is a content-transfer-encoding and encodes any data transparently into a 7 bit clean ASCII _and_ EBCDIC compatible (the authors commemorate that) text. When decoding it reverts this representation into its original form. Ok, there is the CRLF newline problem, as below. What do you mean by "splitting"? ... The only variance is described as: Care must be taken to use the proper octets for line breaks if base64 encoding is applied directly to text material that has not been converted to canonical form. In particular, text line breaks must be converted into CRLF sequences prior to base64 encoding. The important thing to note is that this may be done directly by the encoder rather than in a prior canonicalization step in some implementations. This is MIME, it specifies (in the same RFC): 2.10. Lines "Lines" are defined as sequences of octets separated by a CRLF sequences. This is consistent with both RFC 821 and RFC 822. "Lines" only refers to a unit of data in a message, which may or may not correspond to something that is actually displayed by a user agent. and furthermore 6.5. Translating Encodings The quoted-printable and base64 encodings are designed so that conversion between them is possible. The only issue that arises in such a conversion is the handling of hard line breaks in quoted- printable encoding output. When converting from quoted-printable to base64 a hard line break in the quoted-printable form represents a CRLF sequence in the canonical form of the data. It must therefore be converted to a corresponding encoded CRLF in the base64 form of the data. Similarly, a CRLF sequence in the canonical form of the data obtained after base64 decoding must be converted to a quoted- printable hard line break, but ONLY when converting text data. So we go over 6.6. Canonical Encoding Model There was some confusion, in the previous versions of this RFC, regarding the model for when email data was to be converted to canonical form and encoded, and in particular how this process would affect the treatment of CRLFs, given that the representation of newlines varies greatly from system to system, and the relationship between content-transfer-encodings and character sets. A canonical model for encoding is presented in RFC 2049 for this reason. to RFC 2049 where we find For example, in the case of text/plain data, the text must be converted to a supported character set and lines must be delimited with CRLF delimiters in accordance with RFC 822. Note that the restriction on line lengths implied by RFC 822 is eliminated if the next step employs either quoted-printable or base64 encoding. and, later Conversion from entity form to local form is accomplished by reversing these steps. Note that reversal of these steps may produce differing results since there is no guarantee that the original and final local forms are the same. and, later NOTE: Some confusion has been caused by systems that represent messages in a format which uses local newline conventions which differ from the RFC822 CRLF convention. It is important to note that these formats are not canonical RFC822/MIME. These formats are instead *encodings* of RFC822, where CRLF sequences in the canonical representation of the message are encoded as the local newline convention. Note that formats which encode CRLF sequences as, for example, LF are not capable of representing MIME messages containing binary data which contains LF octets not part of CRLF line separation sequences. Whoever understands this emojibake. 
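The CRLF rule quoted above is easy to see in a short sketch (Python is used purely for illustration; the canonical-form requirement itself comes from RFC 2045, not from this code): the "same" two-line text encoded once with bare LF line ends and once with canonical CRLF line ends produces two different Base64 strings, which is why the conversion to CRLF has to happen before encoding if independently produced encodings of text material are expected to agree.

    import base64

    text = "Hello\nworld\n"                    # local convention: bare LF line ends
    canonical = text.replace("\n", "\r\n")     # RFC 2045 canonical form: CRLF line ends

    t_local = base64.b64encode(text.encode("us-ascii"))
    t_canon = base64.b64encode(canonical.encode("us-ascii"))

    print(t_local)    # b'SGVsbG8Kd29ybGQK'
    print(t_canon)    # b'SGVsbG8NCndvcmxkDQo='
    assert t_local != t_canon   # same "text", different octets, different Base64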
My MUA still gnaws at antiquated structures (i am too lazy), but in quoted-printable we encode CRLF in the raw text to "=0D=0A=", i.e., a trailing soft line break so that data is decoded as plain CRLF again. Something like that it should be i think. --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Sat Oct 13 18:37:35 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 14 Oct 2018 01:37:35 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <20181013165019.sxGzV%steffen@sdaoden.eu> References: <20181013165019.sxGzV%steffen@sdaoden.eu> Message-ID: Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < unicode at unicode.org> a ?crit : > Philippe Verdy via Unicode wrote in w9+jEARW4Ghyk8hg at mail.gmail.com>: > |You forget that Base64 (as used in MIME) does not follow these rules \ > |as it allows multiple different encodings for the same source binary. \ > |MIME actually > |splits a binary object into multiple fragments at random positions, \ > |and then encodes these fragments separately. Also MIME uses an extension > \ > |of Base64 > |where it allows some variations in the encoding alphabet (so even the \ > |same fragment of the same length may have two disting encodings). > | > |Base64 in MIME is different from standard Base64 (which never splits \ > |the binary object before encoding it, and uses a strict alphabet of \ > |64 ASCII > |characters, allowing no variation). So MIME requires special handling: \ > |the assumpton that a binary message is encoded the same is wrong, but \ > |MIME still > |requires that this non unique Base64 encoding will be decoded back \ > |to the same initial (unsplitted) binary object (independantly of its \ > |size and > |independantly of the splitting boundaries used in the transport, which \ > |may change during the transport). > > Base64 is defined in RFC 2045 (Multipurpose Internet Mail > Extensions (MIME) Part One: Format of Internet Message Bodies). > It is a content-transfer-encoding and encodes any data > transparently into a 7 bit clean ASCII _and_ EBCDIC compatible > (the authors commemorate that) text. > When decoding it reverts this representation into its original form. > Ok, there is the CRLF newline problem, as below. > What do you mean by "splitting"? > > ... > The only variance is described as: > > Care must be taken to use the proper octets for line breaks if base64 > encoding is applied directly to text material that has not been > converted to canonical form. In particular, text line breaks must be > converted into CRLF sequences prior to base64 encoding. The > important thing to note is that this may be done directly by the > encoder rather than in a prior canonicalization step in some > implementations. > > This is MIME, it specifies (in the same RFC): I've not spoken aboutr the encoding of new lines **in the actual encoded text**: - if their existing text-encoding ever gets converted to Base64 as if the whole text was an opaque binary object, their initial text-encoding will be preserved (so yes it will preserve the way these embedded newlines are encoded as CR, LF, CR+LF, NL...) I spoke about newlines used in the transport syntax to split the initial binary object (which may actually contain text but it does not matter). 
MIME defines this operation and even requires splitting the binary object in fragments with maximum binary size so that these binary fragments can be converted with Base64 into lines with maximum length. In the MIME Base64 representation you can insert newlines anywhere between fragments encoded separately. The maximum size of fragment is not fixed (it is usually about 60 binary octets, that are converted to lines of 80 ASCII characters, followed by a newline (CR+LF is strongly suggested for MIME, but it is admitted to use other newline sequences). Email forwarding agents frequently needed these line lengths to process the mail properly (not just the MIME headers but as well the content body, where they want at least some whitespace or newline in the middle where they can freely rearrange the line lines by compressing whitespaces or splitting lines to shorter length as necessary to their processing; this is much less frequent today because most mail agents are 8-bit clean and allow arbitrary line lengths... except in MIME headers). In MIME headers the situation is different, there's really a maximum line-length there, and if a header is too long, it has to be split on multiple lines (using continuation sequences, i.e. a newline (CR+LF is standard here) followed by at least one space (this insertion/change/removal of whitespaces is permitted everywhere in the MIME header after the header type, but even before the colon that follows the header type). So a MIME header value whose included text gets encoded with Base64 will be split using "=?" sequences starting the indication that the fragment is Base64 encoded (instead of being QuotedPrintable-encoded) and then a separator and the encapsulated Base-64 encoding of a fragment, and a single header may have multiple Base64-encoded fragments in the same header value, and there's large freedom about where to split the value to isolate fragments with convenient size that satisfies the MIME requirements. These multiple fragemetns may then occur on the same line (separated by whitespace) or on multiple line (separated by continuation sequences). In that case, the same initial text can have multiple valid representation in a MIME envelope format using Base64: it is not Base64 itself that splits the message, but the MIME transport syntax (which itself does not alter the initial text-encoding of the initial text... except in parts that are NOT binary-encoded using Base64 or QuotedPrintable). We are in a case where Base64 is not applied uniquely, because it is driven not by the actual transported text, but by the MIME transport syntax, and MIME allows freely changing the Base64 fragment sizes (or even switch to another encoding) as long as it preserves the binary value of the embedded object, and also to change the text-encoding (UTF-8, ISO 8859-*, etc.) if encoded fragments are identified to actually contain text (this does not apply to content bodies, unless they are declared with a "text/*" MIME type in the headers; but this applies for known headers whose value is necessarily a text type (such as in headers with types "From:", "To:", "Cc:", "Subject:", "Date:" ...) MIME defines two distinct syntaxes, one for declaration headers, another for content bodies. Each one can use Base64 encoding and split the content (but differently). 
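As a concrete illustration of the header case described above, here is a sketch of RFC 2047 "B" (Base64) encoded-words using Python's standard base64 and email.header modules; the helper functions encoded_word and decode are made up for the example and are not part of any standard API. The same Subject text can be carried as one encoded-word or split into two at a point chosen by the encoder; the wire forms differ, but a conforming decoder recovers the same text from both.

    import base64
    from email.header import decode_header

    subject = "Grüße aus Zürich"

    def encoded_word(fragment):
        # Wrap one fragment as an RFC 2047 "B" (Base64) encoded-word.
        b64 = base64.b64encode(fragment.encode("utf-8")).decode("ascii")
        return "=?UTF-8?B?" + b64 + "?="

    # Two different splittings of the same Subject into encoded-words.
    header_a = encoded_word(subject)
    header_b = encoded_word("Grüße ") + " " + encoded_word("aus Zürich")
    assert header_a != header_b          # the encoded forms differ ...

    def decode(header):
        # Join the decoded fragments; whitespace between two adjacent
        # encoded-words is ignored, per RFC 2047.
        return "".join(
            part.decode(charset or "ascii") if isinstance(part, bytes) else part
            for part, charset in decode_header(header)
        )

    assert decode(header_a) == decode(header_b) == subject   # ... the decoded text does not

Line folding in a real message adds yet another degree of freedom on top of this, which is the point being made about the transport syntax rather than about Base64 itself.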
HTTP also has a mechanism for splitting a large body into fragments (this allows notably to create streaming protocols where fragments can be easily multiplexed with parallel streams, or to include digital fingerprints or security signatures for individual fragments to secure the stream. This fragmentation is independant of the network transport (generally TCP, but not only) which has its own transparent MTUs at session layer, link layers, and also can be itself be encapsulated through tunnels transported by other means with different MTUs and fragmentation : HTTP does not have to manage that lower layer). Both MIME (for mails) and HTTP define allowed transformations to drive how Base64 will be used. Both have enough flexibility to allow variable fragment sizes, and even allow them to be changed as needed for the transport (this is challending for data signatures of the exchanged contents, but both MIME and HTTP can safely preserve the content without breaking these signatures in the middle): the recipient may not recieve exactly the same Base-64 encoded message, but it will get the same message content (once it is Base64 decoded) Base64 is used exactly to support this flexibility in transport (or storage) without altering any bit of the initial content once it is decoded. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 13 19:02:59 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 14 Oct 2018 01:02:59 +0100 Subject: Fallback for Sinhala Consonant Clusters Message-ID: <20181014010259.4fb5436a@JRWUBU2> Are there fallback rules for Sinhala consonant clusters? There are fallback rules for Devanagari, but I'm not sure if they read across. The problem I am seeing is that the Pali syllable 'ndhe' ????? is being rendered identically to a hypothetical Sinhalese 'n?dha' ??? , which in NFD is , when I use a font that lacks the conjunct. (Most fonts lack the conjunct.) The Devanagari rules and my preference would lead to a fallback rendering as ???? (Sinhalese 'ndhe'), which is encoded as . Is the rendering I am getting technically wrong, or is it merely undesirable? The ambiguity arises in part because, like the Brahmi script, the Sinhala script uses its virama character as a vowel length indicator. Missing touching consonants are being rendered almost as though there were no ZWJ, but the combination of consonant and al-lakuna is being rendered badly. Richard. From unicode at unicode.org Sat Oct 13 20:39:04 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Sun, 14 Oct 2018 03:39:04 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> Message-ID: <20181014013904.idfomqt5s65wnqro@angband.pl> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > unicode at unicode.org> a ?crit : > > The only variance is described as: > > > > Care must be taken to use the proper octets for line breaks if base64 > > encoding is applied directly to text material that has not been > > converted to canonical form. In particular, text line breaks must be > > converted into CRLF sequences prior to base64 encoding. The > > important thing to note is that this may be done directly by the > > encoder rather than in a prior canonicalization step in some > > implementations. 
> > > > This is MIME, it specifies (in the same RFC): > > I've not spoken aboutr the encoding of new lines **in the actual encoded > text**: > - if their existing text-encoding ever gets converted to Base64 as if the > whole text was an opaque binary object, their initial text-encoding will be > preserved (so yes it will preserve the way these embedded newlines are > encoded as CR, LF, CR+LF, NL...) > > I spoke about newlines used in the transport syntax to split the initial > binary object (which may actually contain text but it does not matter). > MIME defines this operation and even requires splitting the binary object > in fragments with maximum binary size so that these binary fragments can be > converted with Base64 into lines with maximum length. In the MIME Base64 > representation you can insert newlines anywhere between fragments encoded > separately. There's another kind of fragmentation that can make the encoding differ (but still decode to the same payload): The data stream gets split into 3-byte internal, 4-byte external packets. Any packet may contain less than those 3 bytes, in which cases it is padded with = characters: 3 bytes XXXX 2 bytes XXX= 1 byte XX== Usually, such smaller packets happen only at the end of a message, but to support encoding a stream piecewise, they are allowed at any point. For example: "meow" is bWVvdw== "me""ow" is bWU=b3c= yet both carry the same payload. > Base64 is used exactly to support this flexibility in transport (or > storage) without altering any bit of the initial content once it is > decoded. Right, any such variations are in packaging only. ???? -- ??????? ??????? 10 people enter a bar: 1 who understands binary, ??????? 1 who doesn't, D who prefer to write it as hex, ??????? and 1 who narrowly avoided an off-by-one error. From unicode at unicode.org Sun Oct 14 03:15:26 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Sun, 14 Oct 2018 17:15:26 +0900 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <20181014010259.4fb5436a@JRWUBU2> References: <20181014010259.4fb5436a@JRWUBU2> Message-ID: <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp> Hello Richard, On 2018/10/14 09:02, Richard Wordingham via Unicode wrote: > Are there fallback rules for Sinhala consonant clusters? There are > fallback rules for Devanagari, but I'm not sure if they read across. > > The problem I am seeing is that the Pali syllable 'ndhe' ????? NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9 > KOMBUVA> Let's label this as (1) > is being rendered identically to a hypothetical Sinhalese > 'n?dha' ??? , It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1). Your mail is written as if you are speaking about a general phenomenon, but I guess there are differences depending on the font and rendering stack. > which in NFD is > , when I use a font that lacks the > conjunct. (Most fonts lack the conjunct.) The Devanagari rules and my > preference would lead to a fallback rendering as ???? (Sinhalese > 'ndhe'), Here, this (3) looks like it has the same three components as (2), but the first two are exchanged, so that the piece that looks like @ is now in the middle (it was at the left in (1) and (2)). Hope this helps. Regards, Martin. > which is encoded as MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>. Is the rendering I am getting > technically wrong, or is it merely undesirable? 
> > The ambiguity arises in part because, like the Brahmi script, the > Sinhala script uses its virama character as a vowel length indicator. > > Missing touching consonants are being rendered almost as though there > were no ZWJ, but the combination of consonant and al-lakuna is being > rendered badly. > > Richard. > > . > -- Prof. Dr.sc. Martin J. D?rst Department of Intelligent Information Technology College of Science and Engineering Aoyama Gakuin University Fuchinobe 5-1-10, Chuo-ku, Sagamihara 252-5258 Japan From unicode at unicode.org Sun Oct 14 03:41:28 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 14 Oct 2018 10:41:28 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <20181014013904.idfomqt5s65wnqro@angband.pl> References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> Message-ID: Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails). So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries). Also: - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol) - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol) So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload. Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start): - Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream; - for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input). - these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself. 
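To illustrate the octets-versus-bits point in code: Base64 itself only sees whole octets, so a bit string whose length is not a multiple of 8 needs some extra, privately agreed convention. The sketch below (Python, with a made-up convention of recording the pad-bit count in a leading byte; nothing here is standardized) shows one way the two ends could agree to do it, with the bitstream layer sitting on top of the plain Base64 codec.

    import base64

    def encode_bits(bits):
        # Ad hoc convention: pad with zero bits to an octet boundary and
        # record the number of padding bits in a leading byte.
        pad = (-len(bits)) % 8
        padded = bits + "0" * pad
        octets = bytes([pad]) + bytes(
            int(padded[i:i + 8], 2) for i in range(0, len(padded), 8)
        )
        return base64.b64encode(octets).decode("ascii")

    def decode_bits(text):
        octets = base64.b64decode(text)
        pad, payload = octets[0], octets[1:]
        bits = "".join(format(b, "08b") for b in payload)
        return bits[:len(bits) - pad] if pad else bits

    sample = "1011011010111"     # 13 bits: not a whole number of octets
    assert decode_bits(encode_bits(sample)) == sample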
You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done: - as a separate payload (not encoded by Base64), or - by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding. - by appending 3 bits at end of the bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added) Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload. All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and to support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. And this is the case for the MIME transport by mail, or for HTTP and FTP transports, or file/media storage formats even if the file is stored on the same system, because it may actually be a copy stored locally but coming from another system where the file was actually encoded). Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it actually does not mandate an exact bit-length because the range does not fully fit exactly to 21 bits and an encoder can still pack multiple code points together into more compact code units. However Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder. But many other encodings are still possible (and can be conforming to Unicode, provided they preserve each Unicode scalar value, or at least the code point identity because an encoder/decoder is not required to support non-character code points such as surrogates or U+FFFE), where Base64 may be used for internally generated octets-streams. Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode a ?crit : > On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > > unicode at unicode.org> a ?crit : > > > The only variance is described as: > > > > > > Care must be taken to use the proper octets for line breaks if base64 > > > encoding is applied directly to text material that has not been > > > converted to canonical form. 
In particular, text line breaks must be > > > converted into CRLF sequences prior to base64 encoding. The > > > important thing to note is that this may be done directly by the > > > encoder rather than in a prior canonicalization step in some > > > implementations. > > > > > > This is MIME, it specifies (in the same RFC): > > > > I've not spoken aboutr the encoding of new lines **in the actual encoded > > text**: > > - if their existing text-encoding ever gets converted to Base64 as if > the > > whole text was an opaque binary object, their initial text-encoding will > be > > preserved (so yes it will preserve the way these embedded newlines are > > encoded as CR, LF, CR+LF, NL...) > > > > I spoke about newlines used in the transport syntax to split the initial > > binary object (which may actually contain text but it does not matter). > > MIME defines this operation and even requires splitting the binary object > > in fragments with maximum binary size so that these binary fragments can > be > > converted with Base64 into lines with maximum length. In the MIME Base64 > > representation you can insert newlines anywhere between fragments encoded > > separately. > > There's another kind of fragmentation that can make the encoding differ > (but > still decode to the same payload): > > The data stream gets split into 3-byte internal, 4-byte external packets. > Any packet may contain less than those 3 bytes, in which cases it is padded > with = characters: > 3 bytes XXXX > 2 bytes XXX= > 1 byte XX== > > Usually, such smaller packets happen only at the end of a message, but to > support encoding a stream piecewise, they are allowed at any point. > > For example: > "meow" is bWVvdw== > "me""ow" is bWU=b3c= > yet both carry the same payload. > > > Base64 is used exactly to support this flexibility in transport (or > > storage) without altering any bit of the initial content once it is > > decoded. > > Right, any such variations are in packaging only. > > > ???? > -- > ??????? > ??????? 10 people enter a bar: 1 who understands binary, > ??????? 1 who doesn't, D who prefer to write it as hex, > ??????? and 1 who narrowly avoided an off-by-one error. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 14 06:44:56 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 14 Oct 2018 12:44:56 +0100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp> References: <20181014010259.4fb5436a@JRWUBU2> <5284f868-e642-3be1-bb91-b5a65d93a8de@it.aoyama.ac.jp> Message-ID: <20181014124456.459cdef0@JRWUBU2> On Sun, 14 Oct 2018 17:15:26 +0900 "Martin J. D?rst via Unicode" wrote: > Hello Richard, > > On 2018/10/14 09:02, Richard Wordingham via Unicode wrote: > > Are there fallback rules for Sinhala consonant clusters? There are > > fallback rules for Devanagari, but I'm not sure if they read across. > > > > The problem I am seeing is that the Pali syllable 'ndhe' ????? > > > DAYANNA, U+0DD9 > > KOMBUVA> > > Let's label this as (1) > > > is being rendered identically to a hypothetical Sinhalese > > 'n?dha' ??? , > > It (2) doesn't look identically to (1) here (Thunderbird on Win 8.1). > > Your mail is written as if you are speaking about a general > phenomenon, but I guess there are differences depending on the font > and rendering stack. The critical one is whether the font has the conjunct. The default Sinhala font on supported Windows, Iskoola Pota, has the conjunct. 
For an example that should illustrate my points with that font (at least, as on Windows 7) and the HarfBuzz renderer (as I believe in Thunderbird), we have 1') Pali thve ????? It's a very rare syllable - it only occurs in sandhi, and I have only a single example. Iskoola Pota has neither the conjunct nor the touching form; I would actually expect it to be the touching form that exists. 2') Misleading look-alike th?va ??? 3') Preferred fallback appearance thve ???? . My question is, 'What should a rendering stack that claims to support the Sinhala script display when it lacks the conjunct in the font being used?' Now what does get displayed does depend on the rendering stack. HarfBuzz (e.g. Firefox, Google Chrome, LibreOffice, and most Linux) and Notepad on Windows 7 move the vowel to the left and display al-lakuna, the display I object to. iPhone and Notepad on Windows 10 display the vowel in the middle and display al-lakuna (possibly ligated), which is the solution I prefer. > Hope this helps. Well, it has prompted me to find a 'me-too' argument for improving the rendering. I wanted a standards-based argument. >> Missing touching consonants are being rendered almost as though >> there were no ZWJ, but the combination of consonant and al-lakuna >> is being rendered badly. This looks like a common font problem. Iskoola Pota does not suffer from it. Richard. From unicode at unicode.org Sun Oct 14 09:55:24 2018 From: unicode at unicode.org (Harshula via Unicode) Date: Mon, 15 Oct 2018 01:55:24 +1100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <20181014010259.4fb5436a@JRWUBU2> References: <20181014010259.4fb5436a@JRWUBU2> Message-ID: <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> Hi Richard, 1) From a pronunciation perspective, your first and third examples will be similar. Your second example will be pronounced very differently. I did some quick testing on Linux and reproduced the behaviour that you observed. 2) Going back more than a decade, the state tables used by some layout/shaping engines used the same 'virama' rules for North Indian scripts and Sinhala. This resulted in undesirable *implicit* conjuncts being created for Sinhala consonant clusters. That then resulted in undesirable positioning of dependent vowels. e.g. https://bugzilla.gnome.org/show_bug.cgi?id=161981 3) However, what you have observed is an issue with *explicit* conjunct creation. After the segmentation is completed, the layout/shaping engine needs to first check if there is a corresponding lookup for the explicit conjunct, if not, then it needs to remove the ZWJ and redo the segmentation and lookup(s). Perhaps that is not happening in Harfbuzz. 4) I've been out of the loop for many years, so I have CC'd Ruvan & Harsha who may already be aware of what you have observed. cya, # On 14/10/18 11:02 am, Richard Wordingham via Unicode wrote: > Are there fallback rules for Sinhala consonant clusters? There are > fallback rules for Devanagari, but I'm not sure if they read across. > > The problem I am seeing is that the Pali syllable 'ndhe' ????? NAYANNA, U+0DCA AL-LAKUNA, 200D ZWJ, U+0DB0 MAHAPRAANA DAYANNA, U+0DD9 > KOMBUVA> is being rendered identically to a hypothetical Sinhalese > 'n?dha' ??? , which in NFD is > , when I use a font that lacks the > conjunct. (Most fonts lack the conjunct.) The Devanagari rules and my > preference would lead to a fallback rendering as ???? (Sinhalese > 'ndhe'), which is encoded as MAHAPRAANA DAYANNA, U+0DD9 KOMBUVA>. 
Is the rendering I am getting > technically wrong, or is it merely undesirable? > > The ambiguity arises in part because, like the Brahmi script, the > Sinhala script uses its virama character as a vowel length indicator. > > Missing touching consonants are being rendered almost as though there > were no ZWJ, but the combination of consonant and al-lakuna is being > rendered badly. > > Richard. > From unicode at unicode.org Sun Oct 14 14:10:45 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sun, 14 Oct 2018 13:10:45 -0600 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? Message-ID: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> Steffen Nurpmeso wrote: > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions > (MIME) Part One: Format of Internet Message Bodies). Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data Encodings." RFC 2045 defines a particular implementation of base64, specific to transporting Internet mail in a 7-bit environment. RFC 4648 discusses many of the "higher-level protocol" topics that some people are focusing on, such as separating the base64-encoded output into lines of length 72 (or other), alternative target code unit sets or "alphabets," and padding characters. It would be helpful for everyone to read this particular RFC before concluding that these topics have not been considered, or that they compromise round-tripping or other characteristics of base64. I had assumed that when Roger asked about "base64 encoding," he was asking about the basic definition of base64. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sun Oct 14 16:50:52 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 14 Oct 2018 23:50:52 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> Message-ID: It's also interesting to look at https://tools.ietf.org/html/rfc3501 - which defines (for IMAP v4) another "BASE64" encoding, - and also defines a "Modified UTF-7" encoding using it, deviating from Unicode's definition of UTF-7, - and adding other requirements (which forbids alternate encodings permitted in UTF-7 and all other Base64 variants, including those used in MIME/RFC 2045 or SMTP, used in strong relations with IMAP !). And nothing in RFC 4648 is clear about the fact that it only covers the encoding of "octets streams" and not "bits streams". It also does not discuss the adaptation for "Base64" for transport and storage (needed for MIME, IMAP, but also in HTTP, and in several file/data formats including XML, or digital signatures). That RFC 4648 is only superficial, and does not cover everything (even Unicode has its own definition for UTF-7 and also allows variations). As we are on this Unicode list, the definition used by Unicode (more in line with MIME), does not follow at all those in RFC 4648. Most uses of Base64 encodings are based on the original MIME definition, and all of them perform new adaptations. (Even the definition of "Base16" in RFC4648 contradicts most other definitions). Le dim. 14 oct. 2018 ? 21:21, Doug Ewell via Unicode a ?crit : > Steffen Nurpmeso wrote: > > > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions > > (MIME) Part One: Format of Internet Message Bodies). 
> > Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data > Encodings." RFC 2045 defines a particular implementation of base64, > specific to transporting Internet mail in a 7-bit environment. > > RFC 4648 discusses many of the "higher-level protocol" topics that some > people are focusing on, such as separating the base64-encoded output > into lines of length 72 (or other), alternative target code unit sets or > "alphabets," and padding characters. It would be helpful for everyone to > read this particular RFC before concluding that these topics have not > been considered, or that they compromise round-tripping or other > characteristics of base64. > > I had assumed that when Roger asked about "base64 encoding," he was > asking about the basic definition of base64. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > Wrong, this is "specific" to transporting Internet mail in any 7-bit or 8-bit environment (today almost all mail agents are operating in 8 bit), and it is then referenced directly by HTTP (and its HTTPS variant). So it is not so "specific". MIME is extremely popular, RFC 4648 is extremely exotic (and RFC 4648 is wrong when saying that IMAP is very specific, as it is now a very popular protocol, widely used as well). MIME is so frequently used that almost all people refer to it when they look for Base64, and do not explicitly state that another definition (found in a more exotic RFC) is being used. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 14 20:56:15 2018 From: unicode at unicode.org (Tex via Unicode) Date: Sun, 14 Oct 2018 18:56:15 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> Message-ID: <000601d4642a$4274ec70$c75ec550$@xencraft.com> Philippe, Where is the use of whitespace, or the idea that 1-byte pieces do not need all the equal-sign padding, documented? I read the RFC 3501 you pointed at; I don't see it there. Are these part of any standards? Or are you claiming these are practices despite the standards? If so, are these just tolerated by parsers, or are they actually generated by encoders? What would be the rationale for supporting unnecessary whitespace? If linebreaks are forced at some line length they can presumably be removed at that length and not treated as part of the encoding.
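On the question of what parsers actually tolerate, here is what one widely deployed decoder does, offered only as a data point (a Python sketch; the behaviour shown is that of Python's standard base64 module, not a definition of Base64 itself). Its default mode discards octets outside the alphabet, which lines up with the RFC 2045 rule that non-alphabet characters in base64-encoded data are to be ignored; its strict mode rejects them, which lines up with the RFC 4648 requirement that a decoder reject non-alphabet input unless the referring specification says otherwise.

    import base64
    import binascii

    wrapped = b"SGVsbG8g\r\nd29ybGQ="   # valid Base64 with a line break forced into it

    # Permissive mode (the default): the CR and LF are simply discarded.
    print(base64.b64decode(wrapped))               # b'Hello world'

    # Strict mode: any octet outside the Base64 alphabet is an error.
    try:
        base64.b64decode(wrapped, validate=True)
    except binascii.Error as exc:
        print("rejected:", exc)

So the line breaks are generated by the enclosing format (RFC 2045 caps encoded lines at 76 characters) and are then either stripped by that format or tolerated by the decoder, depending on which specification the implementation is following.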
Maybe we differ on define where the encoding begins and ends, and where higher level protocols prescribe how they are embedded within the protocol. Tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode Sent: Sunday, October 14, 2018 1:41 AM To: Adam Borowski Cc: unicode Unicode Discussion Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails). So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries). Also: - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol) - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol) So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload. Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start): - Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream; - for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input). - these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself. You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done: - as a separate payload (not encoded by Base64), or - by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding. - by appending 3 bits at end of the bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. 
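(A sketch of the bit-stream scheme outlined in the quoted list above, assuming Python 3's base64 module. It simplifies by recording the 0-7 padding-bit count in a whole leading byte rather than in 3 prepended bits, and the function names are invented for the example.)

    import base64

    def encode_bits(bits: str) -> bytes:
        """Base64-encode an arbitrary-length string of '0'/'1' characters."""
        pad = (-len(bits)) % 8
        padded = bits + "0" * pad                       # pad up to a byte boundary
        octets = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
        return base64.b64encode(bytes([pad]) + octets)  # leading byte carries the pad count

    def decode_bits(encoded: bytes) -> str:
        raw = base64.b64decode(encoded)
        pad, octets = raw[0], raw[1:]
        bits = "".join(format(b, "08b") for b in octets)
        return bits[:len(bits) - pad] if pad else bits  # discard the padding bits again

    assert decode_bits(encode_bits("10110")) == "10110"
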
Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added) Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload. All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and to support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. And this is the case for the MIME transport by mail, or for HTTP and FTP transports, or file/media storage formats even if the file is stored on the same system, because it may actually be a copy stored locally but coming from another system where the file was actually encoded). Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it actually does not mandate an exact bit-length because the range does not fully fit exactly to 21 bits and an encoder can still pack multiple code points together into more compact code units. However Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder. But many other encodings are still possible (and can be conforming to Unicode, provided they preserve each Unicode scalar value, or at least the code point identity because an encoder/decoder is not required to support non-character code points such as surrogates or U+FFFE), where Base64 may be used for internally generated octets-streams. Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode a ?crit : On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > unicode at unicode.org> a ?crit : > > The only variance is described as: > > > > Care must be taken to use the proper octets for line breaks if base64 > > encoding is applied directly to text material that has not been > > converted to canonical form. In particular, text line breaks must be > > converted into CRLF sequences prior to base64 encoding. The > > important thing to note is that this may be done directly by the > > encoder rather than in a prior canonicalization step in some > > implementations. 
> > > > This is MIME, it specifies (in the same RFC): > > I've not spoken aboutr the encoding of new lines **in the actual encoded > text**: > - if their existing text-encoding ever gets converted to Base64 as if the > whole text was an opaque binary object, their initial text-encoding will be > preserved (so yes it will preserve the way these embedded newlines are > encoded as CR, LF, CR+LF, NL...) > > I spoke about newlines used in the transport syntax to split the initial > binary object (which may actually contain text but it does not matter). > MIME defines this operation and even requires splitting the binary object > in fragments with maximum binary size so that these binary fragments can be > converted with Base64 into lines with maximum length. In the MIME Base64 > representation you can insert newlines anywhere between fragments encoded > separately. There's another kind of fragmentation that can make the encoding differ (but still decode to the same payload): The data stream gets split into 3-byte internal, 4-byte external packets. Any packet may contain less than those 3 bytes, in which cases it is padded with = characters: 3 bytes XXXX 2 bytes XXX= 1 byte XX== Usually, such smaller packets happen only at the end of a message, but to support encoding a stream piecewise, they are allowed at any point. For example: "meow" is bWVvdw== "me""ow" is bWU=b3c= yet both carry the same payload. > Base64 is used exactly to support this flexibility in transport (or > storage) without altering any bit of the initial content once it is > decoded. Right, any such variations are in packaging only. ???? -- ??????? ??????? 10 people enter a bar: 1 who understands binary, ??????? 1 who doesn't, D who prefer to write it as hex, ??????? and 1 who narrowly avoided an off-by-one error. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 15 02:53:59 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 Oct 2018 08:53:59 +0100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> Message-ID: <20181015085359.339c5747@JRWUBU2> On Mon, 15 Oct 2018 01:55:24 +1100 Harshula via Unicode wrote: > 3) However, what you have observed is an issue with *explicit* > conjunct creation. After the segmentation is completed, the > layout/shaping engine needs to first check if there is a > corresponding lookup for the explicit conjunct, if not, then it needs > to remove the ZWJ and redo the segmentation and lookup(s). Perhaps > that is not happening in Harfbuzz. This indeed seems to be the problem with HarfBuzz and with Windows 7 Uniscribe. Curiously, they almost adopt this behaviour when touching letters are not available. (The ZWJ seems not to be completely removed - in HarfBuzz at least it can result in the al-lakuna not interacting properly with the base character.) But where is this usually useful behaviour specified? 1. There may be nothing but time and money to stop fallbacks being built into the font. For example, what prohibits the rendering of a conjunct falling back to touching letters or a missing glyph symbol? 2. One could argue that the current behaviour falls back to a display; Pali in Thai script does use sequences of . The problem is that al-lakuna also acts as a vowel modifier. 3. 
What stops one arguing that a conjunct is an abstract character and that to render it with a sequence using a visible al-lakuna would violate its identity? Richard. From unicode at unicode.org Mon Oct 15 06:13:41 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 15 Oct 2018 13:13:41 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <000601d4642a$4274ec70$c75ec550$@xencraft.com> References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> Message-ID: Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st sentence, it is explicitly stated : In some circumstances, the use of padding ("=") in base-encoded data is not required or used. Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : > Philippe, > > > > Where is the use of whitespace or the idea that 1-byte pieces do not need > all the equal sign paddings documented? > > I read the rfc 3501 you pointed at, I don?t see it there. > > > > Are these part of any standards? Or are you claiming these are practices > despite the standards? If so, are these just tolerated by parsers, or are > they actually generated by encoders? > > > > What would be the rationale for supporting unnecessary whitespace? If > linebreaks are forced at some line length they can presumably be removed at > that length and not treated as part of the encoding. > > Maybe we differ on define where the encoding begins and ends, and where > higher level protocols prescribe how they are embedded within the protocol. > > > > Tex > > > > > > > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe > Verdy via Unicode > *Sent:* Sunday, October 14, 2018 1:41 AM > *To:* Adam Borowski > *Cc:* unicode Unicode Discussion > *Subject:* Re: Base64 encoding applied to different unicode texts always > yields different base64 texts ... true or false? > > > > Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is > enough to indicate the end of an octets-span. The extra = after it do not > add any other octet. and as well you're allowed to insert whitespaces > anywhere in the encoded stream (this is what ensures that the > Base64-encoded octets-stream will not be altered if line breaks are forced > anywhere (notably within the body of emails). > > > > So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, > LF, NEL) in the middle is non-significant and ignorable on decoding (their > "encoded" bit length is 0 and they don't terminate an octets-span, unlike > "=" which discards extra bits remaining from the encoded stream before that > are not on 8-bit boundaries). > > > > Also: > > - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol > before "=" can vary in its 4 lowest bits (which are then ignored/discarded > by the "=" symbol) > > - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" > symbol before "=" can vary in its 2 lowest bits (which are then > ignored/discarded by the "=" symbol) > > > > So you can use Base64 by encoding each octet in separate pieces, as one > Base64 symbol followed by an "=" symbol, and even insert any number of > whitespaces between them: there's a infinite number of valid Base64 > encodings for representing the same octets-stream payload. 
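(A quick check of that claim, reusing the "meow" example quoted further down in this message: a sketch assuming Python 3's base64 module.)

    import base64

    # One payload, two of the many valid encoded forms discussed in the thread:
    whole  = b"bWVvdw=="           # "meow" encoded in one go
    pieces = [b"bWU=", b"b3c="]    # the same bytes encoded piecewise as "me" + "ow"

    # Decoding each padded piece and concatenating the results recovers the
    # identical octet stream; many strict decoders stop or complain at interior
    # padding, so splitting at the '=' boundaries first is the portable route.
    joined = b"".join(base64.b64decode(p) for p in pieces)
    assert joined == base64.b64decode(whole) == b"meow"
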
> > > > Base64 allows encoding any octets streams but not directly any > bits-streams : it assumes that the effective bits-stream has a binary > length multiple of 8. To encode a bits-stream with an exact number of bits > (not multiple of 8), you need to encode an extra payload to indicate the > effective number of bits to keep at end of the encoded octets-stream (or at > start): > > - Base64 does not specify how you convert a bitstream of arbitrary length > to an octets-stream; > > - for that purpose, you may need to pad the bits-stream at start or at end > with 1 to 6 bits (so that it the resulting bitstream has a length multiple > of 8, then encodable with Base64 which takes only octets on input). > > - these extra padding bits are not significant for the original bitstream, > but are significant for the Base64 encoder/decoder, they will be discarded > by the bitstream decoder built on top of the Base64 decoder, but not by the > Base64 decoder itself. > > > > You need to encode somewhere with the bitstream encoder how many padding > bits (0 to 7) are present at start or end of the octets-stream; this can be > done: > > - as a separate payload (not encoded by Base64), or > > - by prepending 3 bits at start of the bits-stream then padded at end with > 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 > encoding. > > - by appending 3 bits at end of the bits-stream, just after 1 to 7 random > bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. > > Finally your bits-stream decoder will be able to use this padding count to > discard these random padding bits (and possibly realign the stream on > different byte-boundaries when the effective bitlength bits-stream payload > is not a multiple of 8 and padding bits were added) > > > > Base64 also does not specify how bits of the original bits-stream payload > are packed into the octets-stream input suitable for Base64-encoding, > notably it does not specify their order and endian-ness. The same remark > applies as well for MIME, HTTP. So lot of network protocols and file > formats need to how to properly encode which possible option is used to > encode bits-streams of arbitrary length, or need to specify which default > choice to apply if this option is not encoded, or which option must be used > (with no possible variation). And this also adds to the number of distinct > encodings that are possible but are still equivalent for the same effective > bits-stream payload. > > > > All these allowed variations are from the encoder perspective. For > interoperability, the decoder has to be flexible and to support various > options to be compatible with different implementations of the encoder, > notably when the encoder was run on a different system. And this is the > case for the MIME transport by mail, or for HTTP and FTP transports, or > file/media storage formats even if the file is stored on the same system, > because it may actually be a copy stored locally but coming from another > system where the file was actually encoded). > > > > Now if we come back to the encoding of plain-text payloads, Unicode just > specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code > points (it actually does not mandate an exact bit-length because the range > does not fully fit exactly to 21 bits and an encoder can still pack > multiple code points together into more compact code units. 
> > > > However Unicode provides and standardizes several encodings (UTF-8/16/32) > which use code units whose size is directly suitable as input for an > octets-stream, so that they are directly encodable with Base64, without > having to specify an extra layer for the bits-stream encoder/decoder. > > > > But many other encodings are still possible (and can be conforming to > Unicode, provided they preserve each Unicode scalar value, or at least the > code point identity because an encoder/decoder is not required to support > non-character code points such as surrogates or U+FFFE), where Base64 may > be used for internally generated octets-streams. > > > > > > Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode < > unicode at unicode.org> a ?crit : > > On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > > unicode at unicode.org> a ?crit : > > > The only variance is described as: > > > > > > Care must be taken to use the proper octets for line breaks if base64 > > > encoding is applied directly to text material that has not been > > > converted to canonical form. In particular, text line breaks must be > > > converted into CRLF sequences prior to base64 encoding. The > > > important thing to note is that this may be done directly by the > > > encoder rather than in a prior canonicalization step in some > > > implementations. > > > > > > This is MIME, it specifies (in the same RFC): > > > > I've not spoken aboutr the encoding of new lines **in the actual encoded > > text**: > > - if their existing text-encoding ever gets converted to Base64 as if > the > > whole text was an opaque binary object, their initial text-encoding will > be > > preserved (so yes it will preserve the way these embedded newlines are > > encoded as CR, LF, CR+LF, NL...) > > > > I spoke about newlines used in the transport syntax to split the initial > > binary object (which may actually contain text but it does not matter). > > MIME defines this operation and even requires splitting the binary object > > in fragments with maximum binary size so that these binary fragments can > be > > converted with Base64 into lines with maximum length. In the MIME Base64 > > representation you can insert newlines anywhere between fragments encoded > > separately. > > There's another kind of fragmentation that can make the encoding differ > (but > still decode to the same payload): > > The data stream gets split into 3-byte internal, 4-byte external packets. > Any packet may contain less than those 3 bytes, in which cases it is padded > with = characters: > 3 bytes XXXX > 2 bytes XXX= > 1 byte XX== > > Usually, such smaller packets happen only at the end of a message, but to > support encoding a stream piecewise, they are allowed at any point. > > For example: > "meow" is bWVvdw== > "me""ow" is bWU=b3c= > yet both carry the same payload. > > > Base64 is used exactly to support this flexibility in transport (or > > storage) without altering any bit of the initial content once it is > > decoded. > > Right, any such variations are in packaging only. > > > ???? > -- > ??????? > ??????? 10 people enter a bar: 1 who understands binary, > ??????? 1 who doesn't, D who prefer to write it as hex, > ??????? and 1 who narrowly avoided an off-by-one error. > > -------------- next part -------------- An HTML attachment was scrubbed... 
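(A concrete instance of the "padding not required" case cited above: a sketch assuming Python 3's base64 module; the data and the token variable are invented for the example.)

    import base64

    data = b"any carefully crafted octet stream"

    # URL-safe alphabet with the trailing '=' padding dropped, as URL and token
    # formats commonly do when the encoded length is carried by the protocol itself.
    token = base64.urlsafe_b64encode(data).rstrip(b"=")

    # A generic RFC 4648 decoder still expects a length that is a multiple of 4,
    # so the padding has to be restored (or the length known) before decoding.
    restored = base64.urlsafe_b64decode(token + b"=" * (-len(token) % 4))
    assert restored == data

The padding can only be dropped here because the token boundary itself tells the decoder where the data ends; otherwise the length would have to be conveyed some other way.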
URL: From unicode at unicode.org Mon Oct 15 06:24:38 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 15 Oct 2018 13:24:38 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <000601d4642a$4274ec70$c75ec550$@xencraft.com> References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> Message-ID: Also the rationale for supporting "unnecessary" whitespace is found in MIME's version of Base64, also in RFCs describing encoding formats for digital certificates, or for exchanging public keys in encryption algorithms like PGP (notably, but not only, as texts in the body of emails or in documentations and websites). Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : > Philippe, > > > > Where is the use of whitespace or the idea that 1-byte pieces do not need > all the equal sign paddings documented? > > I read the rfc 3501 you pointed at, I don?t see it there. > > > > Are these part of any standards? Or are you claiming these are practices > despite the standards? If so, are these just tolerated by parsers, or are > they actually generated by encoders? > > > > What would be the rationale for supporting unnecessary whitespace? If > linebreaks are forced at some line length they can presumably be removed at > that length and not treated as part of the encoding. > > Maybe we differ on define where the encoding begins and ends, and where > higher level protocols prescribe how they are embedded within the protocol. > > > > Tex > > > > > > > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe > Verdy via Unicode > *Sent:* Sunday, October 14, 2018 1:41 AM > *To:* Adam Borowski > *Cc:* unicode Unicode Discussion > *Subject:* Re: Base64 encoding applied to different unicode texts always > yields different base64 texts ... true or false? > > > > Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is > enough to indicate the end of an octets-span. The extra = after it do not > add any other octet. and as well you're allowed to insert whitespaces > anywhere in the encoded stream (this is what ensures that the > Base64-encoded octets-stream will not be altered if line breaks are forced > anywhere (notably within the body of emails). > > > > So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, > LF, NEL) in the middle is non-significant and ignorable on decoding (their > "encoded" bit length is 0 and they don't terminate an octets-span, unlike > "=" which discards extra bits remaining from the encoded stream before that > are not on 8-bit boundaries). > > > > Also: > > - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol > before "=" can vary in its 4 lowest bits (which are then ignored/discarded > by the "=" symbol) > > - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" > symbol before "=" can vary in its 2 lowest bits (which are then > ignored/discarded by the "=" symbol) > > > > So you can use Base64 by encoding each octet in separate pieces, as one > Base64 symbol followed by an "=" symbol, and even insert any number of > whitespaces between them: there's a infinite number of valid Base64 > encodings for representing the same octets-stream payload. > > > > Base64 allows encoding any octets streams but not directly any > bits-streams : it assumes that the effective bits-stream has a binary > length multiple of 8. 
To encode a bits-stream with an exact number of bits > (not multiple of 8), you need to encode an extra payload to indicate the > effective number of bits to keep at end of the encoded octets-stream (or at > start): > > - Base64 does not specify how you convert a bitstream of arbitrary length > to an octets-stream; > > - for that purpose, you may need to pad the bits-stream at start or at end > with 1 to 6 bits (so that it the resulting bitstream has a length multiple > of 8, then encodable with Base64 which takes only octets on input). > > - these extra padding bits are not significant for the original bitstream, > but are significant for the Base64 encoder/decoder, they will be discarded > by the bitstream decoder built on top of the Base64 decoder, but not by the > Base64 decoder itself. > > > > You need to encode somewhere with the bitstream encoder how many padding > bits (0 to 7) are present at start or end of the octets-stream; this can be > done: > > - as a separate payload (not encoded by Base64), or > > - by prepending 3 bits at start of the bits-stream then padded at end with > 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 > encoding. > > - by appending 3 bits at end of the bits-stream, just after 1 to 7 random > bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. > > Finally your bits-stream decoder will be able to use this padding count to > discard these random padding bits (and possibly realign the stream on > different byte-boundaries when the effective bitlength bits-stream payload > is not a multiple of 8 and padding bits were added) > > > > Base64 also does not specify how bits of the original bits-stream payload > are packed into the octets-stream input suitable for Base64-encoding, > notably it does not specify their order and endian-ness. The same remark > applies as well for MIME, HTTP. So lot of network protocols and file > formats need to how to properly encode which possible option is used to > encode bits-streams of arbitrary length, or need to specify which default > choice to apply if this option is not encoded, or which option must be used > (with no possible variation). And this also adds to the number of distinct > encodings that are possible but are still equivalent for the same effective > bits-stream payload. > > > > All these allowed variations are from the encoder perspective. For > interoperability, the decoder has to be flexible and to support various > options to be compatible with different implementations of the encoder, > notably when the encoder was run on a different system. And this is the > case for the MIME transport by mail, or for HTTP and FTP transports, or > file/media storage formats even if the file is stored on the same system, > because it may actually be a copy stored locally but coming from another > system where the file was actually encoded). > > > > Now if we come back to the encoding of plain-text payloads, Unicode just > specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code > points (it actually does not mandate an exact bit-length because the range > does not fully fit exactly to 21 bits and an encoder can still pack > multiple code points together into more compact code units. 
> > > > However Unicode provides and standardizes several encodings (UTF-8/16/32) > which use code units whose size is directly suitable as input for an > octets-stream, so that they are directly encodable with Base64, without > having to specify an extra layer for the bits-stream encoder/decoder. > > > > But many other encodings are still possible (and can be conforming to > Unicode, provided they preserve each Unicode scalar value, or at least the > code point identity because an encoder/decoder is not required to support > non-character code points such as surrogates or U+FFFE), where Base64 may > be used for internally generated octets-streams. > > > > > > Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode < > unicode at unicode.org> a ?crit : > > On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > > unicode at unicode.org> a ?crit : > > > The only variance is described as: > > > > > > Care must be taken to use the proper octets for line breaks if base64 > > > encoding is applied directly to text material that has not been > > > converted to canonical form. In particular, text line breaks must be > > > converted into CRLF sequences prior to base64 encoding. The > > > important thing to note is that this may be done directly by the > > > encoder rather than in a prior canonicalization step in some > > > implementations. > > > > > > This is MIME, it specifies (in the same RFC): > > > > I've not spoken aboutr the encoding of new lines **in the actual encoded > > text**: > > - if their existing text-encoding ever gets converted to Base64 as if > the > > whole text was an opaque binary object, their initial text-encoding will > be > > preserved (so yes it will preserve the way these embedded newlines are > > encoded as CR, LF, CR+LF, NL...) > > > > I spoke about newlines used in the transport syntax to split the initial > > binary object (which may actually contain text but it does not matter). > > MIME defines this operation and even requires splitting the binary object > > in fragments with maximum binary size so that these binary fragments can > be > > converted with Base64 into lines with maximum length. In the MIME Base64 > > representation you can insert newlines anywhere between fragments encoded > > separately. > > There's another kind of fragmentation that can make the encoding differ > (but > still decode to the same payload): > > The data stream gets split into 3-byte internal, 4-byte external packets. > Any packet may contain less than those 3 bytes, in which cases it is padded > with = characters: > 3 bytes XXXX > 2 bytes XXX= > 1 byte XX== > > Usually, such smaller packets happen only at the end of a message, but to > support encoding a stream piecewise, they are allowed at any point. > > For example: > "meow" is bWVvdw== > "me""ow" is bWU=b3c= > yet both carry the same payload. > > > Base64 is used exactly to support this flexibility in transport (or > > storage) without altering any bit of the initial content once it is > > decoded. > > Right, any such variations are in packaging only. > > > ???? > -- > ??????? > ??????? 10 people enter a bar: 1 who understands binary, > ??????? 1 who doesn't, D who prefer to write it as hex, > ??????? and 1 who narrowly avoided an off-by-one error. > > -------------- next part -------------- An HTML attachment was scrubbed... 
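(To ground the point above about armored formats such as certificates and PGP blocks: a sketch assuming Python 3's base64 and textwrap modules; the BEGIN/END labels and the byte string are placeholders, not a real certificate.)

    import base64
    import textwrap

    der_blob = bytes(range(48)) * 4     # stand-in for certificate/key bytes

    # PEM-style armor: the base64 body folded at 64 columns between BEGIN/END lines.
    body = base64.b64encode(der_blob).decode("ascii")
    pem = "\n".join(["-----BEGIN EXAMPLE-----",
                     *textwrap.wrap(body, 64),
                     "-----END EXAMPLE-----"])

    # The folding whitespace is ignorable on decode: drop the header and footer,
    # join the lines, and the original octets come back unchanged.
    inner = "".join(pem.splitlines()[1:-1])
    assert base64.b64decode(inner) == der_blob
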
URL: From unicode at unicode.org Mon Oct 15 06:57:14 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 15 Oct 2018 13:57:14 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> Message-ID: If you want an example where padding with "=" is not used at all, - look into URL-shortening schemes - look into database fields or data input forms and numerous data formats where the "=" sign is restricted (just like in URLs and file paths, or in identifiers) Padding is not used anywhere in the middle of the binary encoding or even at end, only the 64 symbols of the encoding alphabet are needed and the extra 2 or 4 lowest bits that may be encoded in the last character of the encoded sequence are discarded by the decoder (these extra bits are not necessarily set to 0 by encoders in the last symbol, even if this is the canonical form recommanded in encoders, their value is simply ignored by decoders). Some Base64 encoders do not necessarily encode binary octets-streams, but bits-streams whose length in bits is not necessarily multiple of 8, in which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last symbol of the encoded sequence. Other encoders use streams of binary code units that are larger than 8 bits, and may want to encode more padding symbols to force the alignment of data required in their associated decoders, or will choose to not use any padding at all, letting the decoder discard the trailing bits themselves at end of the encoded stream. Le lun. 15 oct. 2018 ? 13:24, Philippe Verdy a ?crit : > Also the rationale for supporting "unnecessary" whitespace is found in > MIME's version of Base64, also in RFCs describing encoding formats for > digital certificates, or for exchanging public keys in encryption > algorithms like PGP (notably, but not only, as texts in the body of emails > or in documentations and websites). > > Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : > >> Philippe, >> >> >> >> Where is the use of whitespace or the idea that 1-byte pieces do not need >> all the equal sign paddings documented? >> >> I read the rfc 3501 you pointed at, I don?t see it there. >> >> >> >> Are these part of any standards? Or are you claiming these are practices >> despite the standards? If so, are these just tolerated by parsers, or are >> they actually generated by encoders? >> >> >> >> What would be the rationale for supporting unnecessary whitespace? If >> linebreaks are forced at some line length they can presumably be removed at >> that length and not treated as part of the encoding. >> >> Maybe we differ on define where the encoding begins and ends, and where >> higher level protocols prescribe how they are embedded within the protocol. >> >> >> >> Tex >> >> >> >> >> >> >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe >> Verdy via Unicode >> *Sent:* Sunday, October 14, 2018 1:41 AM >> *To:* Adam Borowski >> *Cc:* unicode Unicode Discussion >> *Subject:* Re: Base64 encoding applied to different unicode texts always >> yields different base64 texts ... true or false? >> >> >> >> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is >> enough to indicate the end of an octets-span. The extra = after it do not >> add any other octet. 
and as well you're allowed to insert whitespaces >> anywhere in the encoded stream (this is what ensures that the >> Base64-encoded octets-stream will not be altered if line breaks are forced >> anywhere (notably within the body of emails). >> >> >> >> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, >> CR, LF, NEL) in the middle is non-significant and ignorable on decoding >> (their "encoded" bit length is 0 and they don't terminate an octets-span, >> unlike "=" which discards extra bits remaining from the encoded stream >> before that are not on 8-bit boundaries). >> >> >> >> Also: >> >> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" >> symbol before "=" can vary in its 4 lowest bits (which are then >> ignored/discarded by the "=" symbol) >> >> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" >> symbol before "=" can vary in its 2 lowest bits (which are then >> ignored/discarded by the "=" symbol) >> >> >> >> So you can use Base64 by encoding each octet in separate pieces, as one >> Base64 symbol followed by an "=" symbol, and even insert any number of >> whitespaces between them: there's a infinite number of valid Base64 >> encodings for representing the same octets-stream payload. >> >> >> >> Base64 allows encoding any octets streams but not directly any >> bits-streams : it assumes that the effective bits-stream has a binary >> length multiple of 8. To encode a bits-stream with an exact number of bits >> (not multiple of 8), you need to encode an extra payload to indicate the >> effective number of bits to keep at end of the encoded octets-stream (or at >> start): >> >> - Base64 does not specify how you convert a bitstream of arbitrary length >> to an octets-stream; >> >> - for that purpose, you may need to pad the bits-stream at start or at >> end with 1 to 6 bits (so that it the resulting bitstream has a length >> multiple of 8, then encodable with Base64 which takes only octets on input). >> >> - these extra padding bits are not significant for the original >> bitstream, but are significant for the Base64 encoder/decoder, they will be >> discarded by the bitstream decoder built on top of the Base64 decoder, but >> not by the Base64 decoder itself. >> >> >> >> You need to encode somewhere with the bitstream encoder how many padding >> bits (0 to 7) are present at start or end of the octets-stream; this can be >> done: >> >> - as a separate payload (not encoded by Base64), or >> >> - by prepending 3 bits at start of the bits-stream then padded at end >> with 1 to 7 random bits to get a bit-length multiple of 8 suitable for >> Base64 encoding. >> >> - by appending 3 bits at end of the bits-stream, just after 1 to 7 >> random bits needed to get a bit-length multiple of 8 suitable for Base64 >> encoding. >> >> Finally your bits-stream decoder will be able to use this padding count >> to discard these random padding bits (and possibly realign the stream on >> different byte-boundaries when the effective bitlength bits-stream payload >> is not a multiple of 8 and padding bits were added) >> >> >> >> Base64 also does not specify how bits of the original bits-stream payload >> are packed into the octets-stream input suitable for Base64-encoding, >> notably it does not specify their order and endian-ness. The same remark >> applies as well for MIME, HTTP. 
So lot of network protocols and file >> formats need to how to properly encode which possible option is used to >> encode bits-streams of arbitrary length, or need to specify which default >> choice to apply if this option is not encoded, or which option must be used >> (with no possible variation). And this also adds to the number of distinct >> encodings that are possible but are still equivalent for the same effective >> bits-stream payload. >> >> >> >> All these allowed variations are from the encoder perspective. For >> interoperability, the decoder has to be flexible and to support various >> options to be compatible with different implementations of the encoder, >> notably when the encoder was run on a different system. And this is the >> case for the MIME transport by mail, or for HTTP and FTP transports, or >> file/media storage formats even if the file is stored on the same system, >> because it may actually be a copy stored locally but coming from another >> system where the file was actually encoded). >> >> >> >> Now if we come back to the encoding of plain-text payloads, Unicode just >> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code >> points (it actually does not mandate an exact bit-length because the range >> does not fully fit exactly to 21 bits and an encoder can still pack >> multiple code points together into more compact code units. >> >> >> >> However Unicode provides and standardizes several encodings (UTF-8/16/32) >> which use code units whose size is directly suitable as input for an >> octets-stream, so that they are directly encodable with Base64, without >> having to specify an extra layer for the bits-stream encoder/decoder. >> >> >> >> But many other encodings are still possible (and can be conforming to >> Unicode, provided they preserve each Unicode scalar value, or at least the >> code point identity because an encoder/decoder is not required to support >> non-character code points such as surrogates or U+FFFE), where Base64 may >> be used for internally generated octets-streams. >> >> >> >> >> >> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode < >> unicode at unicode.org> a ?crit : >> >> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode >> wrote: >> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < >> > unicode at unicode.org> a ?crit : >> > > The only variance is described as: >> > > >> > > Care must be taken to use the proper octets for line breaks if >> base64 >> > > encoding is applied directly to text material that has not been >> > > converted to canonical form. In particular, text line breaks must >> be >> > > converted into CRLF sequences prior to base64 encoding. The >> > > important thing to note is that this may be done directly by the >> > > encoder rather than in a prior canonicalization step in some >> > > implementations. >> > > >> > > This is MIME, it specifies (in the same RFC): >> > >> > I've not spoken aboutr the encoding of new lines **in the actual encoded >> > text**: >> > - if their existing text-encoding ever gets converted to Base64 as if >> the >> > whole text was an opaque binary object, their initial text-encoding >> will be >> > preserved (so yes it will preserve the way these embedded newlines are >> > encoded as CR, LF, CR+LF, NL...) >> > >> > I spoke about newlines used in the transport syntax to split the initial >> > binary object (which may actually contain text but it does not matter). 
>> > MIME defines this operation and even requires splitting the binary >> object >> > in fragments with maximum binary size so that these binary fragments >> can be >> > converted with Base64 into lines with maximum length. In the MIME Base64 >> > representation you can insert newlines anywhere between fragments >> encoded >> > separately. >> >> There's another kind of fragmentation that can make the encoding differ >> (but >> still decode to the same payload): >> >> The data stream gets split into 3-byte internal, 4-byte external packets. >> Any packet may contain less than those 3 bytes, in which cases it is >> padded >> with = characters: >> 3 bytes XXXX >> 2 bytes XXX= >> 1 byte XX== >> >> Usually, such smaller packets happen only at the end of a message, but to >> support encoding a stream piecewise, they are allowed at any point. >> >> For example: >> "meow" is bWVvdw== >> "me""ow" is bWU=b3c= >> yet both carry the same payload. >> >> > Base64 is used exactly to support this flexibility in transport (or >> > storage) without altering any bit of the initial content once it is >> > decoded. >> >> Right, any such variations are in packaging only. >> >> >> ???? >> -- >> ??????? >> ??????? 10 people enter a bar: 1 who understands binary, >> ??????? 1 who doesn't, D who prefer to write it as hex, >> ??????? and 1 who narrowly avoided an off-by-one error. >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 15 07:11:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 15 Oct 2018 14:11:58 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> Message-ID: Note that all these discussion about padding applies to all other base-N encodings, including base-10. For example to represent numbers of arbitrary precision: padding does not require a separate symbol but can use the "0" digit which is part of the 10-symbols alphabet, or encoders can discard them on the left, or on the right if there's a decimal dot; when the precision is less than a integral number of decimal digits, the extra bits or fractional bits of information in the last digit of the encoded sequence does not matter, encoders may choose to not set them to 0 but may prefer to use rounding which may conditionally set these bits to 1, depedning on the value of the last significant bits or fractional bits of maximum precision. As well the same decoders may want to use extra whitespaces (notably to limit line lengths at arbitrary lengths, notably for embedding the encoded sequences in printed documents or documents with a page layout and rendered with a readable font size suitable for the page width, or for presentation purpose by grouping symbols). In summary, padding is not required at all by all Base-N encoders/decoders, and non significant whitespace is frequently needed. Le lun. 15 oct. 2018 ? 
13:57, Philippe Verdy a ?crit : > If you want an example where padding with "=" is not used at all, > - look into URL-shortening schemes > - look into database fields or data input forms and numerous data formats > where the "=" sign is restricted (just like in URLs and file paths, or in > identifiers) > Padding is not used anywhere in the middle of the binary encoding or even > at end, only the 64 symbols of the encoding alphabet are needed and the > extra 2 or 4 lowest bits that may be encoded in the last character of the > encoded sequence are discarded by the decoder (these extra bits are not > necessarily set to 0 by encoders in the last symbol, even if this is the > canonical form recommanded in encoders, their value is simply ignored by > decoders). > Some Base64 encoders do not necessarily encode binary octets-streams, but > bits-streams whose length in bits is not necessarily multiple of 8, in > which case there may be 1 to 7 trailing bits (not just 2 or 4) in the last > symbol of the encoded sequence. > Other encoders use streams of binary code units that are larger than 8 > bits, and may want to encode more padding symbols to force the alignment of > data required in their associated decoders, or will choose to not use any > padding at all, letting the decoder discard the trailing bits themselves at > end of the encoded stream. > > Le lun. 15 oct. 2018 ? 13:24, Philippe Verdy a > ?crit : > >> Also the rationale for supporting "unnecessary" whitespace is found in >> MIME's version of Base64, also in RFCs describing encoding formats for >> digital certificates, or for exchanging public keys in encryption >> algorithms like PGP (notably, but not only, as texts in the body of emails >> or in documentations and websites). >> >> Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : >> >>> Philippe, >>> >>> >>> >>> Where is the use of whitespace or the idea that 1-byte pieces do not >>> need all the equal sign paddings documented? >>> >>> I read the rfc 3501 you pointed at, I don?t see it there. >>> >>> >>> >>> Are these part of any standards? Or are you claiming these are practices >>> despite the standards? If so, are these just tolerated by parsers, or are >>> they actually generated by encoders? >>> >>> >>> >>> What would be the rationale for supporting unnecessary whitespace? If >>> linebreaks are forced at some line length they can presumably be removed at >>> that length and not treated as part of the encoding. >>> >>> Maybe we differ on define where the encoding begins and ends, and where >>> higher level protocols prescribe how they are embedded within the protocol. >>> >>> >>> >>> Tex >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe >>> Verdy via Unicode >>> *Sent:* Sunday, October 14, 2018 1:41 AM >>> *To:* Adam Borowski >>> *Cc:* unicode Unicode Discussion >>> *Subject:* Re: Base64 encoding applied to different unicode texts >>> always yields different base64 texts ... true or false? >>> >>> >>> >>> Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is >>> enough to indicate the end of an octets-span. The extra = after it do not >>> add any other octet. and as well you're allowed to insert whitespaces >>> anywhere in the encoded stream (this is what ensures that the >>> Base64-encoded octets-stream will not be altered if line breaks are forced >>> anywhere (notably within the body of emails). 
>>> >>> >>> >>> So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, >>> CR, LF, NEL) in the middle is non-significant and ignorable on decoding >>> (their "encoded" bit length is 0 and they don't terminate an octets-span, >>> unlike "=" which discards extra bits remaining from the encoded stream >>> before that are not on 8-bit boundaries). >>> >>> >>> >>> Also: >>> >>> - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" >>> symbol before "=" can vary in its 4 lowest bits (which are then >>> ignored/discarded by the "=" symbol) >>> >>> - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" >>> symbol before "=" can vary in its 2 lowest bits (which are then >>> ignored/discarded by the "=" symbol) >>> >>> >>> >>> So you can use Base64 by encoding each octet in separate pieces, as one >>> Base64 symbol followed by an "=" symbol, and even insert any number of >>> whitespaces between them: there's a infinite number of valid Base64 >>> encodings for representing the same octets-stream payload. >>> >>> >>> >>> Base64 allows encoding any octets streams but not directly any >>> bits-streams : it assumes that the effective bits-stream has a binary >>> length multiple of 8. To encode a bits-stream with an exact number of bits >>> (not multiple of 8), you need to encode an extra payload to indicate the >>> effective number of bits to keep at end of the encoded octets-stream (or at >>> start): >>> >>> - Base64 does not specify how you convert a bitstream of arbitrary >>> length to an octets-stream; >>> >>> - for that purpose, you may need to pad the bits-stream at start or at >>> end with 1 to 6 bits (so that it the resulting bitstream has a length >>> multiple of 8, then encodable with Base64 which takes only octets on input). >>> >>> - these extra padding bits are not significant for the original >>> bitstream, but are significant for the Base64 encoder/decoder, they will be >>> discarded by the bitstream decoder built on top of the Base64 decoder, but >>> not by the Base64 decoder itself. >>> >>> >>> >>> You need to encode somewhere with the bitstream encoder how many padding >>> bits (0 to 7) are present at start or end of the octets-stream; this can be >>> done: >>> >>> - as a separate payload (not encoded by Base64), or >>> >>> - by prepending 3 bits at start of the bits-stream then padded at end >>> with 1 to 7 random bits to get a bit-length multiple of 8 suitable for >>> Base64 encoding. >>> >>> - by appending 3 bits at end of the bits-stream, just after 1 to 7 >>> random bits needed to get a bit-length multiple of 8 suitable for Base64 >>> encoding. >>> >>> Finally your bits-stream decoder will be able to use this padding count >>> to discard these random padding bits (and possibly realign the stream on >>> different byte-boundaries when the effective bitlength bits-stream payload >>> is not a multiple of 8 and padding bits were added) >>> >>> >>> >>> Base64 also does not specify how bits of the original bits-stream >>> payload are packed into the octets-stream input suitable for >>> Base64-encoding, notably it does not specify their order and endian-ness. >>> The same remark applies as well for MIME, HTTP. So lot of network protocols >>> and file formats need to how to properly encode which possible option is >>> used to encode bits-streams of arbitrary length, or need to specify which >>> default choice to apply if this option is not encoded, or which option must >>> be used (with no possible variation). 
And this also adds to the number of >>> distinct encodings that are possible but are still equivalent for the same >>> effective bits-stream payload. >>> >>> >>> >>> All these allowed variations are from the encoder perspective. For >>> interoperability, the decoder has to be flexible and to support various >>> options to be compatible with different implementations of the encoder, >>> notably when the encoder was run on a different system. And this is the >>> case for the MIME transport by mail, or for HTTP and FTP transports, or >>> file/media storage formats even if the file is stored on the same system, >>> because it may actually be a copy stored locally but coming from another >>> system where the file was actually encoded). >>> >>> >>> >>> Now if we come back to the encoding of plain-text payloads, Unicode just >>> specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code >>> points (it actually does not mandate an exact bit-length because the range >>> does not fully fit exactly to 21 bits and an encoder can still pack >>> multiple code points together into more compact code units. >>> >>> >>> >>> However Unicode provides and standardizes several encodings >>> (UTF-8/16/32) which use code units whose size is directly suitable as input >>> for an octets-stream, so that they are directly encodable with Base64, >>> without having to specify an extra layer for the bits-stream >>> encoder/decoder. >>> >>> >>> >>> But many other encodings are still possible (and can be conforming to >>> Unicode, provided they preserve each Unicode scalar value, or at least the >>> code point identity because an encoder/decoder is not required to support >>> non-character code points such as surrogates or U+FFFE), where Base64 may >>> be used for internally generated octets-streams. >>> >>> >>> >>> >>> >>> Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode < >>> unicode at unicode.org> a ?crit : >>> >>> On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode >>> wrote: >>> > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < >>> > unicode at unicode.org> a ?crit : >>> > > The only variance is described as: >>> > > >>> > > Care must be taken to use the proper octets for line breaks if >>> base64 >>> > > encoding is applied directly to text material that has not been >>> > > converted to canonical form. In particular, text line breaks must >>> be >>> > > converted into CRLF sequences prior to base64 encoding. The >>> > > important thing to note is that this may be done directly by the >>> > > encoder rather than in a prior canonicalization step in some >>> > > implementations. >>> > > >>> > > This is MIME, it specifies (in the same RFC): >>> > >>> > I've not spoken aboutr the encoding of new lines **in the actual >>> encoded >>> > text**: >>> > - if their existing text-encoding ever gets converted to Base64 as if >>> the >>> > whole text was an opaque binary object, their initial text-encoding >>> will be >>> > preserved (so yes it will preserve the way these embedded newlines are >>> > encoded as CR, LF, CR+LF, NL...) >>> > >>> > I spoke about newlines used in the transport syntax to split the >>> initial >>> > binary object (which may actually contain text but it does not matter). >>> > MIME defines this operation and even requires splitting the binary >>> object >>> > in fragments with maximum binary size so that these binary fragments >>> can be >>> > converted with Base64 into lines with maximum length. 
In the MIME >>> Base64 >>> > representation you can insert newlines anywhere between fragments >>> encoded >>> > separately. >>> >>> There's another kind of fragmentation that can make the encoding differ >>> (but >>> still decode to the same payload): >>> >>> The data stream gets split into 3-byte internal, 4-byte external packets. >>> Any packet may contain less than those 3 bytes, in which cases it is >>> padded >>> with = characters: >>> 3 bytes XXXX >>> 2 bytes XXX= >>> 1 byte XX== >>> >>> Usually, such smaller packets happen only at the end of a message, but to >>> support encoding a stream piecewise, they are allowed at any point. >>> >>> For example: >>> "meow" is bWVvdw== >>> "me""ow" is bWU=b3c= >>> yet both carry the same payload. >>> >>> > Base64 is used exactly to support this flexibility in transport (or >>> > storage) without altering any bit of the initial content once it is >>> > decoded. >>> >>> Right, any such variations are in packaging only. >>> >>> >>> ???? >>> -- >>> ??????? >>> ??????? 10 people enter a bar: 1 who understands binary, >>> ??????? 1 who doesn't, D who prefer to write it as hex, >>> ??????? and 1 who narrowly avoided an off-by-one error. >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 15 08:02:08 2018 From: unicode at unicode.org (Tex via Unicode) Date: Mon, 15 Oct 2018 06:02:08 -0700 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> Message-ID: <002801d46487$4821e350$d865a9f0$@xencraft.com> Philippe, quote the entire section: In some circumstances, the use of padding ("=") in base-encoded data is not required or used. In the general case, when assumptions about the size of transported data cannot be made, padding is required to yield correct decoded data. Implementations MUST include appropriate pad characters at the end of encoded data unless the specification referring to this document explicitly states otherwise. The first para clarifies that padding is required when the length is not otherwise known. Only if the length is provided or predefined can the padding be dropped. The second para clarifies it must be included unless the higher level protocol states otherwise, in which case it is likely using another mechanism to define length. It doesn?t seem to me to be as open ended as you implied in your initial mails, but well-defined depending on whether base64 is being used as spec?d in the RFC, or being explicitly modified to suit an embedding protocol. And certainly the first sentence in this section isn?t intended to be taken without the context of the rest of the section. tex From: Philippe Verdy [mailto:verdy_p at wanadoo.fr] Sent: Monday, October 15, 2018 4:14 AM To: Tex Texin Cc: Adam Borowski; unicode Unicode Discussion Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st sentence, it is explicitly stated : In some circumstances, the use of padding ("=") in base-encoded data is not required or used. Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : Philippe, Where is the use of whitespace or the idea that 1-byte pieces do not need all the equal sign paddings documented? 
I read the rfc 3501 you pointed at, I don?t see it there. Are these part of any standards? Or are you claiming these are practices despite the standards? If so, are these just tolerated by parsers, or are they actually generated by encoders? What would be the rationale for supporting unnecessary whitespace? If linebreaks are forced at some line length they can presumably be removed at that length and not treated as part of the encoding. Maybe we differ on define where the encoding begins and ends, and where higher level protocols prescribe how they are embedded within the protocol. Tex From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode Sent: Sunday, October 14, 2018 1:41 AM To: Adam Borowski Cc: unicode Unicode Discussion Subject: Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is enough to indicate the end of an octets-span. The extra = after it do not add any other octet. and as well you're allowed to insert whitespaces anywhere in the encoded stream (this is what ensures that the Base64-encoded octets-stream will not be altered if line breaks are forced anywhere (notably within the body of emails). So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, LF, NEL) in the middle is non-significant and ignorable on decoding (their "encoded" bit length is 0 and they don't terminate an octets-span, unlike "=" which discards extra bits remaining from the encoded stream before that are not on 8-bit boundaries). Also: - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol before "=" can vary in its 4 lowest bits (which are then ignored/discarded by the "=" symbol) - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" symbol before "=" can vary in its 2 lowest bits (which are then ignored/discarded by the "=" symbol) So you can use Base64 by encoding each octet in separate pieces, as one Base64 symbol followed by an "=" symbol, and even insert any number of whitespaces between them: there's a infinite number of valid Base64 encodings for representing the same octets-stream payload. Base64 allows encoding any octets streams but not directly any bits-streams : it assumes that the effective bits-stream has a binary length multiple of 8. To encode a bits-stream with an exact number of bits (not multiple of 8), you need to encode an extra payload to indicate the effective number of bits to keep at end of the encoded octets-stream (or at start): - Base64 does not specify how you convert a bitstream of arbitrary length to an octets-stream; - for that purpose, you may need to pad the bits-stream at start or at end with 1 to 6 bits (so that it the resulting bitstream has a length multiple of 8, then encodable with Base64 which takes only octets on input). - these extra padding bits are not significant for the original bitstream, but are significant for the Base64 encoder/decoder, they will be discarded by the bitstream decoder built on top of the Base64 decoder, but not by the Base64 decoder itself. You need to encode somewhere with the bitstream encoder how many padding bits (0 to 7) are present at start or end of the octets-stream; this can be done: - as a separate payload (not encoded by Base64), or - by prepending 3 bits at start of the bits-stream then padded at end with 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 encoding. 
- by appending 3 bits at end of the bits-stream, just after 1 to 7 random bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. Finally your bits-stream decoder will be able to use this padding count to discard these random padding bits (and possibly realign the stream on different byte-boundaries when the effective bitlength bits-stream payload is not a multiple of 8 and padding bits were added) Base64 also does not specify how bits of the original bits-stream payload are packed into the octets-stream input suitable for Base64-encoding, notably it does not specify their order and endian-ness. The same remark applies as well for MIME, HTTP. So lot of network protocols and file formats need to how to properly encode which possible option is used to encode bits-streams of arbitrary length, or need to specify which default choice to apply if this option is not encoded, or which option must be used (with no possible variation). And this also adds to the number of distinct encodings that are possible but are still equivalent for the same effective bits-stream payload. All these allowed variations are from the encoder perspective. For interoperability, the decoder has to be flexible and to support various options to be compatible with different implementations of the encoder, notably when the encoder was run on a different system. And this is the case for the MIME transport by mail, or for HTTP and FTP transports, or file/media storage formats even if the file is stored on the same system, because it may actually be a copy stored locally but coming from another system where the file was actually encoded). Now if we come back to the encoding of plain-text payloads, Unicode just specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code points (it actually does not mandate an exact bit-length because the range does not fully fit exactly to 21 bits and an encoder can still pack multiple code points together into more compact code units. However Unicode provides and standardizes several encodings (UTF-8/16/32) which use code units whose size is directly suitable as input for an octets-stream, so that they are directly encodable with Base64, without having to specify an extra layer for the bits-stream encoder/decoder. But many other encodings are still possible (and can be conforming to Unicode, provided they preserve each Unicode scalar value, or at least the code point identity because an encoder/decoder is not required to support non-character code points such as surrogates or U+FFFE), where Base64 may be used for internally generated octets-streams. Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode a ?crit : On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > unicode at unicode.org> a ?crit : > > The only variance is described as: > > > > Care must be taken to use the proper octets for line breaks if base64 > > encoding is applied directly to text material that has not been > > converted to canonical form. In particular, text line breaks must be > > converted into CRLF sequences prior to base64 encoding. The > > important thing to note is that this may be done directly by the > > encoder rather than in a prior canonicalization step in some > > implementations. 
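Read as code, the canonicalization the passage above requires is only a line-break rewrite applied before (or inside) the encoder; a minimal Python sketch, where UTF-8 is assumed purely to obtain octets and is not implied by the RFC text:

    import base64

    text = "first line\nsecond line\n"
    # Convert whatever the local line-break convention is into CRLF,
    # then hand the canonical octets to the Base64 encoder.
    canonical = text.replace("\r\n", "\n").replace("\n", "\r\n")
    encoded = base64.b64encode(canonical.encode("utf-8"))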
> > > > This is MIME, it specifies (in the same RFC): > > I've not spoken aboutr the encoding of new lines **in the actual encoded > text**: > - if their existing text-encoding ever gets converted to Base64 as if the > whole text was an opaque binary object, their initial text-encoding will be > preserved (so yes it will preserve the way these embedded newlines are > encoded as CR, LF, CR+LF, NL...) > > I spoke about newlines used in the transport syntax to split the initial > binary object (which may actually contain text but it does not matter). > MIME defines this operation and even requires splitting the binary object > in fragments with maximum binary size so that these binary fragments can be > converted with Base64 into lines with maximum length. In the MIME Base64 > representation you can insert newlines anywhere between fragments encoded > separately. There's another kind of fragmentation that can make the encoding differ (but still decode to the same payload): The data stream gets split into 3-byte internal, 4-byte external packets. Any packet may contain less than those 3 bytes, in which cases it is padded with = characters: 3 bytes XXXX 2 bytes XXX= 1 byte XX== Usually, such smaller packets happen only at the end of a message, but to support encoding a stream piecewise, they are allowed at any point. For example: "meow" is bWVvdw== "me""ow" is bWU=b3c= yet both carry the same payload. > Base64 is used exactly to support this flexibility in transport (or > storage) without altering any bit of the initial content once it is > decoded. Right, any such variations are in packaging only. ???? -- ??????? ??????? 10 people enter a bar: 1 who understands binary, ??????? 1 who doesn't, D who prefer to write it as hex, ??????? and 1 who narrowly avoided an off-by-one error. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 15 10:47:36 2018 From: unicode at unicode.org (Harshula via Unicode) Date: Tue, 16 Oct 2018 02:47:36 +1100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <20181015085359.339c5747@JRWUBU2> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> Message-ID: <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> Hi Richard, On 15/10/18 6:53 pm, Richard Wordingham via Unicode wrote: > On Mon, 15 Oct 2018 01:55:24 +1100 > Harshula via Unicode wrote: > >> 3) However, what you have observed is an issue with *explicit* >> conjunct creation. After the segmentation is completed, the >> layout/shaping engine needs to first check if there is a >> corresponding lookup for the explicit conjunct, if not, then it needs >> to remove the ZWJ and redo the segmentation and lookup(s). Perhaps >> that is not happening in Harfbuzz. > > This indeed seems to be the problem with HarfBuzz and with Windows 7 > Uniscribe. Curiously, they almost adopt this behaviour when touching > letters are not available. (The ZWJ seems not to be completely removed > - in HarfBuzz at least it can result in the al-lakuna not interacting > properly with the base character.) > > But where is this usually useful behaviour specified? > > 1. There may be nothing but time and money to stop fallbacks being > built into the font. For example, what prohibits the rendering of a > conjunct falling back to touching letters or a missing glyph symbol? I had not considered the missing glyph symbol. 
Perhaps that is the most accurate solution when a font is missing a glyph during an *explicit* conjunct lookup. Note, touching letters are formed by , so they should not be displayed as a fallback for conjuncts. cya, # From unicode at unicode.org Mon Oct 15 11:55:39 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Mon, 15 Oct 2018 18:55:39 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> Message-ID: <20181015165539.v6fy9%steffen@sdaoden.eu> Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554 at DougEwell>: |Steffen Nurpmeso wrote: |> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions |> (MIME) Part One: Format of Internet Message Bodies). | |Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data |Encodings." RFC 2045 defines a particular implementation of base64, |specific to transporting Internet mail in a 7-bit environment. | |RFC 4648 discusses many of the "higher-level protocol" topics that some |people are focusing on, such as separating the base64-encoded output |into lines of length 72 (or other), alternative target code unit sets or |"alphabets," and padding characters. It would be helpful for everyone to |read this particular RFC before concluding that these topics have not |been considered, or that they compromise round-tripping or other |characteristics of base64. | |I had assumed that when Roger asked about "base64 encoding," he was |asking about the basic definition of base64. Sure; i have only followed the discussion superficially, and even though everybody can read RFCs, i felt the necessity to polemicize against the false however i look at it "MIME actually splits a binary object into multiple fragments at random positions". Solely my fault. --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Oct 15 12:03:42 2018 From: unicode at unicode.org (Peter Saint-Andre via Unicode) Date: Mon, 15 Oct 2018 11:03:42 -0600 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <2A67B4F082F74F8AADF34BA11D885554@DougEwell> Message-ID: <25de2517-14f0-d05c-9ece-02e9644dad6a@mozilla.com> On 10/14/18 3:59 PM, Philippe Verdy via Unicode wrote: > > > Le?dim. 14 oct. 2018 ??21:21, Doug Ewell via Unicode > > a ?crit?: > > Steffen Nurpmeso wrote: > > > Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions > > (MIME) Part One: Format of Internet Message Bodies). > > Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data > Encodings." RFC 2045 defines a particular implementation of base64, > specific to transporting Internet mail in a 7-bit environment. > > > Wrong, this is "specific" to transporting Internet mail in any 7 bit or > 8 bit environment (today almost all mail agents are operating in 8 bit), > and then it is referenced directly by HTTP (and its HTTPS variant). > > So this is no so "specific". MIME is extremely popular, RFC 4648 is > extremely exotic (and RFC 4648 is wrong when saying that IMAP is very > specific as it is now a very popular protocol, widely used as well). 
> MIME is so frequently used, that almost all people refer to it when they > look for Base64, or do not explicitly state that another definition > (found in an exotic RFC) is explicitly used. RFC 4648 is used in many, many Internet protocols. It's definitely not "extremely exotic". Peter From unicode at unicode.org Mon Oct 15 13:22:07 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 15 Oct 2018 20:22:07 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: <002801d46487$4821e350$d865a9f0$@xencraft.com> References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> <002801d46487$4821e350$d865a9f0$@xencraft.com> Message-ID: Padding itself does not clearly indicate the length. It's an artefact that **may** be infered only in some other layers of protocols which specify when and how padding is needed (and how many padding bytes are required or accepted), it works only if these upper layer protocols are using **octets** streams, but it is still not usable for more general bitstreams (with arbitrary bit lengths). This RFC does not mandate/require these padding bytes and in fact many upper layer protocols do not ever need it (including UTF-7 for example), they are never necessary to infer a length in octets and insufficient for specifying a length in bits. As well the usage in MIME (where there's a requirement that lines of headers or in the content body is limited to 1000 bytes) requires free splitting of Base64 (there's no agreed maximum length, some sources insist it should not be more than 72 bytes, others use 80 bytes, but mail forwarding may add other characters at start of lines, forcing them to be shorter (leaving for example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): this means that padding may not be used where one would expect them, and padding can event occur in the middle of the encoded stream (not just at end) along with other whitespaces or separators (like "> " at start of lines in cited messages). More generally the padding in MIME offers no benefit at all. The actual length is infered from the whole content body, and it's just safer to ignore/discard all padding symbols in decoders (just like they will discard whitespaces or ">"). If one wants to get a sure indication that the stream is not truncated and has the expected length, the encoded message must either embed this length as part of the original binary stream itself, or can embed secure "digital signatures", "message digests" or "hashes", or the length can be specified separately in the unencoded MIME body, or as part of the MIME header if the whole MIME content body is specified as using a base64 encoding. The same applies to HTTP. I have rarely seen RFC 4648 used alone outside of another upper layer protocol. This statement in RFC 4648 section 3.1 is for example completely wrong for Base16 where paddings are almost always avoided. Various other Base-N profiles for other upper layer protocols never need (and sometime even forbid) the presence of any padding symbol, or consider that paddding can also be made using the bits representing 0 to pad the original binary stream, or can be made using other ignored/discard whitespaces or symbols, without assigning any specific role to "=" (as a length indicator or stream terminator). Le lun. 15 oct. 2018 ? 
15:02, Tex a ?crit : > Philippe, quote the entire section: > > > > In some circumstances, the use of padding ("=") in base-encoded data > > is not required or used. In the general case, when assumptions about > > the size of transported data cannot be made, padding is required to > > yield correct decoded data. > > > > Implementations MUST include appropriate pad characters at the end of > > encoded data unless the specification referring to this document > > explicitly states otherwise. > > > > The first para clarifies that padding is required when the length is not > otherwise known. Only if the length is provided or predefined can the > padding be dropped. > > The second para clarifies it must be included unless the higher level > protocol states otherwise, in which case it is likely using another > mechanism to define length. > > > > It doesn?t seem to me to be as open ended as you implied in your initial > mails, but well-defined depending on whether base64 is being used as spec?d > in the RFC, or being explicitly modified to suit an embedding protocol. > > And certainly the first sentence in this section isn?t intended to be > taken without the context of the rest of the section. > > > > tex > > > > > > > > *From:* Philippe Verdy [mailto:verdy_p at wanadoo.fr] > *Sent:* Monday, October 15, 2018 4:14 AM > *To:* Tex Texin > *Cc:* Adam Borowski; unicode Unicode Discussion > *Subject:* Re: Base64 encoding applied to different unicode texts always > yields different base64 texts ... true or false? > > > > Look into https://tools.ietf.org/html/rfc4648, section 3.2, alinea 1, 1st > sentence, it is explicitly stated : > > > > In some circumstances, the use of padding ("=") in base-encoded data is not required or used. > > > > Le lun. 15 oct. 2018 ? 03:56, Tex a ?crit : > > Philippe, > > > > Where is the use of whitespace or the idea that 1-byte pieces do not need > all the equal sign paddings documented? > > I read the rfc 3501 you pointed at, I don?t see it there. > > > > Are these part of any standards? Or are you claiming these are practices > despite the standards? If so, are these just tolerated by parsers, or are > they actually generated by encoders? > > > > What would be the rationale for supporting unnecessary whitespace? If > linebreaks are forced at some line length they can presumably be removed at > that length and not treated as part of the encoding. > > Maybe we differ on define where the encoding begins and ends, and where > higher level protocols prescribe how they are embedded within the protocol. > > > > Tex > > > > > > > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe > Verdy via Unicode > *Sent:* Sunday, October 14, 2018 1:41 AM > *To:* Adam Borowski > *Cc:* unicode Unicode Discussion > *Subject:* Re: Base64 encoding applied to different unicode texts always > yields different base64 texts ... true or false? > > > > Note that 1-byte pieces do not need to be padded by 2 = signs; only 1 is > enough to indicate the end of an octets-span. The extra = after it do not > add any other octet. and as well you're allowed to insert whitespaces > anywhere in the encoded stream (this is what ensures that the > Base64-encoded octets-stream will not be altered if line breaks are forced > anywhere (notably within the body of emails). 
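To see how one mainstream decoder balances those freedoms, a short sketch against Python's base64 module (its documented default is to discard bytes outside the Base64 alphabet, while padding is handled more strictly):

    import base64, binascii

    # Line breaks forced into the stream, as mail transport does, are harmless here:
    assert base64.b64decode(b"bWVv\r\ndw==\r\n") == b"meow"

    # ...but the same decoder still insists on the "=" padding by default:
    try:
        base64.b64decode(b"bWVvdw")           # same symbols, padding stripped
    except binascii.Error as err:
        print(err)                            # "Incorrect padding" on CPython

So at least this implementation supports the whitespace claim, but not the idea that padding is generally optional.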
> > > > So yes, Base64 is highly flexible, because any whitespace (SPACE, TAB, CR, > LF, NEL) in the middle is non-significant and ignorable on decoding (their > "encoded" bit length is 0 and they don't terminate an octets-span, unlike > "=" which discards extra bits remaining from the encoded stream before that > are not on 8-bit boundaries). > > > > Also: > > - For 1-octets pieces the minimum format is "XX= ", but the 2nd "X" symbol > before "=" can vary in its 4 lowest bits (which are then ignored/discarded > by the "=" symbol) > > - For 2-octets pieces the minimum format is "XXX= ", but the 3rd "X" > symbol before "=" can vary in its 2 lowest bits (which are then > ignored/discarded by the "=" symbol) > > > > So you can use Base64 by encoding each octet in separate pieces, as one > Base64 symbol followed by an "=" symbol, and even insert any number of > whitespaces between them: there's a infinite number of valid Base64 > encodings for representing the same octets-stream payload. > > > > Base64 allows encoding any octets streams but not directly any > bits-streams : it assumes that the effective bits-stream has a binary > length multiple of 8. To encode a bits-stream with an exact number of bits > (not multiple of 8), you need to encode an extra payload to indicate the > effective number of bits to keep at end of the encoded octets-stream (or at > start): > > - Base64 does not specify how you convert a bitstream of arbitrary length > to an octets-stream; > > - for that purpose, you may need to pad the bits-stream at start or at end > with 1 to 6 bits (so that it the resulting bitstream has a length multiple > of 8, then encodable with Base64 which takes only octets on input). > > - these extra padding bits are not significant for the original bitstream, > but are significant for the Base64 encoder/decoder, they will be discarded > by the bitstream decoder built on top of the Base64 decoder, but not by the > Base64 decoder itself. > > > > You need to encode somewhere with the bitstream encoder how many padding > bits (0 to 7) are present at start or end of the octets-stream; this can be > done: > > - as a separate payload (not encoded by Base64), or > > - by prepending 3 bits at start of the bits-stream then padded at end with > 1 to 7 random bits to get a bit-length multiple of 8 suitable for Base64 > encoding. > > - by appending 3 bits at end of the bits-stream, just after 1 to 7 random > bits needed to get a bit-length multiple of 8 suitable for Base64 encoding. > > Finally your bits-stream decoder will be able to use this padding count to > discard these random padding bits (and possibly realign the stream on > different byte-boundaries when the effective bitlength bits-stream payload > is not a multiple of 8 and padding bits were added) > > > > Base64 also does not specify how bits of the original bits-stream payload > are packed into the octets-stream input suitable for Base64-encoding, > notably it does not specify their order and endian-ness. The same remark > applies as well for MIME, HTTP. So lot of network protocols and file > formats need to how to properly encode which possible option is used to > encode bits-streams of arbitrary length, or need to specify which default > choice to apply if this option is not encoded, or which option must be used > (with no possible variation). And this also adds to the number of distinct > encodings that are possible but are still equivalent for the same effective > bits-stream payload. 
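One purely illustrative reading of the bit-stream scheme sketched above, in Python: a 3-bit pad count is prepended, zero-valued filler bits are appended to reach an octet boundary, and the result goes through an ordinary Base64 codec. Nothing here is standardized; the prefix position, the zero filler and the helper names are choices made only for this sketch:

    import base64

    def encode_bits(bits):
        # 'bits' is a string of '0'/'1' of arbitrary length (the bits-stream).
        pad = (-(len(bits) + 3)) % 8              # filler needed to reach a multiple of 8
        stream = format(pad, "03b") + bits + "0" * pad
        octets = bytes(int(stream[i:i + 8], 2) for i in range(0, len(stream), 8))
        return base64.b64encode(octets)

    def decode_bits(data):
        stream = "".join(format(b, "08b") for b in base64.b64decode(data))
        pad = int(stream[:3], 2)                  # read back the filler count
        return stream[3:len(stream) - pad]

    sample = "110100000011"                       # 12 bits, not on an octet boundary
    assert decode_bits(encode_bits(sample)) == sample

The bit order and endianness used here are likewise arbitrary choices, which is the point being made: Base64 itself says nothing about them.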
> > > > All these allowed variations are from the encoder perspective. For > interoperability, the decoder has to be flexible and to support various > options to be compatible with different implementations of the encoder, > notably when the encoder was run on a different system. And this is the > case for the MIME transport by mail, or for HTTP and FTP transports, or > file/media storage formats even if the file is stored on the same system, > because it may actually be a copy stored locally but coming from another > system where the file was actually encoded). > > > > Now if we come back to the encoding of plain-text payloads, Unicode just > specifies the allowed range (from 0 to 0x10FFFF) for scalar values of code > points (it actually does not mandate an exact bit-length because the range > does not fully fit exactly to 21 bits and an encoder can still pack > multiple code points together into more compact code units. > > > > However Unicode provides and standardizes several encodings (UTF-8/16/32) > which use code units whose size is directly suitable as input for an > octets-stream, so that they are directly encodable with Base64, without > having to specify an extra layer for the bits-stream encoder/decoder. > > > > But many other encodings are still possible (and can be conforming to > Unicode, provided they preserve each Unicode scalar value, or at least the > code point identity because an encoder/decoder is not required to support > non-character code points such as surrogates or U+FFFE), where Base64 may > be used for internally generated octets-streams. > > > > > > Le dim. 14 oct. 2018 ? 03:47, Adam Borowski via Unicode < > unicode at unicode.org> a ?crit : > > On Sun, Oct 14, 2018 at 01:37:35AM +0200, Philippe Verdy via Unicode wrote: > > Le sam. 13 oct. 2018 ? 18:58, Steffen Nurpmeso via Unicode < > > unicode at unicode.org> a ?crit : > > > The only variance is described as: > > > > > > Care must be taken to use the proper octets for line breaks if base64 > > > encoding is applied directly to text material that has not been > > > converted to canonical form. In particular, text line breaks must be > > > converted into CRLF sequences prior to base64 encoding. The > > > important thing to note is that this may be done directly by the > > > encoder rather than in a prior canonicalization step in some > > > implementations. > > > > > > This is MIME, it specifies (in the same RFC): > > > > I've not spoken aboutr the encoding of new lines **in the actual encoded > > text**: > > - if their existing text-encoding ever gets converted to Base64 as if > the > > whole text was an opaque binary object, their initial text-encoding will > be > > preserved (so yes it will preserve the way these embedded newlines are > > encoded as CR, LF, CR+LF, NL...) > > > > I spoke about newlines used in the transport syntax to split the initial > > binary object (which may actually contain text but it does not matter). > > MIME defines this operation and even requires splitting the binary object > > in fragments with maximum binary size so that these binary fragments can > be > > converted with Base64 into lines with maximum length. In the MIME Base64 > > representation you can insert newlines anywhere between fragments encoded > > separately. > > There's another kind of fragmentation that can make the encoding differ > (but > still decode to the same payload): > > The data stream gets split into 3-byte internal, 4-byte external packets. 
> Any packet may contain less than those 3 bytes, in which cases it is padded > with = characters: > 3 bytes XXXX > 2 bytes XXX= > 1 byte XX== > > Usually, such smaller packets happen only at the end of a message, but to > support encoding a stream piecewise, they are allowed at any point. > > For example: > "meow" is bWVvdw== > "me""ow" is bWU=b3c= > yet both carry the same payload. > > > Base64 is used exactly to support this flexibility in transport (or > > storage) without altering any bit of the initial content once it is > > decoded. > > Right, any such variations are in packaging only. > > > ???? > -- > ??????? > ??????? 10 people enter a bar: 1 who understands binary, > ??????? 1 who doesn't, D who prefer to write it as hex, > ??????? and 1 who narrowly avoided an off-by-one error. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 15 14:26:24 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Mon, 15 Oct 2018 21:26:24 +0200 Subject: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false? In-Reply-To: References: <20181013165019.sxGzV%steffen@sdaoden.eu> <20181014013904.idfomqt5s65wnqro@angband.pl> <000601d4642a$4274ec70$c75ec550$@xencraft.com> <002801d46487$4821e350$d865a9f0$@xencraft.com> Message-ID: <20181015192624.pY-ze%steffen@sdaoden.eu> Philippe Verdy via Unicode wrote in : |Padding itself does not clearly indicate the length. | |It's an artefact that **may** be infered only in some other layers \ |of protocols which specify when and how padding is needed (and how \ |many padding bytes |are required or accepted), it works only if these upper layer protocols \ |are using **octets** streams, but it is still not usable for more general |bitstreams (with arbitrary bit lengths). | |This RFC does not mandate/require these padding bytes and in fact many \ |upper layer protocols do not ever need it (including UTF-7 for example), \ |they are |never necessary to infer a length in octets and insufficient for specify\ |ing a length in bits. | |As well the usage in MIME (where there's a requirement that lines of \ |headers or in the content body is limited to 1000 bytes) requires free \ |splitting of |Base64 (there's no agreed maximum length, some sources insist it should \ |not be more than 72 bytes, others use 80 bytes, but mail forwarding \ |may add other |characters at start of lines, forcing them to be shorter (leaving for \ |example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): \ |this means that |padding may not be used where one would expect them, and padding can \ |event occur in the middle of the encoded stream (not just at end) along \ That was actually a bug in my MUA. Other MUAs were not capable of decoding this correctly. Sorry :-(!! |with other |whitespaces or separators (like "> " at start of lines in cited messages). In fact garbage bytes may be embedded explicitly says MIME. Most handle that right, and skip (silently, maybe not right), but some explicit base64 decoders fail miserably when such things are seen (openssl base64, NetBSD base64 decoder (current)), others do not (busybox base64, for example). 
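The split described here can be reproduced inside a single library: Python's base64.b64decode exposes both behaviours through its validate flag (a minimal sketch; the openssl/NetBSD/busybox observation above is taken from the message as given, not re-tested here):

    import base64, binascii

    noisy = b"> bWVv\n> dw==\n"    # Base64 with citation markers and line breaks mixed in

    # Lenient behaviour: bytes outside the Base64 alphabet are silently dropped.
    assert base64.b64decode(noisy) == b"meow"

    # Strict behaviour: the very same input is rejected outright.
    try:
        base64.b64decode(noisy, validate=True)
    except binascii.Error:
        print("strict decoding rejects the embedded garbage")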
--steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Oct 15 14:57:25 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 15 Oct 2018 20:57:25 +0100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> Message-ID: <20181015205725.38772e05@JRWUBU2> On Tue, 16 Oct 2018 02:47:36 +1100 Harshula via Unicode wrote: > Note, touching letters are formed by , so they should > not be displayed as a fallback for conjuncts. I don't follow that. While the conjuncts with r-, -r and -y are very different to pairs of touching letters, the conjuncts for tth, nd, ndr, ndh, kv and tv would be very similar to the hypothetical corresponding touching letters and quite different to the fallbacks with visible al-lakuna. Richard. From unicode at unicode.org Mon Oct 15 19:59:54 2018 From: unicode at unicode.org (Harshula via Unicode) Date: Tue, 16 Oct 2018 11:59:54 +1100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <20181015205725.38772e05@JRWUBU2> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> <20181015205725.38772e05@JRWUBU2> Message-ID: <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au> Hi Richard, On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote: > On Tue, 16 Oct 2018 02:47:36 +1100 > Harshula via Unicode wrote: > >> Note, touching letters are formed by , so they should >> not be displayed as a fallback for conjuncts. > > I don't follow that. While the conjuncts with r-, -r and -y are very > different to pairs of touching letters, the conjuncts for tth, nd, ndr, > ndh, kv and tv would be very similar to the hypothetical corresponding > touching letters and quite different to the fallbacks with visible > al-lakuna. If you haven't already, it's best you read SLS 1134:2011: http://www.language.lk/en/download/standards/ or the older SLS 1134:2004: http://unicode.org/wg2/docs/n2737.pdf cya, # From unicode at unicode.org Mon Oct 15 22:29:36 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 Oct 2018 04:29:36 +0100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> <20181015205725.38772e05@JRWUBU2> <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au> Message-ID: <20181016042936.21ce4fc9@JRWUBU2> On Tue, 16 Oct 2018 11:59:54 +1100 Harshula via Unicode wrote: > Hi Richard, > > On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote: > > On Tue, 16 Oct 2018 02:47:36 +1100 > > Harshula via Unicode wrote: > > > >> Note, touching letters are formed by , so they > >> should not be displayed as a fallback for > >> conjuncts. > > > > I don't follow that. 
While the conjuncts with r-, -r and -y are > > very different to pairs of touching letters, the conjuncts for tth, > > nd, ndr, ndh, kv and tv would be very similar to the hypothetical > > corresponding touching letters and quite different to the fallbacks > > with visible al-lakuna. > > If you haven't already, it's best you read SLS 1134:2011: > http://www.language.lk/en/download/standards/ > > or the older SLS 1134:2004: > http://unicode.org/wg2/docs/n2737.pdf The latter actually says, in Section 5.8, that may be used for either! I suspect that that is a printing error. The Sri Lankan standard simply assumes that the rendering system can accommodate what is requested in the backing store. It says nothing about fallbacks. So, if the user specifies the the syllable ddho written with a conjunct and encoded as ????? but the conjunct is missing from the fonts' repertoires, why is it right to display it with al-lakuna as though it were ???? but wrong to display it with the touching letters encoded as ??????? There are three different correct ways of writing 'ddho', but many systems only support one of them (and some weirdly use a fourth method). Richard. From unicode at unicode.org Tue Oct 16 06:00:18 2018 From: unicode at unicode.org (Harshula via Unicode) Date: Tue, 16 Oct 2018 22:00:18 +1100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: <20181016042936.21ce4fc9@JRWUBU2> References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> <20181015205725.38772e05@JRWUBU2> <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au> <20181016042936.21ce4fc9@JRWUBU2> Message-ID: Hi Richard, On 16/10/18 2:29 pm, Richard Wordingham via Unicode wrote: > On Tue, 16 Oct 2018 11:59:54 +1100 > Harshula via Unicode wrote: > >> Hi Richard, >> >> On 16/10/18 6:57 am, Richard Wordingham via Unicode wrote: >>> On Tue, 16 Oct 2018 02:47:36 +1100 >>> Harshula via Unicode wrote: >>> >>>> Note, touching letters are formed by , so they >>>> should not be displayed as a fallback for >>>> conjuncts. >>> >>> I don't follow that. While the conjuncts with r-, -r and -y are >>> very different to pairs of touching letters, the conjuncts for tth, >>> nd, ndr, ndh, kv and tv would be very similar to the hypothetical >>> corresponding touching letters and quite different to the fallbacks >>> with visible al-lakuna. >> >> If you haven't already, it's best you read SLS 1134:2011: >> http://www.language.lk/en/download/standards/ >> >> or the older SLS 1134:2004: >> http://unicode.org/wg2/docs/n2737.pdf > > The latter actually says, in Section 5.8, that may be > used for either! I suspect that that is a printing error. The former (SLS1134:2011) has a section for Touching letters. It is explicitly stated to use for Touching letters. Sorry, the file n2737.pdf hosted on unicode.org appears to be a draft. It is not the final SLS1134:2004. The final contains a section on Touching letters like SLS1134:2011. > The Sri Lankan standard simply assumes that the rendering system can > accommodate what is requested in the backing store. It says nothing > about fallbacks. So, if the user specifies the the syllable ddho > written with a conjunct and encoded as ????? but the conjunct is > missing from the fonts' repertoires, why is it right to display it with > al-lakuna as though it were ???? but wrong to display it with the > touching letters encoded as ??????? 
There are three different > correct ways of writing 'ddho', but many systems only support one of > them (and some weirdly use a fourth method). When a font is missing a glyph during an *explicit* conjunct lookup, it appears the most accurate solution is to display the missing glyph symbol. cya, # From unicode at unicode.org Tue Oct 16 19:04:19 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 17 Oct 2018 01:04:19 +0100 Subject: Fallback for Sinhala Consonant Clusters In-Reply-To: References: <20181014010259.4fb5436a@JRWUBU2> <0974867c-3fbb-a368-b4e3-de3ec6e60dac@hj.id.au> <20181015085359.339c5747@JRWUBU2> <0875128d-9858-29de-bef2-c0d5b6032c3c@hj.id.au> <20181015205725.38772e05@JRWUBU2> <78bc7c53-8eee-4236-0023-45c7f243cee2@hj.id.au> <20181016042936.21ce4fc9@JRWUBU2> Message-ID: <20181017010419.704f5283@JRWUBU2> On Tue, 16 Oct 2018 22:00:18 +1100 Harshula via Unicode wrote: > When a font is missing a glyph during an *explicit* conjunct lookup, > it appears the most accurate solution is to display the missing glyph > symbol. However, I don't believe that that is the most useful solution, and it certainly isn't when composing 'plain text'. Now, if one is composing the font that will be used to read the text as one writes the text, it may have some benefit; it may also have some benefit if one can select the font that will be used, and a suitable font is available. Richard. From unicode at unicode.org Sat Oct 27 06:10:20 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Oct 2018 13:10:20 +0200 Subject: A sign/abbreviation for "magister" Message-ID: <86tvl7tzkz.fsf@mimuw.edu.pl> Hi! On the over 100 years old postcard https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 you can see 2 occurences of a symbol which is explicitely explained (in Polish) as meaning "Magister". First question is: how do you interpret the symbol? For me it is definitely the capital M followed by the superscript "r" (written in an old style no longer used in Poland), but there is something below the superscript. It looks like a small "z", but such an interpretation doesn't make sense for me. The second question is: are you familiar with such or a similar symbol? Have you ever seen it in print? The third and the last question is: how to encode this symbol in Unicode? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sat Oct 27 07:36:59 2018 From: unicode at unicode.org (rein via Unicode) Date: Sat, 27 Oct 2018 14:36:59 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: <86tvl7tzkz.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: Janusz, reminds me of the "numero sign " № I tried to read the letter but couldn't manage to all the way ;) Droga i Kochana Wiria?ko za?aczam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?. co u Was wi?cej s?ycha? ?adnych jeszcze ni mam odpowiedzi ze znanych Ci miejscowoci ?adresowa?? do Staszki jak ty? chcia?a pisa? (W.Pan Mr Micha? Ga?kiewicz Feldspital 411 Feldpost 380.) Mr znaczy Magister. On przy tem szpitalu aptekarzem. ca?uj? Ci? ze wargatkiem Mami r?czki Tw?j Kochaj?cy W?odek 12/9 917 pozdrawiam, Rein Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode : > > Hi! > > On the over 100 years old postcard > > https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 > > you can see 2 occurences of a symbol which is explicitely explained (in > Polish) as meaning "Magister". 
> > First question is: how do you interpret the symbol? For me it is > definitely the capital M followed by the superscript "r" (written in an > old style no longer used in Poland), but there is something below the > superscript. It looks like a small "z", but such an interpretation > doesn't make sense for me. > > The second question is: are you familiar with such or a similar symbol? > Have you ever seen it in print? > > The third and the last question is: how to encode this symbol in > Unicode? > > Best regards > > Janusz > -- Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 07:58:38 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Sat, 27 Oct 2018 05:58:38 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <86tvl7tzkz.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 08:09:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Oct 2018 15:09:35 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: (rein's message of "Sat, 27 Oct 2018 14:36:59 +0200") References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: <867ei3sfhs.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 14:36 +0200, rein wrote: > Janusz, > > reminds me of the "numero sign " № Yes, that's definitely similar. > > I tried to read the letter but couldn't manage to all the way ;) Congratulation, you have done it better than me! > > Droga i Kochana Wiria?ko Rather "Wisie?ko": "Ludwika" -> "Ludwisie?ka" ->"Wisie?ka" > > za?aczam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?. co > u Was wi?cej s?ycha? ?adnych jeszcze ni mam odpowiedzi I didn't recognized "odpowiedzi". >ze znanych Ci miejscowoci ?adresowa?? "Adresowa?" makes sense, although some letters seem missing. > do Staszki jak ty? chcia?a pisa? >(W.Pan Mr Micha? Ga?kiewicz Feldspital 411 Feldpost 380.) Mr znaczy >Magister. On przy tem szpitalu aptekarzem. ca?uj? Ci? ze wargatkiem I read this "wszystkiemi". >Mami I can't guess a word which would make sense of this phrase... > r?czki Tw?j Kochaj?cy W?odek 12/9 917 > > pozdrawiam, Rein Nawzajem :-) > > Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode : [...] >> The second question is: are you familiar with such or a similar symbol? >> Have you ever seen it in print? The postcard is from the front of the first WW written by an Austro-Hungarian soldier. He explaines the meaning of the abbreviation to his wife, so looks like the abbreviation was used but not very popular. >> >> The third and the last question is: how to encode this symbol in >> Unicode? I've got a comment to this question off the list, but I'm waiting to see more opinions. Best regards Janusz P.S. I subscribe the list in the digest form but I look up the archive - I think Asmus Freytag interpretation is the correct one (similar interpretation was suggested also of the list). -- , Janusz S. 
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sat Oct 27 09:32:49 2018 From: unicode at unicode.org (rein via Unicode) Date: Sat, 27 Oct 2018 16:32:49 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: <867ei3sfhs.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <867ei3sfhs.fsf@mimuw.edu.pl> Message-ID: Janusz, "wszystkimi m(oj)ami r?czki" some sort of plural instrumentalis :) "embracing you with all my hands/arms" pozdrawiam, Rein Op Sat, 27 Oct 2018 15:09:35 +0200 schreef Janusz S. Bie? : > On Sat, Oct 27 2018 at 14:36 +0200, rein wrote: >> Janusz, >> >> reminds me of the "numero sign " № > > Yes, that's definitely similar. > >> >> I tried to read the letter but couldn't manage to all the way ;) > > Congratulation, you have done it better than me! > >> >> Droga i Kochana Wiria?ko > > Rather "Wisie?ko": "Ludwika" -> "Ludwisie?ka" ->"Wisie?ka" > >> >> za??czam Ci z t? fotografij? list Staszki - odpisa?em ju? jej te?. co >> u Was wi?cej s?ycha? ?adnych jeszcze ni mam odpowiedzi > > I didn't recognized "odpowiedzi". > >> ze znanych Ci miejscowoci ?adresowa?? > > "Adresowa?" makes sense, although some letters seem missing. > >> do Staszki jak ty? chcia?a pisa? >> (W.Pan Mr Micha? Ga?kiewicz Feldspital 411 Feldpost 380.) Mr znaczy >> Magister. On przy tem szpitalu aptekarzem. ca?uj? Ci? ze wargatkiem > > I read this "wszystkiemi". > >> Mami > > I can't guess a word which would make sense of this phrase... > >> r?czki Tw?j Kochaj?cy W?odek 12/9 917 >> >> pozdrawiam, Rein > > Nawzajem :-) > >> >> Sat, 27 Oct 2018 13:10:20 +0200 schreef Janusz S. Bie? via Unicode >> : > > [...] > >>> The second question is: are you familiar with such or a similar symbol? >>> Have you ever seen it in print? > > The postcard is from the front of the first WW written by an > Austro-Hungarian soldier. He explaines the meaning of the abbreviation > to his wife, so looks like the abbreviation was used but not very > popular. > >>> >>> The third and the last question is: how to encode this symbol in >>> Unicode? > > I've got a comment to this question off the list, but I'm waiting to see > more opinions. > > Best regards > > Janusz > > P.S. I subscribe the list in the digest form but I look up the archive - > I think Asmus Freytag interpretation is the correct one (similar > interpretation was suggested also of the list). > -- Gemaakt met Opera's e-mailprogramma: http://www.opera.com/mail/ From unicode at unicode.org Sat Oct 27 09:53:56 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Oct 2018 16:53:56 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: (rein's message of "Sat, 27 Oct 2018 16:32:49 +0200") References: <86tvl7tzkz.fsf@mimuw.edu.pl> <867ei3sfhs.fsf@mimuw.edu.pl> Message-ID: <86a7mzphiz.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 16:32 +0200, rein wrote: > Janusz, > > "wszystkimi m(oj)ami r?czki" some sort of plural instrumentalis :) Rather "moimi", although still the phrase sounds strange. > "embracing you with all my hands/arms" Now "kiss" (ca?owa?) and "embrace" (obejmowa?) are strictly separated, but perhaps 100 years ago it was differently. Bess regards Janusz P.S. This discussion is completely of the topic of the list, but I'm very greatful for the help received on and off the list. -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sat Oct 27 12:25:02 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Oct 2018 19:25:02 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: (Asmus Freytag via Unicode's message of "Sat, 27 Oct 2018 05:58:38 -0700") References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: <86bm7fnvyp.fsf@mimuw.edu.pl> On Sat, Oct 27 2018 at 5:58 -0700, Asmus Freytag via Unicode wrote: [...] > My suspicion would be that the small "z" is rather a "=" that acquired > a connecting stroke as part of quick handwriting. You must be right. In the meantime I looked up some other postcards written by the same person i found several other abbreviation including ? 'NUMERO SIGN' (U+2116) written in the same way, i.e. with a double instead of a single line. So we have a consensus about how to interpret the sign, but there are still open questions about the scope of its usage, and its encoding. Thanks one again to all who contributed to the discussion. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sat Oct 27 12:35:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 27 Oct 2018 19:35:17 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: Le sam. 27 oct. 2018 ? 15:06, Asmus Freytag via Unicode a ?crit : > First question is: how do you interpret the symbol? For me it is > definitely the capital M followed by the superscript "r" (written in an > old style no longer used in Poland), but there is something below the > superscript. It looks like a small "z", but such an interpretation > doesn't make sense for me. > > My suspicion would be that the small "z" is rather a "=" that acquired a > connecting stroke as part of quick handwriting. > I have the same kind of reading, the zigzagging stroek is an hnadwritten emphasis of the uperscript r above it (explicitly noting it is terminating the abbreviation), jut like the small underline that happens sometimes below the superscript o in the abbreviation of "numero" (as well sometimes there was not just one but two small underlines, including in some prints). This sample is a perfect example of fast cursive handwritting (due to high variability of all other letter shapes, sizes and joinings, where even the capital M is written as two unconnected strokes), and it's not abnormal to see in such condition this cursive joining between the two underlining strokes so that it looks like a single zigzag. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 14:52:32 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 27 Oct 2018 19:52:32 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Mr? / M=? An image search for "magister symbol" finds many interesting graphics, but I couldn't find any resembling the abreviation shown on the post card.? (Magister symbol appears to be popular for certain religious and gaming uses.) From unicode at unicode.org Sat Oct 27 20:59:31 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 02:59:31 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: Do you speak about this one? 
https://www.magisterdaire.com/magister-symbol-black-sq/ It looks like a graphic personal signature for the author of this esoteric book, even if it looks like an interesting composition of several of our existing Unicode symbols, glued together in a vertical ligature, rather than a pure combining sequence. Such technics can be used extensively to create lot of other symbols, by gluing any kind of wellknown glyphs for standard characters. Mathematics and technologies (but also companies for their private corporate logos and branding marks) are constantly inventing new symbols like this. Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode a ?crit : > > Mr? / M=? > > An image search for "magister symbol" finds many interesting graphics, > but I couldn't find any resembling the abreviation shown on the post > card. (Magister symbol appears to be popular for certain religious and > gaming uses.) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 21:40:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 03:40:58 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: More interesting: the Masonic alphabet http://tallermasonico.com/0diccio1.htm - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J and K), are disposed by group of 2 letters in a 3x3 square grid, whose global outer sides are not marked on the outer border of the grid but on lines separating columns or rows. Then letters are noted by the marked sides of the square in which they are located, the second letter of the group being distinguished by adding a dot in the middle of the square. - The 4 other letters U to Z (excluding V and W) are noted by disposing them on a 2x2 square grid (this time rotated 45 degrees), whose global outer sides are also not marked on the outer border of the grid but on lines separating columns or rows (only 1 letter is places by cell). They are also noted by the marked sides of their square only.- Finally (if needed) the missing letters J, K, V, W use the same 4 last glyphs, but are distinguished by adding the central dot. AB | CD | EF ------+-----+----- GH | I L | MN ------+-----+----- OP | QR | ST \ XK / UJ > < WZ / YV \ So: - "A" becomes approximately "_|" - "B" becomes approximately "_|" with central dot - "U" becomes approximately ">" - "X" becomes approximately "\/" - "J" is noted like "I" as a square, or distinctly approximately as ">" with a central dot The 3x3 grid had some esoterical meaning based on numerology (a legend now propaged by scientology). Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy a ?crit : > Do you speak about this one? > https://www.magisterdaire.com/magister-symbol-black-sq/ > It looks like a graphic personal signature for the author of this esoteric > book, even if it looks like an interesting composition of several of our > existing Unicode symbols, glued together in a vertical ligature, rather > than a pure combining sequence. > Such technics can be used extensively to create lot of other symbols, by > gluing any kind of wellknown glyphs for standard characters. > Mathematics and technologies (but also companies for their private > corporate logos and branding marks) are constantly inventing new symbols > like this. > > > Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode > a ?crit : > >> >> Mr? / M=? 
>> >> An image search for "magister symbol" finds many interesting graphics, >> but I couldn't find any resembling the abreviation shown on the post >> card. (Magister symbol appears to be popular for certain religious and >> gaming uses.) >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 22:02:55 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 04:02:55 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: I must add that the Masonic 3x3 grid alphabet has been proposed as an alternative to Braille, easier to learn and memoize, easier and faster to draw with a pen on paper without any physical guide, and easier also to recognize using only tactile contact by a finger tip, but more difficult to form without cutting the sheet of paper while tracing the strokes. But it was seen on some manufactured Masonic objects. To note digits with the same shapes (like does Braille with its 2x3 dots grid), the same 3x3 grid is used for digits 1 to 9 (digit 0 uses the same square where it is significant as 5, but with a central dot, or use a space), but additional symbols "+" and "-" are used (without central dot) to switch between letters and digits. The placement of digits 1 to 9 (except 0 and 5) on the 3x3 grid varies (horizontally first, or vertically first). Le dim. 28 oct. 2018 ? 03:40, Philippe Verdy a ?crit : > More interesting: the Masonic alphabet > http://tallermasonico.com/0diccio1.htm > > - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J > and K), are disposed by group of 2 letters in a 3x3 square grid, whose > global outer sides are not marked on the outer border of the grid but on > lines separating columns or rows. Then letters are noted by the marked > sides of the square in which they are located, the second letter of the > group being distinguished by adding a dot in the middle of the square. > - The 4 other letters U to Z (excluding V and W) are noted by disposing > them on a 2x2 square grid (this time rotated 45 degrees), whose global > outer sides are also not marked on the outer border of the grid but on > lines separating columns or rows (only 1 letter is places by cell). > They are also noted by the marked sides of their square only.- Finally (if > needed) the missing letters J, K, V, W use the same 4 last glyphs, but are > distinguished by adding the central dot. > > > AB | CD | EF > ------+-----+----- > GH | I L | MN > ------+-----+----- > OP | QR | ST > > \ XK / > UJ > < WZ > / YV \ > > > So: > - "A" becomes approximately "_|" > - "B" becomes approximately "_|" with central dot > - "U" becomes approximately ">" > - "X" becomes approximately "\/" > - "J" is noted like "I" as a square, or distinctly approximately as ">" > with a central dot > > The 3x3 grid had some esoterical meaning based on numerology (a legend now > propaged by scientology). > > > Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy a > ?crit : > >> Do you speak about this one? >> https://www.magisterdaire.com/magister-symbol-black-sq/ >> It looks like a graphic personal signature for the author of this >> esoteric book, even if it looks like an interesting composition of several >> of our existing Unicode symbols, glued together in a vertical ligature, >> rather than a pure combining sequence. 
>> Such technics can be used extensively to create lot of other symbols, by >> gluing any kind of wellknown glyphs for standard characters. >> Mathematics and technologies (but also companies for their private >> corporate logos and branding marks) are constantly inventing new symbols >> like this. >> >> >> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode >> a ?crit : >> >>> >>> Mr? / M=? >>> >>> An image search for "magister symbol" finds many interesting graphics, >>> but I couldn't find any resembling the abreviation shown on the post >>> card. (Magister symbol appears to be popular for certain religious and >>> gaming uses.) >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 22:12:06 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sat, 27 Oct 2018 20:12:06 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: I learned that one as a kid, as the "pigpen cipher". I'm not aware of any numerological significance (which is easy enough to "find" in anything). On Sat, Oct 27, 2018 at 7:43 PM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > More interesting: the Masonic alphabet > http://tallermasonico.com/0diccio1.htm > > - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J > and K), are disposed by group of 2 letters in a 3x3 square grid, whose > global outer sides are not marked on the outer border of the grid but on > lines separating columns or rows. Then letters are noted by the marked > sides of the square in which they are located, the second letter of the > group being distinguished by adding a dot in the middle of the square. > - The 4 other letters U to Z (excluding V and W) are noted by disposing > them on a 2x2 square grid (this time rotated 45 degrees), whose global > outer sides are also not marked on the outer border of the grid but on > lines separating columns or rows (only 1 letter is places by cell). > They are also noted by the marked sides of their square only.- Finally (if > needed) the missing letters J, K, V, W use the same 4 last glyphs, but are > distinguished by adding the central dot. > > > AB | CD | EF > ------+-----+----- > GH | I L | MN > ------+-----+----- > OP | QR | ST > > \ XK / > UJ > < WZ > / YV \ > > > So: > - "A" becomes approximately "_|" > - "B" becomes approximately "_|" with central dot > - "U" becomes approximately ">" > - "X" becomes approximately "\/" > - "J" is noted like "I" as a square, or distinctly approximately as ">" > with a central dot > > The 3x3 grid had some esoterical meaning based on numerology (a legend now > propaged by scientology). > > > Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy a > ?crit : > >> Do you speak about this one? >> https://www.magisterdaire.com/magister-symbol-black-sq/ >> It looks like a graphic personal signature for the author of this >> esoteric book, even if it looks like an interesting composition of several >> of our existing Unicode symbols, glued together in a vertical ligature, >> rather than a pure combining sequence. >> Such technics can be used extensively to create lot of other symbols, by >> gluing any kind of wellknown glyphs for standard characters. >> Mathematics and technologies (but also companies for their private >> corporate logos and branding marks) are constantly inventing new symbols >> like this. >> >> >> Le sam. 27 oct. 2018 ? 
22:01, James Kass via Unicode >> a ?crit : >> >>> >>> Mr? / M=? >>> >>> An image search for "magister symbol" finds many interesting graphics, >>> but I couldn't find any resembling the abreviation shown on the post >>> card. (Magister symbol appears to be popular for certain religious and >>> gaming uses.) >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 22:16:55 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 04:16:55 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: So in summary this Masonic "alphabet" uses 13 square "letters" and a single combining mark (the central dot), possibly extended with the minus and plus signs and space. It's possible that the central dot is used as a spacing mark to note a punctuation. The assignment of Latin (or Hebrew) letters to this alphabet varies (just like Braille symbols depending on languages/scripts) It may have extensions (like Braille outside its basic 2x3 patterns of dots), such as a second dot in squares, horizontally as "??" or vertically as ":" Le dim. 28 oct. 2018 ? 03:40, Philippe Verdy a ?crit : > More interesting: the Masonic alphabet > http://tallermasonico.com/0diccio1.htm > > - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J > and K), are disposed by group of 2 letters in a 3x3 square grid, whose > global outer sides are not marked on the outer border of the grid but on > lines separating columns or rows. Then letters are noted by the marked > sides of the square in which they are located, the second letter of the > group being distinguished by adding a dot in the middle of the square. > - The 4 other letters U to Z (excluding V and W) are noted by disposing > them on a 2x2 square grid (this time rotated 45 degrees), whose global > outer sides are also not marked on the outer border of the grid but on > lines separating columns or rows (only 1 letter is places by cell). > They are also noted by the marked sides of their square only.- Finally (if > needed) the missing letters J, K, V, W use the same 4 last glyphs, but are > distinguished by adding the central dot. > > > AB | CD | EF > ------+-----+----- > GH | I L | MN > ------+-----+----- > OP | QR | ST > > \ XK / > UJ > < WZ > / YV \ > > > So: > - "A" becomes approximately "_|" > - "B" becomes approximately "_|" with central dot > - "U" becomes approximately ">" > - "X" becomes approximately "\/" > - "J" is noted like "I" as a square, or distinctly approximately as ">" > with a central dot > > The 3x3 grid had some esoterical meaning based on numerology (a legend now > propaged by scientology). > > > Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy a > ?crit : > >> Do you speak about this one? >> https://www.magisterdaire.com/magister-symbol-black-sq/ >> It looks like a graphic personal signature for the author of this >> esoteric book, even if it looks like an interesting composition of several >> of our existing Unicode symbols, glued together in a vertical ligature, >> rather than a pure combining sequence. >> Such technics can be used extensively to create lot of other symbols, by >> gluing any kind of wellknown glyphs for standard characters. >> Mathematics and technologies (but also companies for their private >> corporate logos and branding marks) are constantly inventing new symbols >> like this. >> >> >> Le sam. 27 oct. 2018 ? 
22:01, James Kass via Unicode >> a ?crit : >> >>> >>> Mr? / M=? >>> >>> An image search for "magister symbol" finds many interesting graphics, >>> but I couldn't find any resembling the abreviation shown on the post >>> card. (Magister symbol appears to be popular for certain religious and >>> gaming uses.) >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Oct 27 22:29:26 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 04:29:26 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <9738cab6-ad28-0096-6b4e-a04b6724159b@gmail.com> Message-ID: If it was encoded in Unicode, it would use a single column and the encoding seems evident: x0 = MASONIC SQUARE SPACE x1 = MASONIC SYMBOL A B OR ONE x2 = MASONIC SYMBOL C D OR TWO x3 = MASONIC SYMBOL E F OR THREE x4 = MASONIC SYMBOL G H OR FOUR x5 = MASONIC SYMBOL I L OR ZERO FIVE x6 = MASONIC SYMBOL M N OR SIX x7 = MASONIC SYMBOL O P OR SEVEN x8 = MASONIC SYMBOL Q R OR EIGHT x9 = MASONIC SYMBOL S T OR NINE xA = MASONIC SYMBOL U J xB = MASONIC SYMBOL X K xC = MASONIC SYMBOL Y V xD = MASONIC SYMBOL Z W xE = MASONIC COMBINING DOT xF = MASONIC COMBINING DOUBLE DOT (?) Le dim. 28 oct. 2018 ? 04:21, Garth Wallace via Unicode a ?crit : > I learned that one as a kid, as the "pigpen cipher". I'm not aware of any > numerological significance (which is easy enough to "find" in anything). > > On Sat, Oct 27, 2018 at 7:43 PM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> More interesting: the Masonic alphabet >> http://tallermasonico.com/0diccio1.htm >> >> - 18 letters of the Latin alphabet (or Hebrew), from A to T (excluding J >> and K), are disposed by group of 2 letters in a 3x3 square grid, whose >> global outer sides are not marked on the outer border of the grid but on >> lines separating columns or rows. Then letters are noted by the marked >> sides of the square in which they are located, the second letter of the >> group being distinguished by adding a dot in the middle of the square. >> - The 4 other letters U to Z (excluding V and W) are noted by disposing >> them on a 2x2 square grid (this time rotated 45 degrees), whose global >> outer sides are also not marked on the outer border of the grid but on >> lines separating columns or rows (only 1 letter is places by cell). >> They are also noted by the marked sides of their square only.- Finally (if >> needed) the missing letters J, K, V, W use the same 4 last glyphs, but are >> distinguished by adding the central dot. >> >> >> AB | CD | EF >> ------+-----+----- >> GH | I L | MN >> ------+-----+----- >> OP | QR | ST >> >> \ XK / >> UJ > < WZ >> / YV \ >> >> >> So: >> - "A" becomes approximately "_|" >> - "B" becomes approximately "_|" with central dot >> - "U" becomes approximately ">" >> - "X" becomes approximately "\/" >> - "J" is noted like "I" as a square, or distinctly approximately as ">" >> with a central dot >> >> The 3x3 grid had some esoterical meaning based on numerology (a legend >> now propaged by scientology). >> >> >> Le dim. 28 oct. 2018 ? 02:59, Philippe Verdy a >> ?crit : >> >>> Do you speak about this one? 
>>> https://www.magisterdaire.com/magister-symbol-black-sq/ >>> It looks like a graphic personal signature for the author of this >>> esoteric book, even if it looks like an interesting composition of several >>> of our existing Unicode symbols, glued together in a vertical ligature, >>> rather than a pure combining sequence. >>> Such technics can be used extensively to create lot of other symbols, by >>> gluing any kind of wellknown glyphs for standard characters. >>> Mathematics and technologies (but also companies for their private >>> corporate logos and branding marks) are constantly inventing new symbols >>> like this. >>> >>> >>> Le sam. 27 oct. 2018 ? 22:01, James Kass via Unicode < >>> unicode at unicode.org> a ?crit : >>> >>>> >>>> Mr? / M=? >>>> >>>> An image search for "magister symbol" finds many interesting graphics, >>>> but I couldn't find any resembling the abreviation shown on the post >>>> card. (Magister symbol appears to be popular for certain religious and >>>> gaming uses.) >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 28 03:13:26 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 28 Oct 2018 08:13:26 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> Message-ID: <20181028081326.264dc079@JRWUBU2> On Sat, 27 Oct 2018 05:58:38 -0700 Asmus Freytag via Unicode wrote: > On 10/27/2018 4:10 AM, Janusz S. Bie? via Unicode wrote: >> you can see 2 occurences of a symbol which is explicitely explained >> (in Polish) as meaning "Magister". >> First question is: how do you interpret the symbol? For me it is >> definitely the capital M followed by the superscript "r" (written in >> an old style no longer used in Poland), but there is something below >> the superscript. It looks like a small "z", but such an >> interpretation >> doesn't make sense for me. >> The second question is: are you familiar with such or a similar >> symbol? Have you ever seen it in prin> >> The third and the last question is: how to encode this symbol in >> Unicode? > My suspicion would be that the small "z" is rather a "=" that > acquired a connecting stroke as part of quick handwriting. The notation is a quite widespread format for abbreviations. the first letter is normal sized, and the subsequent letter is written in some variety of superscript with a squiggle underneath so that it doesn't get overlooked. I have deduced that this is not plain text because there is no encoding mechanism for it. For example, our lecturers would frequently use this treatment to abbreviate function as 'fn' with the 'n' superscript and supported by a squiggle below sitting on the baseline. The squiggle below has meaning; it marks the word as an abbreviation. Richard. From unicode at unicode.org Sun Oct 28 03:32:11 2018 From: unicode at unicode.org (arno.schmitt via Unicode) Date: Sun, 28 Oct 2018 09:32:11 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181028081326.264dc079@JRWUBU2> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> Message-ID: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode: > The notation is a quite widespread format for abbreviations. the > first letter is normal sized, and the subsequent letter is written in > some variety of superscript with a squiggle underneath so that it > doesn't get overlooked. 
I have deduced that this is not plain text > because there is no encoding mechanism for it. For example, our > lecturers would frequently use this treatment to abbreviate function > as 'fn' with the 'n' superscript and supported by a squiggle below > sitting on the baseline. The squiggle below has meaning; it marks the > word as an abbreviation. > > Richard. Looks to me like U+2116 № NUMERO SIGN which perhaps should not have been encoded, since we have both U+004E LATIN CAPITAL LETTER N and U+00BA º MASCULINE ORDINAL INDICATOR Arn0 
From unicode at unicode.org Sun Oct 28 09:19:50 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 15:19:50 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> Message-ID: Given that the "squiggle" below letters is actually given distinctive semantics, I think it should be encoded as a combining character (to be written not after a "superscript" but after any normal base letter, possibly with other combining characters, or CGJ if needed because of the compatibility equivalence). That "squiggle" (which may look like an underscore) would have the effect of implicitly making the base letter superscript (smaller and elevated). It would probably have a "combining below" class. In that case U+2116 № is perfectly encodable, but still distinct, because "№" does not require this mark (so there's no problem of stability with canonical equivalences, even if this creates new possible confusable pairs when the mark is used after a normal letter: the risk of confusion only exists for "№", which is a legacy non-decomposable ligature but has an existing compatibility equivalence, just like all other subscript letters). In that case we have other ways to note *semantically* any abbreviations using distinctive final letters (including for N abbreviating "Numeros", M for "Madame", M for "Mademoiselle", M for "Monseigneur", P abbreviating "Professor"/"Professeur", or f abbreviating "function"). 
Notes: 
* The same device is also used in French to abbreviate a "-tion" or "-tions" suffix (which derives from Latin "-tio" or "-tios"). But I've also seen other abbreviation marks used for "-tion" and "-tions". 
* We also have in Unicode distinctive codes for dots used as abbreviation marks (they are not combining, but still encoded distinctly from the regular punctuation full stop), and for the mathematical binary dot operator, or the decimal separator, or for implicit mathematical operators that don't mark anything (i.e. invisible and zero-width) but that only break grapheme clusters and prohibit formation of discretionary ligatures. 
Medieval books or mails contained lots of abbreviation marks due to the cost of paper (or parchment): texts were then frequently "packed" using combining abbreviation marks in various positions (generally above or below).
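[No such combining abbreviation mark exists in Unicode today, so the character proposed above is hypothetical. The nearest approximation with existing code points, offered purely as an illustration and not as anything endorsed in this thread, pairs a modifier (superscript) letter with U+0333 COMBINING DOUBLE LOW LINE. A minimal Python sketch:

    # Illustrative only: the proposed COMBINING ABBREVIATION MARK does not exist;
    # these strings merely approximate the intended rendering with code points
    # that already exist in Unicode.
    import unicodedata as ud

    approximations = {
        "Mister":   "M\u02B3",        # M + U+02B3 MODIFIER LETTER SMALL R
        "Magister": "M\u02B3\u0333",  # ... + U+0333 COMBINING DOUBLE LOW LINE
        "Numero":   "N\u00BA",        # N + U+00BA MASCULINE ORDINAL INDICATOR
    }
    for word, abbr in approximations.items():
        names = ", ".join(ud.name(c) for c in abbr)
        print(f"{word:10} {abbr}   [{names}]")

Whether any given font stacks the double low line acceptably under a modifier letter is, of course, exactly the rendering question debated in the rest of the thread.]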
The Germanic "Fraktur e" was a remnant of this old practice, inherited from phonetic annotations added on top of Greek, Hebrew and Arabic, which later turned into an "umlaut" that Unicode unified with the diaeresis, even if it breaks the historic link to the letter Latin "e" used like an abreviation mark or Hebrew vowel point in Fraktur (I think that the history of the "Germanic Fraktur e" is highly linked to the influence of Hebrew in today's Germany, or Greek in today's Eastern and Southern Europe with some Slavic traditions in Cyrillic connected to religious traditions in Greek). The introduction of interlinear annotations in Greek was also margely influenced by Hebrew and Arabic (which however did not turn these marks into plain letters and avoided the formation of complex ligatures like in Indian Brahmic scripts), but was the base of the interlinear notation of actual phonetic. Even the combining accents in French were created after an initial step using ligatures of plain letters, before people started to replace these ligatures by some unstable combining marks (initially not distinguished) then turned them into plain distinctive accents which became the de facto standard (made the offical orthography only very late: before that there was a wide variation between those that wanted to distinguish phonetics, using different accents, but now French tends to simplify this set: the circumflkex in French was an abreviation mark for the unwritten letter "s" which initially was more like the tilde, i.e. a turned small "s"). The German umlaut written like a diaeresis is also very new (only after the abandonment of the Fraktut alphabet where the "e" just looked like two thick vertical strokes Le dim. 28 oct. 2018 ? 10:41, arno.schmitt via Unicode a ?crit : > Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode: > > The notation is a quite widespread format for abbreviations. the > > first letter is normal sized, and the subsequent letter is written in > > some variety of superscript with a squiggle underneath so that it > > doesn't get overlooked. I have deduced that this is not plain text > > because there is no encoding mechanism for it. For example, our > > lecturers would frequently use this treatment to abbreviate function > > as 'fn' with the 'n' superscript and supported by a squiggle below > > sitting on the baseline. The squiggle below has meaning; it marks the > > word as an abbreviation. > > > > Richard. > > Looks to me like U+2116 ? NUMERO SIGN > which perhaps should not have encoded, > since we have both U+004E LATIN CAPITAL LETTER N and > U+00BA ? MASCULINE ORDINAL INDICATOR > > Arn0 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 28 12:28:24 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sun, 28 Oct 2018 18:28:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: (Philippe Verdy via Unicode's message of "Sun, 28 Oct 2018 15:19:50 +0100") References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> Message-ID: <86in1mgevb.fsf@mimuw.edu.pl> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote: > Given the "squiggle" below letters are actually gien distinctive > semantics, I think it should be encoded a combining character (to be > written not after a "superscript" but after any normal base letter, > possibly with other combining characters, or CGJ if needed because of > the compatibility equivalence. That "squiggle" (which may look like > an underscore) would haver the effect of implicity making the base > letter superscript (smaller and elevated). It would have probably a > "combining below" class. Seems to me an elegant solution. [...] On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: > Mr? / M=? For me only the latter seems acceptable. Using COMBINING LATIN SMALL LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as the base character. However in the lack of a better solution I can live with it :-) An alternative would be to use SMALL EQUALS SIGN, but looks like fonts supporting it are rather rare. > > Le dim. 28 oct. 2018 ? 10:41, arno.schmitt via Unicode a ?crit : [...] > Looks to me like U+2116 ? NUMERO SIGN > which perhaps should not have encoded, > since we have both U+004E LATIN CAPITAL LETTER N and > U+00BA ? MASCULINE ORDINAL INDICATOR I'm rather sure it is inherited from a character set used for the round-trip test. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sun Oct 28 12:54:34 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 18:54:34 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <86in1mgevb.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> Message-ID: Le dim. 28 oct. 2018 ? 18:28, Janusz S. Bie? a ?crit : > On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote: > > Given the "squiggle" below letters are actually gien distinctive > > semantics, I think it should be encoded a combining character (to be > > written not after a "superscript" but after any normal base letter, > > possibly with other combining characters, or CGJ if needed because of > > the compatibility equivalence. That "squiggle" (which may look like > > an underscore) would haver the effect of implicity making the base > > letter superscript (smaller and elevated). It would have probably a > > "combining below" class. > > Seems to me an elegant solution. > > [...] > > On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: > > Mr? / M=? > > For me only the latter seems acceptable. Using COMBINING LATIN SMALL > LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as > the base character. However in the lack of a better solution I can live > with it :-) > There's a third alternative, that uses the superscript letter r, followed by the combining double underline, instead of the normal letter r followed by the same combining double underline. 
However it is still not very elegant if we stil need to use only the limited set of superscript letters (this still reduces the number of abbreviations, such as those commonly used in French that needs a superscript "?") -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 28 13:18:29 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Oct 2018 19:18:29 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> Message-ID: Also if the "combining abbreviation mark" is used only at end of a combining sequence to transform it, we can avoid all needs of CGJ for that mark, if the mark is itself assigned the combining class 0. So - abbreviating "Mister" as "M" (without the underscore below "r") becomes - abbreviating "Monseigneur" as "M" (without the underscore below "g" and "r") becomes - abbreviating "Ditto" as "D" (without the underscore below "to") becomes - abbreviating "Operation" as "Op (without the underscore below "to") becomes

- abbreviating "constitutionalit?" as "C (without the underscore below "t?") becomes or - abbreviating "Num?ro" as "N" (without the underscore below "o") becomes - abbreviating "Magister" as "M" (with the double underscore below "r") becomes It is quite easy for text renderers to infer the selection of a small superscript for the base (and its other combining characters or extenders when they support these combinations), before applying the new combiner mark. If not, they can still render the leading base (and its other supported combining characters or extenders), followed by some dotted mark (e.g. a small dotted circle). Renderers that do not recognize the new combining abbreviation mark will just render it at end of the sequence as a usual square or rectangular "tofu"; those that recognize it as a combining character but no support for it, will render the usual dotted square (meaning "unsupported combining mark", to distinguish from the meaning as if there was a "missing base character" to apply before a known combining mark or extender) Le dim. 28 oct. 2018 ? 18:54, Philippe Verdy a ?crit : > Le dim. 28 oct. 2018 ? 18:28, Janusz S. Bie? a > ?crit : > >> On Sun, Oct 28 2018 at 15:19 +0100, Philippe Verdy via Unicode wrote: >> > Given the "squiggle" below letters are actually gien distinctive >> > semantics, I think it should be encoded a combining character (to be >> > written not after a "superscript" but after any normal base letter, >> > possibly with other combining characters, or CGJ if needed because of >> > the compatibility equivalence. That "squiggle" (which may look like >> > an underscore) would haver the effect of implicity making the base >> > letter superscript (smaller and elevated). It would have probably a >> > "combining below" class. >> >> Seems to me an elegant solution. >> >> [...] >> >> On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote: >> > Mr? / M=? >> >> For me only the latter seems acceptable. Using COMBINING LATIN SMALL >> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as >> the base character. However in the lack of a better solution I can live >> with it :-) >> > > There's a third alternative, that uses the superscript letter r, followed > by the combining double underline, instead of the normal letter r followed > by the same combining double underline. > However it is still not very elegant if we stil need to use only the > limited set of superscript letters (this still reduces the number of > abbreviations, such as those commonly used in French that needs a > superscript "?") > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Oct 28 15:12:27 2018 From: unicode at unicode.org (Garth Wallace via Unicode) Date: Sun, 28 Oct 2018 13:12:27 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> Message-ID: On Sun, Oct 28, 2018 at 2:34 AM arno.schmitt via Unicode < unicode at unicode.org> wrote: > Am 28.10.2018 um 09:13 schrieb Richard Wordingham via Unicode: > > The notation is a quite widespread format for abbreviations. the > > first letter is normal sized, and the subsequent letter is written in > > some variety of superscript with a squiggle underneath so that it > > doesn't get overlooked. 
I have deduced that this is not plain text > > because there is no encoding mechanism for it. For example, our > > lecturers would frequently use this treatment to abbreviate function > > as 'fn' with the 'n' superscript and supported by a squiggle below > > sitting on the baseline. The squiggle below has meaning; it marks the > > word as an abbreviation. > > > > Richard. > > Looks to me like U+2116 № NUMERO SIGN > which perhaps should not have been encoded, > since we have both U+004E LATIN CAPITAL LETTER N and > U+00BA º MASCULINE ORDINAL INDICATOR > AIUI, № was encoded as a compatibility character because it appears in some East Asian character sets -------------- next part -------------- An HTML attachment was scrubbed... URL: 
From unicode at unicode.org Sun Oct 28 15:42:04 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 28 Oct 2018 20:42:04 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <86in1mgevb.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> Message-ID: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> This is no different from the Irish name McCoy, which can be written MᶜCoy, where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative. I would encode this as Mʳ if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=ͬ or anything else like that, because the "r" is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr͇, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.) I like palaeographic renderings of text very much indeed, and in fact remain in conflict with members of the UTC (who still, alas, do NOT communicate directly about such matters, but only in duelling ballot comments) about some actually salient representations required for medievalist use. The squiggle in your sample, Janusz, does not indicate anything; it is only a decoration, and the abbreviation is the same without it. Michael Everson > On 28 Oct 2018, at 17:28, Janusz S. Bień via Unicode wrote: > > For me only the latter seems acceptable. Using COMBINING LATIN SMALL > LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as > the base character. However in the lack of a better solution I can live > with it :-) 
From unicode at unicode.org Sun Oct 28 16:47:44 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sun, 28 Oct 2018 21:47:44 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: <8063207F-0BAF-495E-A95B-BFAAAE4BBAE4@evertype.com> I think that it is the _superscription_ that indicates the fact that it is an abbreviation. In English "þe" was written "ye" and "yᵉ" and "yᵗ", 
and the last of these might have a dot or a line or a squiggle underneath it, or not, and in no case was that dot or line or squiggle either _meaningful_ or necessary. Michael Everson > On 28 Oct 2018, at 21:43, Piotr Karocki wrote: > >> The squiggle in your sample, Janusz, does not indicate anything; it is only a decoration, and the abbreviation is the same without it. > > I disagree. This squiggle means "warning, this is an abbreviation", and is > present in many abbreviations in many centuries (sometimes, although, > 'abbrev symbol' is rendered differently). So yes, it is an important symbol and > shouldn't be lost in transliteration. > > Piotr Karocki 
From unicode at unicode.org Sun Oct 28 18:57:06 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 28 Oct 2018 23:57:06 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: The umlauts in the band name "Mötley Crüe" are decorative, yet the difference between "Mötley Crüe" and "Motley Crue" is one of spelling. Although the tilde in the place name "Rancho Peñasquitos" is *not* decorative, "Rancho Peñasquitos" vs. "Rancho Penasquitos" is still a spelling difference. Dingbats are both decorative and representable in computer plain text. (??????) Conventions exist in computer plain text for distinguishing *bold* and /italic/ text strings, why not a convention for abbreviation superscripts & squiggles? (At least until something better comes along, such as a direct encoding along the lines of Philippe Verdy's earlier suggestion.) "M=ͬ" might render properly (or not, Notepad using Lucida Console fails here), but it wouldn't easily accommodate needed superscripted Latin small diacriticized letters. "Mr͇" for display purposes may look as daft as "/italics/", but it captures the elements of the text of the original manuscript. And it would allow preservation of abbreviations such as for "constitutionalité" ? "Ct???". If "Mccoy" vs. "McCoy" vs. "MCCOY" vs. "MC COY" represent spelling differences, then so do "McCoy" vs "MᶜCoy". It's a matter of opinion, and opinions often differ. 
From unicode at unicode.org Mon Oct 29 00:21:57 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Mon, 29 Oct 2018 06:21:57 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> (Michael Everson's message of "Sun, 28 Oct 2018 20:42:04 +0000") References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: <86efc91g5m.fsf@mimuw.edu.pl> On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson wrote: > This is no different from the Irish name McCoy which can be written MᶜCoy > where the raising of the c is actually just decorative, though perhaps > it was once an abbreviation for Mac. In some styles you can see a line > or a dot under the raised c. This is purely decorative. > > I would encode this as Mʳ if you wanted to make sure your data > contained the abbreviation mark. [...] > The squiggle in your sample, Janusz, does not indicate anything; it is > only a decoration, and the abbreviation is the same without it. 
I have received off the list even more radical suggestion: >>> The third and the last question is: how to encode this symbol in >>> Unicode? > > Why would you need to? Its plain text content is adequately > represented by "Mr" On Sun, Oct 28 2018 at 23:57 GMT, James Kass wrote: > The umlauts in the band name "M?tley Cr?e" are decorative, yet the > difference between "M?tley Cr?e" and "Motley Crue" is one of > spelling.? Although the tilde in the place name "Rancho Pe?asquitos" > is *not* decorative, "Rancho Pe?asquitos" vs. "Rancho Penasquitos" is > still a spelling difference. [...] > If "Mccoy" vs. "McCoy" vs. "MCCOY" vs. "MC COY" represent spelling > differences, then so do "McCoy" vs "M?Coy".? It's a matter of opinion, > and opinions often differ. Well said, but I make the claim stronger; it depends on the purpose of the encoding and intended applications. Handwriting recognition (HWR) is no longer just an abstract possibility, it's a facility present to everybody e.g. in Transkribus (https://transkribus.eu/) which I actually use for transcribing the texts of interest. Do you claim that in the ground-truth for HWR the squiggle and raising doesn't matter? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Oct 29 01:50:11 2018 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 29 Oct 2018 06:50:11 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: On 2018/10/29 05:42, Michael Everson via Unicode wrote: > This is no different the Irish name McCoy which can be written M?Coy where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative. > > I would encode this as M? if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=? or anything else like that, because the ?r? is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr?, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.) I think this may depend on actual writing practice. In German at least, it is customary to have dots (periods) at the end of abbreviations, and using any other symbol, or not using the dot, would be considered an error. The question of how to encode that dot is fortunately an easy one, but even if it were not, German-writing people would find a sentence such as "The dot or ... has no meaning at all." extremely weird. The dot is there (and in German, has to be there) because it's an abbreviation. Regards, Martin. 
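[For comparison, the candidate encodings weighed earlier in the thread can be inspected directly. Exactly which characters each poster had in mind is partly an assumption here: U+02B3 MODIFIER LETTER SMALL R for the superscript-letter option, U+036C COMBINING LATIN SMALL LETTER R over an EQUALS SIGN or U+FE66 SMALL EQUALS SIGN for the others. A small Python sketch showing the code points and what compatibility normalization does to each:

    # Sketch only; the exact characters intended by the thread's shorthand
    # ("Mr?" / "M=?") are an assumption, not a quotation.
    import unicodedata as ud

    candidates = {
        "modifier letter r":      "M\u02B3",        # U+02B3 MODIFIER LETTER SMALL R
        "equals sign + comb. r":  "M=\u036C",       # U+036C COMBINING LATIN SMALL LETTER R
        "small equals + comb. r": "M\uFE66\u036C",  # U+FE66 SMALL EQUALS SIGN
    }
    for label, s in candidates.items():
        points = " ".join(f"U+{ord(c):04X}" for c in s)
        print(f"{label:24} {points:26} NFKC -> {ud.normalize('NFKC', s)!r}")

Only the modifier-letter form folds to plain "Mr" under NFKC, which is the compatibility-equivalence concern raised above; the combining-r forms keep the mark but hang it off a base character that carries no meaning of its own.]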
From unicode at unicode.org Mon Oct 29 02:57:45 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 29 Oct 2018 07:57:45 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <86efc91g5m.fsf@mimuw.edu.pl> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <86efc91g5m.fsf@mimuw.edu.pl> Message-ID: <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com> Janusz S. Bie? asked, > Do you claim that in the ground-truth for HWR the > squiggle and raising doesn't matter? Not me!? "McCoy", "M=?Coy", and "M-?Coy" are three different ways of writing the same surname.? If I were entering plain text data from an old post card, I'd try to keep the data as close to the source as possible.? Because that would be my purpose.? Others might have different purposes.? As you state, it depends on the intention. But, if there were an existing plain text convention I'd be inclined to use it.? Conventions allow for the possibility of interchange, direct encoding would ensure it. From unicode at unicode.org Mon Oct 29 05:43:57 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Mon, 29 Oct 2018 11:43:57 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com> (James Kass's message of "Mon, 29 Oct 2018 07:57:45 +0000") References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <86efc91g5m.fsf@mimuw.edu.pl> <9e196138-a72b-edeb-5deb-bab80ba4286e@gmail.com> Message-ID: <86bm7duj6a.fsf@mimuw.edu.pl> On Mon, Oct 29 2018 at 7:57 GMT, James Kass wrote: > Janusz S. Bie? asked, > >> Do you claim that in the ground-truth for HWR the >> squiggle and raising doesn't matter? > > Not me! I know, sorry if my previous mail was confusing. > "McCoy", "M=?Coy", and "M-?Coy" are three different ways of > writing the same surname.? If I were entering plain text data from an > old post card, I'd try to keep the data as close to the source as > possible.? Because that would be my purpose.? Others might have > different purposes.? As you state, it depends on the intention. But, > if there were an existing plain text convention I'd be inclined to use > it.? Conventions allow for the possibility of interchange, direct > encoding would ensure it. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Oct 29 06:36:04 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 29 Oct 2018 04:36:04 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: An HTML attachment was scrubbed... 
URL: 
From unicode at unicode.org Mon Oct 29 06:53:01 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 29 Oct 2018 11:53:01 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: <20181029115301.664e62e1@JRWUBU2> On Sun, 28 Oct 2018 20:42:04 +0000 Michael Everson via Unicode wrote: > I like palaeographic renderings of text very much indeed, and in fact > remain in conflict with members of the UTC (who still, alas, do NOT > communicate directly about such matters, but only in duelling ballot > comments) about some actually salient representations required for > medievalist use. The squiggle in your sample, Janusz, does not > indicate anything; it is only a decoration, and the abbreviation is > the same without it. I think this is one of the few cases where Multicode may have advantages over Unicode. In a mathematical context, aⁿ would be interpreted as _a_ applied _n_ times. As to "fⁿ", ambiguity may be avoided by the superscript being inappropriate for an exponent. What is redundant in one context may be significant in another. Richard. 
From unicode at unicode.org Mon Oct 29 14:20:49 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 29 Oct 2018 12:20:49 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> Richard Wordingham wrote: >> I like palaeographic renderings of text very much indeed, and in fact >> remain in conflict with members of the UTC (who still, alas, do NOT >> communicate directly about such matters, but only in duelling ballot >> comments) about some actually salient representations required for >> medievalist use. The squiggle in your sample, Janusz, does not >> indicate anything; it is only a decoration, and the abbreviation is >> the same without it. > > I think this is one of the few cases where Multicode may have > advantages over Unicode. In a mathematical context, aⁿ would be > interpreted as _a_ applied _n_ times. As to "fⁿ", ambiguity may be > avoided by the superscript being inappropriate for an exponent. What > is redundant in one context may be significant in another. Are you referring to the encoding described in the 1997 paper by Mudawwar, which "address[es] Unicode's principal drawbacks" by switching between language-specific character sets? Kind of like ISO 2022, but less extensible? ObMagister: I agree that trying to reflect every decorative nuance of handwriting is not what plain text is all about. (I also disagree with those who insist that superscripted abbreviations are required for correct spelling in certain languages, and I expect to draw swift flamage for that stance.) The abbreviation in the postcard, rendered in plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion just fuels the recurring demands for every Latin letter (and eventually those in other scripts) to be duplicated in subscript and superscript, à la L2/18-206. Back into my hole now. 
-- Doug Ewell | Thornton, CO, US | ewellic.org 
From unicode at unicode.org Mon Oct 29 15:20:36 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 29 Oct 2018 21:20:36 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> Message-ID: <397615514.10318.1540844437188.JavaMail.www@wwinf2209> On 29/10/18 20:29, Doug Ewell via Unicode wrote: [...] > ObMagister: I agree that trying to reflect every decorative nuance of > handwriting is not what plain text is all about. Agreed. > (I also disagree with > those who insist that superscripted abbreviations are required for > correct spelling in certain languages, and I expect to draw swift > flamage for that stance.) It all (no "flamage", just trying to understand) depends on how we set the level of requirements, and what is understood by "correct". There is even an official position arguing that representing an "œ" with an "oe" string is correct, and that using the correct "œ" is not required. > The abbreviation in the postcard, rendered in > plain text, is "Mr". Bringing U+02B3 or U+036C into the discussion In English, "Mr" for "Mister" is correct, because English does not use superscript here, according to my knowledge. Ordinal indicators are considered different, and require superscript in correct representation. Thus being trained on English, one cannot easily evaluate what is correct and what is required for correctness in a neighbor locale. > just > fuels the recurring demands for every Latin letter (and eventually those > in other scripts) to be duplicated in subscript and superscript, à la > L2/18-206. That is a generic request, unrelated to any locale, based only on a kind of criticism of poor rendering systems. The "fake super-/subscripts" are already fixed if only OpenType is supported and fonts are complete. > > Back into my hole now. No worries. Stay tuned :-) Informed discussion brings advancement. Best regards, Marcel 
From unicode at unicode.org Mon Oct 29 20:47:25 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 30 Oct 2018 02:47:25 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: For the case of "Mister" vs. "Magister", the (double) underlining is not just a stylistic option but conveys semantics as an explicit abbreviation mark! We are here at the line between what is pure visual encoding (e.g. using superscript letters), and logical encoding (as done everywhere else in Unicode with combining sequences; the most well-known exception being the Thai script, which uses the visual model). Obviously the Latin script should not use any kind of visual encoding, and even the superscript letters (initially introduced for something else, notably as distinct symbols for IPA) were not the correct path (it also has limitations because the superscript letters are quite limited; the same can be said about the visual encoding of mathematical symbols as stylistic variants transformed as plain characters, which will always be incomplete, while it could as well be represented logically). 
So Unicode does not have a consistent policy (and this inconsistence was not just introduced due to legacy roundtrip compatibibility, like the Numero abbreviation or the encoding of the Thai script). Le lun. 29 oct. 2018 ? 12:44, Asmus Freytag via Unicode a ?crit : > On 10/28/2018 11:50 PM, Martin J. D?rst via Unicode wrote: > > On 2018/10/29 05:42, Michael Everson via Unicode wrote: > > This is no different the Irish name McCoy which can be written M?Coy where the raising of the c is actually just decorative, though perhaps it was once an abbreviation for Mac. In some styles you can see a line or a dot under the raised c. This is purely decorative. > > I would encode this as M? if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=? or anything else like that, because the ?r? is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr?, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.) > > > I think this may depend on actual writing practice. In German at least, > it is customary to have dots (periods) at the end of abbreviations, and > using any other symbol, or not using the dot, would be considered an error. > > The question of how to encode that dot is fortunately an easy one, but > even if it were not, German-writing people would find a sentence such as > "The dot or ... has no meaning at all." extremely weird. The dot is > there (and in German, has to be there) because it's an abbreviation. > > Swedes employ ":" for abbreviations but often (always?) for eliding > several word-interior letters. Definitely also a case of a non-optional > convention. > > The use of superscript is tricky, because it can be optional in some > contexts; if I write "3rd" in English, it will definitely be understood no > different from "3rd". Likewise with the several marks below superscripts. > Whether "numero" has an underline or not appears to be a matter of font > design, with some regional preferences (which also affect the style of the > N). > > I'm very much with James that questions of what is spelling vs. what is > style (decoration) can be a matter of opinion - or better perhaps, a matter > of convention and associated expectations. And that there may not always be > unanimity in the outcome. > > In TeX the two transition fluidly. If I was going to transcribe such texts > in TeX, I would construct a macro for the construct of the entire > abbreviation and would name it. That macro would raise the "r", and then - > depending on the desired fidelity of the style of the document, might > include secondary elements, such as underlining, or a squiggle. > > In the standard rich text model of plaintext "back bone" combined with > font selection (and other styling), the named macro would correspond to > encoding the semantic of an Mr abbreviation in the "superscript r" > convention and the details would be handled in the font design. > > That system is perhaps not well suited to exact transcriptions because > unlike Tex, it separates the two aspects, and removes the aspect of > detailed glyph design from the control of the author, unless the latter is > also a font-designer. 
> > Nevertheless, I think the use of devices like combining underlines and > superscript letters in plain text are best avoided. > > A./ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Oct 29 22:06:57 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 30 Oct 2018 03:06:57 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> Asmus Freytag wrote, > Nevertheless, I think the use of devices like combining underlines > and superscript letters in plain text are best avoided. That's probably true according to the spirit of the underlying encoding principles.? But hasn't that genie already left the bottle? People write their names as they please.? With the entire repertoire of Unicode from which to choose, people are coming up with some amazingly unorthodox ways to "spell" their screen names.? Here's six screen names copy/pasted from an atypical Twitter account's comments sections: Jo? ????ic??? I?MAGI?NER? ???? IXOYE444 (?This one included character U+200F, I removed it.) Q?y ? eT ? Dog ? VOTES? ??? ?? ??K?????z ?? ??? ??? (?Note the decorative emoji.) People are mixing scripts and so forth in order to create distinctive screen names.? Those screen names are out there in the wild and are part of our stored data which future historians are welcome to scratch their heads over. IIRC, around the time that the math alphanumerics were added to Plane One, Michael Everson noted that once characters are encoded people will use them as they see fit.? In this present thread, Michael Everson wrote: > And I would not encode it as Mr?, firstly because it > would never render properly and you might as well > encode it as Mr. or M:r, and second because in the > IPA at least that character indicates an alveolar > realization in disordered speech. (Of course it > could be used for anything.) Yes, it could be used for anything requiring combining-two-lines-below.? At some point, if enough people were doing it, it would morph from a kludge of hacking alveolar whatevers into an accepted convention.? (Not that I am pushing this approach, it's only one suggestion out of many possibilities.? I'm in favor of direct encoding.)? I would not encode the abbreviation as either "Mr." or "M:r" because neither of those text strings appear in the original manuscript. FAICT, "????" is pronounced just like "Tom", but it ain't spelled the same.? Likewise for "McCoy" and "M=?Coy". It strikes me as perverse if "????" can spell his name as he pleases using the UCS but "M=?Coy" mustn't.? Especially since names like "M=?Coy" and abbreviations such as "M=?" could be typed on old-style mechanical typewriters.? Quintessential plain-text, that. 
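[The "spelling difference" claim can be restated in code. Assuming the raised c is U+1D9C MODIFIER LETTER SMALL C (an assumption; the original posts only show the rendered form), the two shapes of the surname are distinct code point sequences that only compatibility normalization conflates. A minimal Python sketch:

    # Sketch: distinct code point sequences, conflated only by NFKC.
    # U+1D9C MODIFIER LETTER SMALL C is assumed as the raised c.
    import unicodedata as ud

    plain, raised = "McCoy", "M\u1D9CCoy"
    print(plain == raised)                        # False: different sequences
    print(ud.normalize("NFC", raised) == plain)   # False: not canonically equivalent
    print(ud.normalize("NFKC", raised) == plain)  # True: compatibility-equivalent

So a plain-text search, a filename comparison, or a database key would treat the two as different unless it deliberately applies compatibility folding.]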
From unicode at unicode.org Mon Oct 29 23:46:50 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 29 Oct 2018 21:46:50 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> Message-ID: <28edcd88-2294-741c-e65f-eb52891459ae@att.net> On 10/29/2018 8:06 PM, James Kass via Unicode wrote: > could be typed on old-style mechanical typewriters.? Quintessential > plain-text, that. Nope. Typewriters were regularly used for underscoring and for strikethrough, both of which are *styling* of text, and not plain text. The mere fact that some visual aspect of graphic representation on a page of paper can be implemented via a mechanical typewriter does not, ipso facto, mean that particular feature is plain text. The fact that I could also implement superscripting and subscripting on a mechanical typewriter via turning the platen up and down half a line, also does not make *those* aspects of text styling plain text. either. The same reasoning applies to handwriting, only more so. --Ken From unicode at unicode.org Tue Oct 30 03:42:29 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 30 Oct 2018 08:42:29 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> Message-ID: <20181030084229.0f67ce4d@JRWUBU2> On Mon, 29 Oct 2018 12:20:49 -0700 Doug Ewell via Unicode wrote: > Richard Wordingham wrote: > > I think this is one of the few cases where Multicode may have > > advantages over Unicode. In a mathematical contest, a? would be > > interpreted as _a_ applied _n_ times. As to "f?", ambiguity may be > > avoided by the superscript being inappropriate for an exponent. What > > is redundant in one context may be significant in another. > > Are you referring to the encoding described in the 1997 paper by > Mudawwar, which "address[es] Unicode's principal drawbacks" by > switching between language-specific character sets? Kind of like ISO > 2022, but less extensible? More precisely to the principle. What is an irrelevant, optional feature in one writing system may be significant in another. I'm currently trying to work out the rules for writing Pali in the Sinhala script - I have to worry about the difference between touching letters and conjuncts. A simple ISCII-like encoding for Sinhala Pali would delegate such matters to the font. Richard. From unicode at unicode.org Tue Oct 30 04:02:53 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 30 Oct 2018 09:02:53 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <28edcd88-2294-741c-e65f-eb52891459ae@att.net> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> <28edcd88-2294-741c-e65f-eb52891459ae@att.net> Message-ID: Ken Whistler replied, >> could be typed on old-style mechanical >> typewriters.? Quintessential plain-text, that. > > Nope. 
Typewriters were regularly used for > underscoring and for strikethrough, both of which > are *styling* of text, and not plain text. The > mere fact that some visual aspect of graphic > representation on a page of paper can be > implemented via a mechanical typewriter does not, > ipso facto, mean that particular feature is plain > text. The fact that I could also implement > superscripting and subscripting on a mechanical > typewriter via turning the platen up and down half > a line, also does not make *those* aspects of text > styling plain text. either. Sorry if we disagree. I've never used a typewriter for producing anything other than text.? Just plain old unadorned text.? Plain text.? Colloquially speaking rather than speaking technically.? Text existed before the computer age. A typewriter puts text on paper.? Pressing the "M" key while holding the "Shift" key puts "M" on the sheet.? Rolling the platen appropriately and striking "r" puts a superscript "r" on the sheet. Hitting the backspace key, rolling the platen a bit in the other direction and typing the "equals" key finishes this abbreviation in the text on the page.? Then the user rolls the platen to its earlier position and resumes typing.? (It's way easier to do than to describe.) If the typist didn't intend to put a superscript "r" on that page with a double underline, the typist wouldn't have bothered with all that jive. It's about the importance one places on respecting authorial intent. Anything reasonable done on a mechanical typewriter can be replicated in an electronic data display.? If necessary I'd use a kludge before I'd hold my breath waiting for direct encoding when the desired result is for the displayed text on the screen to match the handwritten text in the source as closely as possible.? (I've used lots of kludges while awaiting the real M=?Coy.) Sure, underscoring was used for s?t?r?e?s?s?, but it wasn't used *as* a stylistic difference as much as it was used *in lieu* of the ability to make a stylistic difference, such as bolding or italicizing.? It's the "plain text" convention of that time, predating the asterisks or slashes used in the modern convention. Underscoring might be stripped without messing with the legibility, but so could tatweels and lots of other stuff.? If nothing should mung the asterisks and slashes used in the modern convention, then the earlier convention's underscoring is every bit as worthy of being preserved.? (If I'm not mistaken, there was also some kind of underscoring convention for titles which was used instead of placing titles in quotes.) Strikethrough isn't stylistic if it's done to type a character which isn't present on one of the keys.? For example, letters with strokes used for minority languages, like "?".? I don't see strikethrough as "style" if the typist didn't want to waste White Out on a draft, either. Perhaps I should have referred to typewritten text as seminal plain text rather than quintessential plain text, but quintessential scans better. Speaking of text, computer age or otherwise, the O.E.D. 
definition of text as related to computers appears outdated and/or incomplete: https://en.oxforddictionaries.com/definition/text (definition 1.3) From unicode at unicode.org Tue Oct 30 06:43:14 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 30 Oct 2018 11:43:14 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> <28edcd88-2294-741c-e65f-eb52891459ae@att.net> Message-ID: <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com> (Still responding to Ken Whistler's post) > The fact that I could also implement superscripting and subscripting on a > mechanical typewriter via turning the platen up and down half a line, also > does not make *those* aspects of text styling plain text, either. Do you know the difference between H₂SO₄ and H2SO4? One of them is a chemical formula, the other one is a license plate number. T̲h̲a̲t̲ is not a stylistic difference /in my book/. (Emphasis added.) But suppose both those strings were *intended* to represent the chemical formula? Then one of them would be optimally correct; the other one... meh. Now what if we were future historians given the task of encoding both of those strings, from two different sources, and had no idea what those two strings were supposed to represent? Wouldn't it be best to preserve both strings intact, as they were originally written? From unicode at unicode.org Tue Oct 30 08:13:01 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 30 Oct 2018 13:13:01 +0000 (GMT) Subject: A sign/abbreviation for "magister" References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> <28edcd88-2294-741c-e65f-eb52891459ae@att.net> <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com> Message-ID: On 2018-10-30, James Kass via Unicode wrote: > (Still responding to Ken Whistler's post) .... > Do you know the difference between H₂SO₄ and H2SO4? One of them is a > chemical formula, the other one is a license plate number. T̲h̲a̲t̲ is > not a stylistic difference /in my book/. (Emphasis added.) Yes. In chemical notation, sub/superscripting is semantically significant. That's not the case for abbreviations: the choice of Mr or any of its superscripted and decorated variations is not semantically significant. The English abbreviation Mr was also frequently superscripted in the 15th-17th centuries, and that didn't mean anything special either - it was just part of a general convention of superscripting the final segment of abbreviations, probably inherited from manuscript practice. > But suppose both those strings were *intended* to represent the chemical > formula? Then one of them would be optimally correct; the other one... meh. > > Now what if we were future historians given the task of encoding both of > those strings, from two different sources, and had no idea what those > two strings were supposed to represent? Wouldn't it be best to preserve > both strings intact, as they were originally written? Indeed - and that means an image, not any textual representation. The typeface might be significant too.
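(A quick check of how fragile that distinction is in practice, as a minimal Python sketch using only the standard unicodedata module; the two strings are the ones from the example above, nothing else is assumed.)

    import unicodedata

    formula = "H\u2082SO\u2084"   # "H₂SO₄", with SUBSCRIPT TWO and SUBSCRIPT FOUR
    plate = "H2SO4"               # plain ASCII digits

    print(formula == plate)                                  # False: the raw code point sequences differ
    print(unicodedata.normalize("NFC", formula) == plate)    # False: canonical normalization keeps the subscripts
    print(unicodedata.normalize("NFKC", formula) == plate)   # True: compatibility folding erases the distinction

Any pipeline that applies a compatibility folding turns the chemical formula into the license plate, so if both readings might matter, both raw strings have to be kept.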
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Oct 30 10:52:47 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Oct 2018 16:52:47 +0100 (CET) Subject: A sign/abbreviation for "magister" Message-ID: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> Rather than a dozen individual e-mails, I?m sending this omnibus reply for the record, because even if here and in CLDR (SurveyTool forum and Trac) everything has already been discussed and fixed, there is still a need to stay acknowledging, so as not to fail following up, with respect to the oncoming surveys, next of which is to start in 30 days. First here: On 29/10/2018 at 12:43, Dr Freytag via Unicode wrote: [?] > The use of superscript is tricky, because it can be optional in some > contexts; if I write "3rd" in English, it will definitely be > understood no different from "3rd". [Note that this second instance was actually intended to read "3??", but it was formatted using a higher-level protocol.] [?] > In TeX the two transition fluidly. If I was going to transcribe such > texts in TeX, I would construct a macro [?] [?] > Nevertheless, I think the use of devices like combining underlines > and superscript letters in plain text are best avoided. While most other scripts from Arabic to Duployan are generously granted all and everything they need for accurate representation, starting with preformatted superscripts and ending with superscripting or subscripting format controls, Latin script is often quite deliberately pulled down in order to make it unusable outside high-end DTP software, from TeX to Adobe InDesign, with the notable exception of sparsely and parsimoniously encoded preformatted characters for phoneticists and medievalists. E.g. in Arabic script, superscript is considered worth encoding and using without any caveat, whereas when Latin script is on, superscripts are thrown into the same cauldron as underscoring. Obviously Unicode don?t apply to Latin script the same principle they do to all other scripts, i.e. to free preformatted letters as suitable if they are part of a standard representation and in some cases are needed to ensure unambiguity. Mediterranean locales had preformatted ordinal indicators even in the Latin-1-only era, despite "1a" and "2o" may be understood no different from "1?" and 2?". The degree sign, that is on French keyboards, is systematically hijacked to represent the "n?" abbreviation, unless a string is limited to ASCII-only. Several Latin-script-using locales have standard representations and strong user demands for superscripts, which instead of being satisfied on Unicode level as would be done for any other of the world?s scripts, are obstinately rebuffed when not intended for phonetics, or in some cases, for palaeography. I wasn?t digging down to find out about those UTC members who on a regular basis are aggressively contradicting ballot comments about encoding palaeographic Latin letters, while proving unable to sustain any open and honest discussion on this List or elsewhere. Referring to what Dr Everson via Unicode wrote on 28/10/2018 at 21:49: > I like palaeographic renderings of text very much indeed, and in fact > remain in conflict with members of the UTC (who still, alas, do NOT > communicate directly about such matters, but only in duelling ballot > comments) about some actually salient representations required for > medievalist use. 
That said: On 29/10/2018 at 09:09, James Kass via Unicode wrote: [?] > If I were entering plain text data from an old post card, I'd try > to keep the data as close to the source as possible. Because that > would be my purpose. Others might have different purposes. > As you state, it depends on the intention. But, if there were an > existing plain text convention I'd be inclined to use it. > Conventions allow for the possibility of interchange, direct > encoding would ensure it. The goal of discouraging Latin superscripts is obviously to ensure that reliable document interchange is limited to the PDF. If Unicode were allowed to emit an official recommendation to use preformatted superscripts in Latin script, too, then font designers would implement comprehensive support of combining diacritics, and any plain text including superscripted abbreviations could use the preformatted characters, in order to gather the interoperability that Unicode was designed for. Referring to what Dr Verdy via Unicode wrote on 28/10/2018 at 19:01: [?] > However it is still not very elegant if we stil need to use only > the limited set of superscript letters (this still reduces the > number of abbreviations, such as those commonly used in French > that needs a superscript "?") The use of combining diacritics with preformatted superscripts is also the reason why Unicode is limiting encoding support to base letters, even for preformatted superscript letters. The rule that no *new* precomposed letters with acute accent are encoded anymore applies to superscripts too. A Unicode-conformant way to represent such abbreviations would IMO use U+1D49 followed by U+0301: ,??,. Other representations may require OpenType support, which in Latin script is often turned off, supposedly in order to shift to higher level protocols what Unicode makes available in plain text. Referring to what Dr Kass wrote on 29/10/2018 at 01:05: [?] > "Mr?" for display purposes may look as daft as "/italics/", but > it captures the elements of the text of the original manuscript. > And it would allow preservation of abbreviations such as for > "constitutionalit?" ? "Ct???". Using superscripts plus combining diacritics might be a way to address the limitations Dr Verdy mentioned on 30/10/2018 at 02:56: [?] > Obviously the Latin script should not use any kind of visual > encoding, and even the superscript letters (initially introduced > for something else, notably as distinct symbols for IPA) was not > the correct path (it also has limitation because the superscript > letters are quite limited; [?] But for font designers to implement combining diacritics for use with preformatted superscripts, Unicode needs to explicitly allow or recommend the use of preformatted superscripts in abbreviations. This use case is different from the use case that led to submit the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29: [?] > The abbreviation in the postcard, rendered in plain text, is "Mr". > Bringing U+02B3 or U+036C into the discussion just fuels the > recurring demands for every Latin letter (and eventually those > in other scripts) to be duplicated in subscript and superscript, > ? la L2/18-206. IMO this proposal implodes when considering that the preformatted characters are supposed to be inserted by the application rather than directly out of keyboard drivers. The document L2/18-206 seems to originate from the observation of poor fonts and rendering engines in low-end document editing software. 
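Setting rendering quality aside, the encoding side of the U+1D49 plus U+0301 suggestion above is easy to check. A minimal Python sketch, standard library only; it only demonstrates normalization behaviour, not any recommendation:

    import unicodedata

    seq = "\u1D49\u0301"   # U+1D49 MODIFIER LETTER SMALL E + U+0301 COMBINING ACUTE ACCENT

    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", seq)])    # ['U+1D49', 'U+0301']: unchanged
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", seq)])   # ['U+00E9']: folded to a baseline e with acute

The sequence survives canonical normalization untouched, but NFKC rewrites the modifier letter to a baseline "e" and then composes it with the acute, so any process that applies compatibility normalization loses the superscripting.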
As previously mentioned, the fix is already available using high-end DTP software. That is sustainable as long as no locales are impacted. What this thread is about is a digitally interoperable representation of actual languages. E.g. small caps is out of scope, given the postcard writer did not write the names in small caps, that in Latin script are merely a stylistic convention intended for scientific publication and so on ? while Cyrillic script currently uses ?small caps? to write in lowercase. Cyrillic also uses the ? sign, that is mapped to the second level on key E03 ("3" key) on the Russian and other Cyrillic keyboards. Russian keyboard layout: https://docs.microsoft.com/en-us/globalization/keyboards/kbdru.html Bulgaran (phonetic traditional) keyboard layout: https://docs.microsoft.com/en-us/globalization/keyboards/kbdbgph1.html Perhaps the Numero sign is used in Cyrillic after it had been encoded for East Asian as Dr Wallace via Unicode hinted on 28/10/2018 at 21:20: [?] > AIUI, ? was encoded as a compatibility character because it appears > in some East Asian character sets Still ? is also encoded in ISO/IEC 8859-5, at 0xf0. Further, Dr Whistler via Unicode stated on 30/10/2018 at 05:54: [?] > The mere fact that some visual aspect of graphic representation on a > page of paper can be implemented via a mechanical typewriter does not, > ipso facto, mean that particular feature is plain text. The fact that I > could also implement superscripting and subscripting on a mechanical > typewriter via turning the platen up and down half a line, also does not > make *those* aspects of text styling plain text. either. The reverse is true, too: The fact that some language representation was performed by tweaking the typewriter didn?t tag that representation as not plain text. E.g. the LATIN CAPITAL LETTER C WITH CEDILLA couldn?t be typed by holding Shift and hitting "?"?key E09, the "9" key?on a French keyboard. Nevertheless it is required for legibility when "?" occurs at the start of a sentence or in all-caps. The workaround was to type a COMMA over LATIN CAPITAL LETTER C. Likewise, SUPERSCRIPT TWO was available on French (France) typewriters, and Belgian French ones had SUPERSCRIPT THREE, too. Also, again, the now MODIFIER LETTER SMALL O was and still is emulated using the DEGREE SIGN (on level 2 of key E11). The fact that other superscript letters needed turning the platen does not make them belong to rich text, today. It?s as Dr Kass via Unicode put it on 30/10/2018 at 10:09 when replying to Dr Whistler via Unicode (above): [?] > If the typist didn't intend to put a superscript "r" on that page with a > double underline, the typist wouldn't have bothered with all that jive. > > It's about the importance one places on respecting authorial intent. > [?] > [?] Underscoring might be stripped without messing with the legibility, > but so could tatweels and lots of other stuff. [?] If the intent of Unicode is to discriminate Arabic script vs Latin script, that would be worth mentioning in the Standard. Making claims about interoperability and about unambiguous representation of all of the world?s scripts, Unicode is expected to do so for Latin, too. Dr Bie? via Unicode wrote on 29/10/2018 at 06:40: > > [?] It's a matter of opinion, and opinions often differ. > > Well said, but I make the claim stronger; it depends on the purpose of > the encoding and intended applications. 
Dr Everson via Unicode replied to Dr Karocki on 28/10/2018 at 22:55: > > I think that it is the _superscription_ that indicates the fact that > it is an abbreviation. Hence Unicode is expected to fully support the use of plain text superscript for those locales using superscript as an abbreviation indicator, in the same role as other locales may use colon or period, a usage that Dr D?rst via Unicode mentioned on 29/10/2018 at 08:04 responding to Dr Everson?s 05:42 (same day) e-mail: [?] > I think this may depend on actual writing practice. In German at least, > it is customary to have dots (periods) at the end of abbreviations, and > using any other symbol, or not using the dot, would be considered an error. So should be, in some locales among which French, not using superscript. It?s just that the perception of a superscript-less abbreviation that normally uses superscript, is biased by the computer keyboard layouts actually still in use (but hopefully soon to be enhanced by more complete layouts). Now is Unicode inspired by typewriting practice when designing the encoding of Latin script, unlike what is done for potentially all other scripts? Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn?t know when replying to Dr Ewell on 29/10/2018 at 21:27: [?] > The English abbreviation Mr was also frequently superscripted in the > 15th-17th centuries, and that didn't mean anything special either - it > was just part of a general convention of superscripting the final > segment of abbreviations, probably inherited from manuscript practice. So English dropped the superscript requirement for common abbreviations in the 17?? or 18?? century to keep it only for ordinals. Should Unicode now take example on English to pull down the representation of French? Fortunately it does not, as the French ordinal indicators are now a part of CLDR, consistently with what the French national body intended when setting up again a design process of a locale-conformant keyboard. The rest of superscript abbreviation letters should follow in CLDR when browsers will be using correct fonts for displaying the data. We remember that The Unicode Standard explicitly specifies that the glyphs of all superscript or modifier letters of a script shall be equalized. No ransom note effect is allowed in Unicode-conformant fonts (except for the purpose of artwork, as in Apple?s former San Francisco typeface). Best regards, Marcel From unicode at unicode.org Tue Oct 30 11:35:09 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Oct 2018 17:35:09 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> Message-ID: <1654688647.7700.1540917309959.JavaMail.www@wwinf2209> On 30/10/18 17:01 I wrote: > A Unicode-conformant way to represent > such abbreviations would IMO use U+1D49 followed by U+0301: ,??,. Works actually fine in my browser. My apologies to font designers and foundries, already supporting the combining diacritics with superscript Latin letters. Only in my text editor it didn?t work, hence the commas instead of quotes bracketing the literal. > We remember that The Unicode Standard explicitly specifies that the > glyphs of all superscript or modifier letters of a script shall be equalized. There is too much interpretation in that statement. TUS actually specifies that no difference of usage is intended by a difference in naming schemes, i.e. 
MODIFIER LETTERs shall not be discriminated from those letters having SUPERSCRIPT in their name. > No ransom note effect is allowed in Unicode-conformant fonts It may not be explicitely prohibited, though it is not Unicode conformant. Best regards, Marcel From unicode at unicode.org Tue Oct 30 12:51:22 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 Oct 2018 10:51:22 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Marcel Schneider wrote: > This use case is different from the use case that led to submit > the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29: I guess this is intended as a compliment. While many of the people you quoted do have doctoral degrees, many others of us do not. > E.g. small caps is out of scope, given the postcard writer did not > write the names in small caps, that in Latin script are merely a > stylistic convention intended for scientific publication and so on ? > while Cyrillic script currently uses ?small caps? to write in > lowercase. You're joking, right? ?? ?? ?? ?? This undermines a lot of what you are claiming to know about writing systems, and about the difference between case distinctions and styling. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Oct 30 13:25:37 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 30 Oct 2018 18:25:37 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com> References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> <320cc6c3-b698-1359-baee-d73d70075215@gmail.com> <28edcd88-2294-741c-e65f-eb52891459ae@att.net> <6388e734-97ba-2dfd-6b8d-8e2c9a18011d@gmail.com> Message-ID: <20181030182537.77eb9c26@JRWUBU2> On Tue, 30 Oct 2018 11:43:14 +0000 James Kass via Unicode wrote: > Now what if we were future historians given the task of encoding both > of those strings, from two different sources, and had no idea what > those two strings were supposed to represent?? Wouldn't it be best to > preserve both strings intact, as they were originally written? In general, it is not possible to encode text in Unicodeif one has no knowledge of what the text itself represents. Some English typewriters did not distinguish digit ?0? from capital letter ?O? or digit ?1? from small letter ?l?. Richard. From unicode at unicode.org Tue Oct 30 13:51:06 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Oct 2018 19:51:06 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Message-ID: <1918728727.9415.1540925466807.JavaMail.www@wwinf2209> On 30/10/2018 at 18:59, Doug Ewell via Unicode wrote: > > Marcel Schneider wrote: > > > This use case is different from the use case that led to submit > > the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29: > > I guess this is intended as a compliment. Right. > While many of the people you > quoted do have doctoral degrees, many others of us do not. Making a safe distinction is beyond my knowledge, safest is not to discriminate. > > > E.g. 
small caps is out of scope, given the postcard writer did not > > write the names in small caps, that in Latin script are merely a > > stylistic convention intended for scientific publication and so on ? > > while Cyrillic script currently uses ?small caps? to write in > > lowercase. > > You're joking, right? No, I wasn?t, nowhere. > > ?? ?? ?? ?? > > This undermines a lot of what you are claiming to know about writing > systems, and about the difference between case distinctions and styling. Unfortunately, yes. My apologies to all Cyrillic scriptors hurted while I assumed that every Cyrillic capital letter is a big version of its lowercase. It?s ironic, given I worked hard to revise the French nameslist, including the Cyrillic block, where I propose to make more subdivisions, the actual heading scheme seems to me as not being respectful enough. Sorry. Marcel From unicode at unicode.org Tue Oct 30 14:01:09 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 30 Oct 2018 19:01:09 +0000 Subject: Logical Order (was: A sign/abbreviation for "magister") In-Reply-To: References: <86tvl7tzkz.fsf@mimuw.edu.pl> <20181028081326.264dc079@JRWUBU2> <9f08cea7-720f-cc27-2105-240559b0a6b1@gmx.net> <86in1mgevb.fsf@mimuw.edu.pl> <48E94E9B-BD5D-426A-8461-FE0D2129CFF3@evertype.com> Message-ID: <20181030190109.15458137@JRWUBU2> On Tue, 30 Oct 2018 02:47:25 +0100 Philippe Verdy via Unicode wrote: > We are here at the line between what is pure visual encoding (e.g. > using superscript letters), and logical encoding (as done eveywhere > else in unicode with combining sequences; the most well known > exceptions being for Thai script which uses the visual model). For your information, Thai uses the logical encoding, almost by definition. The logical order is the order used in the backing store (See Section 2.2, Unicode Design Principles ). In the Thai ?combining sequences? you have in mind, the vowel symbols you have in mind are classified as letters, so we do not have combining sequences! There were ill-defined preposed logically following combining marks (in the charts, but not the tables) in Unicode 1.0, but the problems with implementing them in the Thai monosyllable ???? were so great that I wonder if any one succeeded at the time - with invisible PHINTHU, as opposed to with visible PHINTHU! The official disinformation source, http://www.unicode.org/glossary, misdefines logical order to be ?the order in which text is typed on a keyboard?. So much for suggestions that one should design keyboard interfaces to convert visual order to storage order! A striking example is New Tai Lue, whose standard ordering was changed from phonetic order to visual order because it was found that the logical order, even using the Unicode *character* encoding, was visual order rather than phonetic order. Richard. From unicode at unicode.org Tue Oct 30 15:26:22 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Tue, 30 Oct 2018 22:26:22 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> Message-ID: <20181030202622.GA16380@macbook.localdomain> On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote: > E.g. in Arabic script, superscript is considered worth > encoding and using without any caveat, whereas when Latin script is on, > superscripts are thrown into the same cauldron as underscoring. 
Curious, what Arabic superscripts are encoded in Unicode? Regards, Khaled From unicode at unicode.org Tue Oct 30 15:26:42 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 30 Oct 2018 20:26:42 +0000 (GMT) Subject: A sign/abbreviation for "magister" References: <795781780.7176.1540914767836.JavaMail.www@wwinf2209> Message-ID: On 2018-10-30, Marcel Schneider via Unicode wrote: > Dr Bradfield just added on 30/10/2018 at 14:21 something that I didn?t > know when replying to Dr Ewell on 29/10/2018 at 21:27: >> The English abbreviation Mr was also frequently superscripted in the >> 15th-17th centuries, and that didn't mean anything special either - it >> was just part of a general convention of superscripting the final >> segment of abbreviations, probably inherited from manuscript practice. > > So English dropped the superscript requirement for common abbreviations Who said anything about requirement? I didn't. The practice of using superscripts to end abbreviations is alive and well in manuscript - I do it myself in writting notes for myself. For example, "condition" I will often write as "condn", and "equation" as "eqn". > in the 17?? or 18?? century to keep it only for ordinals. Should Unicode What do you mean, for ordinals? If you mean 1st, 2nd etc., then there is not now (when superscripting looks very old-fashioned) and never has been any requirement to superscript them, as far as I know - though since the OED doesn't have an entry for "1st", I can't easily check. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Oct 30 15:38:18 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 Oct 2018 13:38:18 -0700 Subject: A sign/abbreviation for "magister" Message-ID: <20181030133818.665a7a7059d7ee80bb4d670165c8327d.4cbd4f03b4.wbe@email03.godaddy.com> Julian Bradfield wrote: >> in the 17?? or 18?? century to keep it only for ordinals. Should >> Unicode > > What do you mean, for ordinals? If you mean 1st, 2nd etc., then there > is not now (when superscripting looks very old-fashioned) and never > has been any requirement to superscript them, as far as I know - > though since the OED doesn't have an entry for "1st", I can't easily > check. The English Wikipedia article "Ordinal number (linguistics)" does not show numbers such as 1st, 2nd, etc. with superscripts, though as a rich-text Web page, it could easily. The article "English numerals" does include a bullet point: "The suffixes -th, -st, -nd and -rd are occasionally written superscript above the number itself." Note the word "occasionally." -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Oct 30 16:02:43 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Oct 2018 22:02:43 +0100 (CET) Subject: A sign/abbreviation for "magister" Message-ID: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209> On 30/10/2018? at 21:34, Khaled Hosny via Unicode wrote: >? > On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote: > > E.g. in Arabic script, superscript is considered worth? > > encoding and using without any caveat, whereas when Latin script is on,? > > superscripts are thrown into the same cauldron as underscoring. >? > Curious, what Arabic superscripts are encoded in Unicode? ? First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671. But it is a vowel sign. Many letters put above are called superscript? when explaining in English. ? 
There is the range U+FC5E..U+FC63 (presentation forms). ? Best regards, ? Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Oct 30 16:23:34 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 Oct 2018 14:23:34 -0700 Subject: [getting OT] Re: A sign/abbreviation for "magister" Message-ID: <20181030142334.665a7a7059d7ee80bb4d670165c8327d.50dbbbe7bb.wbe@email03.godaddy.com> Marcel Schneider replied to Khaled Hosny: >>> E.g. in Arabic script, superscript is considered worth encoding and >>> using without any caveat, [...] >> >> Curious, what Arabic superscripts are encoded in Unicode? > > [...] There is the range U+FC5E..U+FC63 (presentation forms). Arabic presentation forms are never an example of anything, and their use is full of caveats. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Oct 30 16:32:57 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 30 Oct 2018 21:32:57 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Message-ID: Doug Ewell responded to Marcel Schneider, >> while Cyrillic script currently uses ?small caps? to write in >> lowercase. > > You're joking, right? > > ?? ?? ?? ?? > > This undermines a lot of what you are claiming to know > about writing systems, and about the difference between > case distinctions and styling. That seems unduly harsh.? None of us are perfect; we all make mistakes.? The lowercase part of Cyrillic casing pairs do resemble small caps for most letters.? One casual mistake given in an aside does not negate the rest of Marcel Schneider's points.? One error about a related script (Cyrillic) does not undermine his thoughtful expectations for the Latin script as a French language member of the Latin script user community. As an aside, calling a mister a doctor isn't insulting but calling a doctor a mister might be.? I suppose we could all call each other magister here, just to be safe, but we can't seem to agree on how to encode its abbreviation. From unicode at unicode.org Tue Oct 30 16:50:27 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 30 Oct 2018 14:50:27 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Message-ID: On 10/30/2018 2:32 PM, James Kass via Unicode wrote: > but we can't seem to agree on how to encode its abbreviation. For what it's worth, "mgr" seems to be the usual abbreviation in Polish for it. --Ken From unicode at unicode.org Tue Oct 30 16:52:45 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Tue, 30 Oct 2018 23:52:45 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209> References: <1125638808.10320.1540933363084.JavaMail.www@wwinf2209> Message-ID: <20181030215245.GB16380@macbook.localdomain> On Tue, Oct 30, 2018 at 10:02:43PM +0100, Marcel Schneider wrote: > On 30/10/2018? at 21:34, Khaled Hosny via Unicode wrote: > >? > > On Tue, Oct 30, 2018 at 04:52:47PM +0100, Marcel Schneider via Unicode wrote: > > > E.g. in Arabic script, superscript is considered worth? > > > encoding and using without any caveat, whereas when Latin script is on,? 
> > > superscripts are thrown into the same cauldron as underscoring. > >? > > Curious, what Arabic superscripts are encoded in Unicode? > ? > First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671. > But it is a vowel sign. Many letters put above are called superscript? > when explaining in English. As you say, this is a vowel sign not a superscript letter, so the name is a misnomer at best. It should have been called COMBINING ARABIC LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic it is called small or dagger alef. > There is the range U+FC5E..U+FC63 (presentation forms). That is a backward compatiplity block no one is supposed to use, there are many such backward comatipility presentation forms even of Latin script (U+FB00..U+FB4F). So I don?t see what makes you think, based on this, that Unicode is favouring Arabic or other scripts over Latin. Regards, Khaled From unicode at unicode.org Tue Oct 30 17:41:06 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 30 Oct 2018 23:41:06 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Message-ID: > On 30 Oct 2018, at 22:50, Ken Whistler via Unicode wrote: > > On 10/30/2018 2:32 PM, James Kass via Unicode wrote: >> but we can't seem to agree on how to encode its abbreviation. > > For what it's worth, "mgr" seems to be the usual abbreviation in Polish for it. That seems to be the contemporary usage, but the postcard is from 1917, cf. the OP. Also, the transcription in the followup post suggests that the Polish script at the time, or at least of the author, differed from the commonly taught D'Nealian cursive [1], cf. the "z". A variation of the latter has ended up as the Unicode MATHEMATICAL SCRIPT letters, which is closer to the Swedish cursive [2] for some letters. 1. https://en.wikipedia.org/wiki/D'Nealian 2. https://sv.wikipedia.org/wiki/Skrivstil From unicode at unicode.org Wed Oct 31 00:45:13 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Wed, 31 Oct 2018 06:45:13 +0100 Subject: second attempt (was: A sign/abbreviation for "magister") In-Reply-To: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> (Doug Ewell via Unicode's message of "Mon, 29 Oct 2018 12:20:49 -0700") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> Message-ID: <86k1lypt3q.fsf@mimuw.edu.pl> My previous attempt to send this mail was rejected by the list as spam. If this one will not appear on the list, would you be so kind to forward it to the list and the listmaster? On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: [...] > The abbreviation in the postcard, rendered in > plain text, is "Mr". The relevant fragment of the postcard in a loose translation is Use the following address: ... is the abbreviation of magister. I don't think your rendering Mr is the abbreviation of magister. has the same meaning. Please note that I didn't asked *whether* to encode the abbreviation. I asked *how* to do it. If you think it is impossible to encode it in Unicode (without using PUA), just say this explicitely. BTW, I find it strange that nobody refers to an old thread https://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0117.html Best regards Janusz -- , Janusz S. 
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Oct 31 02:27:47 2018 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Wed, 31 Oct 2018 07:27:47 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <1918728727.9415.1540925466807.JavaMail.www@wwinf2209> References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> <1918728727.9415.1540925466807.JavaMail.www@wwinf2209> Message-ID: <02fe068b-b6f5-0bbd-9af2-338f70756806@it.aoyama.ac.jp> On 2018/10/31 03:51, Marcel Schneider via Unicode wrote: > On 30/10/2018 at 18:59, Doug Ewell via Unicode wrote: >> >> Marcel Schneider wrote: >> >>> This use case is different from the use case that led to submit >>> the L2/18-206 proposal, cited by Dr Ewell on 29/10/2018 at 20:29: >> >> I guess this is intended as a compliment. > > Right. > >> While many of the people you >> quoted do have doctoral degrees, many others of us do not. And even those who have such degrees don't expect them to be used on a mailing list. > Making a safe distinction is beyond my knowledge, safest is not to discriminate. Yes. The easiest way to not discriminate is to not use titles in mailing list discussions. That's what everybody else does, and what I highly recommend. Regards, Martin. From unicode at unicode.org Wed Oct 31 04:38:25 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Wed, 31 Oct 2018 09:38:25 +0000 (GMT) Subject: second attempt (was: A sign/abbreviation for "magister") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> Message-ID: On 2018-10-31, Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode wrote: > On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: [ as did I in private mail ] >> The abbreviation in the postcard, rendered in >> plain text, is "Mr". > > The relevant fragment of the postcard in a loose translation is > > Use the following address: ... > is the abbreviation of magister. > > I don't think your rendering > > Mr is the abbreviation of magister. > > has the same meaning. I do, for the reasons stated by many. If the topic were a study of the ways in which people indicate abbreviations by typographic or manuscript styling, then it would be important to know the exact form of the marks; but that is not plain text. One cannot expect to discuss detailed technical questions using only plain text, other than by using language to describe the details. > Please note that I didn't asked *whether* to encode the abbreviation. I > asked *how* to do it. Doug and I have argued that the encoding is "Mr". Further detail can be given in natural language as a note. You could use the various hacks you've discussed, with modifier letters; but that is not "encoding", that is "abusing Unicode to do markup". At least, that's the view I take! Perhaps a more challenging case is that at one time in English, it was common to write and print "the" as "ye" (from older "?e"). Here, there is actually a potential contrast between the forms "ye" ("the") and "ye" (2nd plural pronoun), and the contrast could be realized: "the/ye idle braggarts are a curse upon England". Is the encoding of "ye" to be "ye" or "the"? A hard-line plain-texter such as myself would probably argue for "the". -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. 
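P.S. The difference between those candidate encodings is easy to make concrete; a minimal Python sketch, standard library only. The third spelling, "y" followed by U+1D49 MODIFIER LETTER SMALL E, is the superscript device discussed elsewhere in this thread.

    import unicodedata

    candidates = ["ye", "\u00FEe", "y\u1D49"]   # "ye", "þe" with THORN, and "y" + MODIFIER LETTER SMALL E

    for s in candidates:
        print([f"U+{ord(c):04X}" for c in s], "->", unicodedata.normalize("NFKC", s))
    # ['U+0079', 'U+0065'] -> ye
    # ['U+00FE', 'U+0065'] -> þe
    # ['U+0079', 'U+1D49'] -> ye

A raw substring search keeps all three apart; NFKC folding silently merges the superscripted form with plain "ye", and mapping "þe" (or "ye") to "the" would need an explicit, language-specific table that no normalization form provides.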
From unicode at unicode.org Wed Oct 31 05:12:16 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 03:12:16 -0700 Subject: second attempt In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> Message-ID: <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 31 06:53:22 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 31 Oct 2018 11:53:22 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> Message-ID: <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com> Responding to Julian Bradfield, U+1D49 MODIFIER LETTER SMALL E General Category: Letter, Modifier Decomposition Type Mapping: U+0065 It's a spacing superscript Latin lower case "E". It's a letter. People spell with letters. "One of the goals of the Consortium is to preserve humanity's common linguistic heritage and provide universal access for the world's languages - past, present, and future." Superscripts and subscripts are part of the Latin writing system. If the source says "yᵉ" or "þᵉ", that's what I would enter into the database. Otherwise it's just transcription, IMHO. If the goal is to preserve the past by transcribing it, we could've done that with ASCII. Having "yᵉ" or "þᵉ" in the database makes the database more human-readable than having mark-up such as "y<sup>e</sup>" and takes fewer bytes. DUCET allows for desired collation results. Searching for "yᵉ" or "þᵉ" could get only those files which included the specific string and not all the files which include strings "ye", "þe", or "the". The superscript lower case Latin "E" also has "grapheme base" listed as one of its binary properties, so it might be OK to add a line or two under one, if that's what's desired. If the superscript lower case Latin letter "E", ("ᵉ"), cannot be used in this instance because it is supposed to *modify* the preceding character, then is its usage in this question a "hack"? It isn't modifying that ASCII quote at all. Providing mark-up solutions isn't universal, but computer plain-text is. For the OP's question, PUA for perfect display and no guarantee of interoperability, "Mr" for transcription, or (what Michael said initially) "Mʳ". I think it would be OK to add something like a combining equals sign below to Michael's suggested string and make it "Mʳ͇", but it wouldn't display well unless a font's OpenType tables provided for it. From unicode at unicode.org Wed Oct 31 07:34:53 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 13:34:53 +0100 (CET) Subject: A sign/abbreviation for "magister" Message-ID: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> Thank you for your feedback. On 30/10/2018 at 22:52, Khaled Hosny wrote: > > First, ARABIC LETTER SUPERSCRIPT ALEPH U+0671. > > But it is a vowel sign. Many letters put above are called superscript > > when explaining in English. > As you say, this is a vowel sign not a superscript letter, so the name > is a misnomer at best. It should have been called COMBINING ARABIC > LETTER ALEF ABOVE, similar to COMBINING LATIN SMALL LETTER A. In Arabic > it is called small or dagger alef. Thank you for this information. Indeed the current French translation
named it: 0670 DIACRITIQUE VOYELLE ARABE ALIF EN CHEF * l'appellation anglaise de ce caract?re est erron?e http://hapax.qc.ca/ListeNoms-10.0.0.txt Translation: 0670 COMBINING ARABIC VOWEL ALEF ABOVE * the English designation of this character is mistaken ? Sorry for mistyping its code point, and for forgetting these facts. What?s surprising, then, may be the facility it was named using SUPERSCRIPT,? while superscripts seemed to be disliked in the Standard. ? I note, now, that it should be called COMBINING ARABIC LETTER ALEF ABOVE, as you indicate. (Translating to French as DIACRITIQUE LETTRE ARABE ALIF EN CHEF). ? >? > > There is the range U+FC5E..U+FC63 (presentation forms). >? > That is a backward compatiplity block no one is supposed to use, there > are many such backward comatipility presentation forms even of Latin > script (U+FB00..U+FB4F). >? > So I don?t see what makes you think, based on this, that Unicode is > favouring Arabic or other scripts over Latin. ? Indeed it doesn?t. Sorry about my assumption, but I mainly cited Arabic? first because its name starts with an A, and I remembered it uses a? ?SUPERSCRIPT? in running text. ? Other scripts have: 10FC MODIFIER LETTER GEORGIAN NAR # 10DC 2D6F TIFINAGH MODIFIER LETTER LABIALIZATION MARK # 2D61 A69C MODIFIER LETTER CYRILLIC HARD SIGN # 044A A69D MODIFIER LETTER CYRILLIC SOFT SIGN # 044C [but the latter two are for dialectology] These are in the Duployan block: 1BCA2 SHORTHAND FORMAT DOWN STEP 1BCA3 SHORTHAND FORMAT UP STEP because vertical alignment is significant in stenography. So it is in Latin script when superscript us used as an? abbreviation indicator. However I see that the subjoiners and subjoined letters? are obeying to another scheme than what led to super- or? subscript. ? On 31/07/2018 at 08:27, Martin J. D?rst wrote: > > > Making a safe distinction is beyond my knowledge, safest is not to discriminate. > > Yes. The easiest way to not discriminate is to not use titles in mailing? > list discussions. That's what everybody else does, and what I highly? > recommend. ? OK. That is sound practice, which I observed a long time, until I felt best using Dr.? Thanks for clearing it up. ? On 30/10/2018 at 21:34, Julian Bradfield via Unicode wrote: ? > The practice of using superscripts to end abbreviations is alive and > well in manuscript - I do it myself in writting notes for myself. For > example, "condition" I will often write as "condn", and > "equation" as "eqn". ? That tends to prove that legibility is suboptimal without superscripts,? even in note/draft style, and consequently, in machine processed plain text? ?only more so? (quoting an expression from Ken Whistler?s reply to? James Kass on 30/10/2018 05:54). ? > > in the 17?? or 18?? century to keep it only for ordinals. Should Unicode? >? > What do you mean, for ordinals? If you mean 1st, 2nd etc., then there > is not now (when superscripting looks very old-fashioned) and never > has been any requirement to superscript them, as far as I know - > though since the OED doesn't have an entry for "1st", I can't easily > check. ? Then French, Italian, Portuguese and Spanish seem to be the only locales having? superscript ordinal indicator requirements, or preferences if you prefer.? ? The following forum has a comprehensive explanation for English, and for Romance? languages except French: https://english.stackexchange.com/questions/111265/should-ordinal-indicators-be-inline Especially it explains where the American English lining ordinal indicators came from. ? 
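To make that concrete, here is a small Python sketch; the helper function is mine, purely illustrative, using the Académie's spellings (1er, 1re, 2e, ...) with U+1D49 MODIFIER LETTER SMALL E and U+02B3 MODIFIER LETTER SMALL R for the superscripted suffixes:

    # U+1D49 MODIFIER LETTER SMALL E, U+02B3 MODIFIER LETTER SMALL R
    SUP = {"e": "\u1D49", "r": "\u02B3"}

    def ordinal_fr(n, feminine=False):
        # 1er / 1re, then 2e, 3e, ... with the suffix in preformatted superscript letters
        suffix = ("re" if feminine else "er") if n == 1 else "e"
        return str(n) + "".join(SUP[c] for c in suffix)

    print(ordinal_fr(1))                 # 1 followed by superscript e r
    print(ordinal_fr(1, feminine=True))  # 1 followed by superscript r e
    print(ordinal_fr(2))                 # 2 followed by superscript e

Nothing new needs to be encoded for this; whether such strings belong in plain text at all is, of course, the point under dispute.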
English Wikipedia?s Ordinal indicator article? https://en.wikipedia.org/wiki/Ordinal_indicator states that ordinal indicators and superscript letters don?t share the same glyph,? which would explain why there was an intent to project a proposal for encoding French? ordinal indicators. (But I advised that that would be a waste of time, as Unicode?s? preformatted superscripts are working out of the box.)? ? Preformatted Unicode superscript small letters are meeting the French superscript? requirement, that is found in: http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux (in French). This brief article focuses on the spelling of the indicators,? without questioning the fact that they are superscript. ? On 31/08/2018 at 06:54, Janusz S. Bie? via Unicode wrote: [?] > BTW, I find it strange that nobody refers to an old thread >? > https://www.unicode.org/mail-arch/unicode-ml/y2016-m12/0117.html ? I thought at linking to some of my previous e-mails and would probably have picked? this one. Thanks for remembering, and for reminding. ? Best regards, ? ? Marcel ? From unicode at unicode.org Wed Oct 31 09:57:20 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 15:57:20 +0100 (CET) Subject: A sign/abbreviation for "magister" (was: Re: second attempt) In-Reply-To: <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> Message-ID: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote: > > On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > > > You could use the various hacks > > you've discussed, with modifier letters; but that is not "encoding", > > that is "abusing Unicode to do markup". At least, that's the view I > > take! > > +1 There seems to be a widespread confusion about what is plain text, and what Unicode is for. From an US-QWERTY point of view, a current mental representation of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute. Unicode is granting to every language its plain text representation. If superscript acts as abbreviation indicator in a given language, this is part of the plain text representation of that language. So far, so good. The core problem is now to determine whether superscript is mandatory, and baseline is fallback, or superscript is optional and decorative, and baseline is correct. That may be a matter of opinion, as has been suggested. However we know now a list of languages where superscript is mandatory, and baseline is fallback. Leaving English alone, these languages on themselves need the use of preformatted superscript letters being granted to them by the UTC. Still in the beginning, when early Unicode set up the Standard, superscript was ruled out of plain text, except when there was sort of a strong lobbying, like when Vietnamese precomposed letters were added. Phoneticists have a strong lobby, so they got some ranges of preformatted letters. To make sure nobody dare use them in running text elsewhere, all *new* superscript letters got names on a MODIFIER LETTER basis, while subscript letters got straightforward names having SUBSCRIPT in them. Additionally, strong caveats were published in TUS. And the trick worked, as most of the time, one is now referring to the superscript letters using the ?modifier letter? 
label that Unicode have decked them out with. That is why, today, any discussion is at risk of being subject to strong biases when its result should allow some languages to use their traditional abbreviation indicators, in an already encoded and implemented form. Fortunately the front has begun to move, as CLDR TC have granted ordinal indicators to the French locale per v34. Ordinal indicators are one category of abbreviation indicators. Consistently, the already-ISO/IEC-8859-1-and-now-Unicode ordinal indicators are used also in titles like "S?", "N? S?", as found in the navigation pane of: http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea I?m not quite sure whether some people would still argue that that string isn?t understood differently from "Na Sa". > In general, I have a certain sympathy for the position that there is no universal > answer for the dividing line between plain and styled text; there are some texts > where the conventional division of plain test and styling means that the plain > text alone will become somewhat ambiguous. That is why phonetics need preformatted super- and subscripts, and so do languages relying on superscript as an abbreviation indicator. > We know that for mathematics, a different dividing line meant that it is possible > to create an (almost) plain text version of many (if not most) mathematical > texts; the conventions of that field are widely shared -- supporting a case for > allowing a standard encoding to support it. Referring to Murray Sargent?s UnicodeMath, a Nearly Plain Text Encoding of Mathematics, https://www.unicode.org/notes/tn28/ is always a good point in this discussion. UnicodeMath uses the full range of superscript digits, because the range is full. It does not use superscript letters, because their range is not full. Hence if superscript digits had stopped at the legacy range "???", only measurement units like the metric equivalents of sq ft and cb ft could be written with superscripts, and that is already allowed according to TUS. I?m ignoring why superscript 1 was added to ISO/IEC 8859-1, though. Anyway, since phonetics need a full range of superscript and subscript digits, these were added to Unicode, and therefore are used in UnicodeMath. Likewise, phonetics need a nearly-full range of superscript letters, so these were added to Unicode, and therefore are used in the digital representation of natural languages. > However, it stops short of 100% support for edge cases, as does the ordinary > plain text when used for "normal" texts. I think, on balance, that is OK. That is not clear as long as ?ordinary plain text? is not defined for the purpose of this discussion. Since I have superscript small letters on live keys, and the superscript "?" even doubled on the same level as the digits (that it is used to transform into ordinals for most of them), my French keyboard layout driver allows the OS to output ordinary plain text consisting of various signs including superscript small Latin letters. Now is Unicode making a difference between ?plain text? and ?ordinary plain text?? There are various ways to ?clean up? the UCS, first removing presentation forms, then historic letters, then mathematical symbols, then why not emoji, and somewhere in-between, phonetic letters, among which superscripts. The result would then be ?ordinary plain text? ? but to what purpose? Possibly so that all documents must be written up using TeX. 
Following that logic to its end would mean that composed letters should be removed, too, given they are accurately represented using escape sequences like "e\'" for "?". > If there were another important notational convention, widely shared, > reasonably consistent and so on, then I see no principled objection to considering > whether it should be supported (minus some edge cases) in its own form of > plain text (with appropriate additional elements encoded). I?m pleased to read that. Given the use of superscript in French is important, widely shared, and reasonably consistent, we need to know what it should be else. Certainly: supported by the local keyboard layout. Hopefully it will be, soon. > The current case, transcribing a post-card to make the text searchable, for > example, would fit the use case for ordinary plain text, with the warning against > simulated effects of markup. Triggering such a warning would need to first sort out whether a given representation is best encoded using plain text or using markup. If it?s plain text, then that is not simulating anything. The reverse is true: Markup simulates accurate plain text. Searchability is ensured by equivalence classes. Google Search has most comprehensive equivalence classes, indexing even all mathematical preformatted Latin letters like plain ASCII. > All other uses are better served by markup, whether > SGML / XML style to capture identified features, or final-form rich text like PDF > just preserving the appearance. Agreed. Best regards, Marcel From unicode at unicode.org Wed Oct 31 10:45:21 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Wed, 31 Oct 2018 15:45:21 +0000 (GMT) Subject: A sign/abbreviation for "magister" (was: Re: second attempt) In-Reply-To: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> Message-ID: <9272010.33324.1541000721948.JavaMail.defaultUser@defaultHost> There was a proposal, in the Bytext Report by Bernard Miller many years ago to introduce arrow parentheses characters, eight of them. They were stateful, one character to mean that effectively everything following is superscript until told otherwise, and one for everything following is no longer superscript until told otherwise. There were also pairs for subscript, for the upper limit of an integral and for the the lower limit of an integral and those two latter pairs could also be used with the capital sigma sign used to express the summation of a mathematical series. Now, I appreciate that the statefulness of those suggested characters may still rule them out for implementation in plain text yet maybe an arrow parenthesis or something like it could be encoded that is like a combining accent character but has the effect of making the one character that it follows be a superscript character, and another similar character for subscripts. That would mean that any Unicode character could be used as a superscript or a subscript in plain text. Maybe another two, or maybe another four, such characters could be added so as to allow the limits of integrals and summations to be expressed in plain text using such a method. 
These new characters could have a visible glyph as a fallback display yet not be displayed at all if, as a result of glyph substitution for the two character sequence, a superscript or subscript version of the first character of the two character sequence were displayed. William Overington Wednesday 31 October 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/10/31 - 14:57 (GMTST) To : unicode at unicode.org Subject : Re: A sign/abbreviation for "magister" (was: Re: second attempt) On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote: > > On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote: > > > You could use the various hacks > > you've discussed, with modifier letters; but that is not "encoding", > > that is "abusing Unicode to do markup". At least, that's the view I > > take! > > +1 There seems to be a widespread confusion about what is plain text, and what Unicode is for. From a US-QWERTY point of view, a current mental representation of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute. Unicode is granting to every language its plain text representation. If superscript acts as abbreviation indicator in a given language, this is part of the plain text representation of that language. So far, so good. The core problem is now to determine whether superscript is mandatory, and baseline is fallback, or superscript is optional and decorative, and baseline is correct. That may be a matter of opinion, as has been suggested. However we know now a list of languages where superscript is mandatory, and baseline is fallback. Leaving English alone, these languages on themselves need the use of preformatted superscript letters being granted to them by the UTC. Still in the beginning, when early Unicode set up the Standard, superscript was ruled out of plain text, except when there was sort of a strong lobbying, like when Vietnamese precomposed letters were added. Phoneticists have a strong lobby, so they got some ranges of preformatted letters. To make sure nobody dare use them in running text elsewhere, all *new* superscript letters got names on a MODIFIER LETTER basis, while subscript letters got straightforward names having SUBSCRIPT in them. Additionally, strong caveats were published in TUS. And the trick worked, as most of the time, one is now referring to the superscript letters using the “modifier letter” label that Unicode have decked them out with.
From unicode at unicode.org Wed Oct 31 11:03:18 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Wed, 31 Oct 2018 18:03:18 +0200 Subject: A sign/abbreviation for "magister" (was: Re: second attempt) In-Reply-To: <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> Message-ID: <20181031160318.GD16380@macbook.localdomain> On Wed, Oct 31, 2018 at 03:57:20PM +0100, Marcel Schneider via Unicode wrote: > > We know that for mathematics, a different dividing line meant that it is possible > > to create an (almost) plain text version of many (if not most) mathematical > > texts; the conventions of that field are widely shared -- supporting a case for > > allowing a standard encoding to support it. > > Referring to Murray Sargent's UnicodeMath, a Nearly Plain Text Encoding of Mathematics, > https://www.unicode.org/notes/tn28/ > is always a good point in this discussion. UnicodeMath uses the full range of > superscript digits, because the range is full. It does not use superscript letters, > because their range is not full. Hence if superscript digits had stopped at the > legacy range "¹²³", only measurement units like the metric equivalents of sq ft and > cb ft could be written with superscripts, and that is already allowed according to > TUS. I'm ignoring why superscript 1 was added to ISO/IEC 8859-1, though. Anyway, > since phonetics need a full range of superscript and subscript digits, these were > added to Unicode, and therefore are used in UnicodeMath. A while ago I was localizing some application to Arabic and the developer “helpfully” used m² for square meter, but that does not work for Arabic because there is no superscript ٢ in Unicode, so I had to contact the developer and ask for markup to be used for the superscript so that I can use it as well. That nicely shows one of the problems with encoding superscript symbols for arbitrary text styling in Unicode, you can't stop before duplicating the whole character repertoire or else you will be discriminating against some writing system or uncommon usage. Regards, Khaled From unicode at unicode.org Wed Oct 31 11:20:47 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Wed, 31 Oct 2018 16:20:47 +0000 (GMT) Subject: A sign/abbreviation for "magister" References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> Message-ID: On 2018-10-31, Marcel Schneider via Unicode wrote: > Preformatted Unicode superscript small letters are meeting the French superscript > requirement, that is found in: > http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux > (in French). This brief article focuses on the spelling of the indicators, > without questioning the fact that they are superscript. When one does question the Académie about the fact, this is their reply: Le fait de placer en exposant ces mentions est de convention typographique ; il convient donc de le faire.
Les seules exceptions sont pour Mme et Mlle. which, if my understanding of "convient" is correct, carefully does not quite say that it is *wrong* not to superscript, but that one should superscript when one can because that is the convention in typography. My original question was: Dans les imprimés ou dans le manuscrit on écrit "1er, 45e" etc. (J'utilise l'indication HTML pour les lettres supérieures.) La question est: est-ce que les lettres supérieures sont *obligatoires*, ou sont-ils simplement une question de style? C'est à dire, si on écrit "1er, 45e" etc., est-ce une erreur, ou un style simple mais correct? I did not think that their Dictionary desk would understand the concept of plain text, so I didn't ask explicitly for their opinions on encoding :) Which takes us back to when typography is plain text... -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Wed Oct 31 12:18:46 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 18:18:46 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <20181031160318.GD16380@macbook.localdomain> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> Message-ID: <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> On 31/10/2018 at 17:03, Khaled Hosny wrote: > > A while ago I was localizing some application to Arabic and the developer > “helpfully” used m² for square meter, but that does not work for Arabic > because there is no superscript ٢ in Unicode, so I had to contact the > developer and ask for markup to be used for the superscript so that I > can use it as well. That nicely shows one of the problems with encoding > superscript symbols for arbitrary text styling in Unicode, you can't > stop before duplicating the whole character repertoire or else you will be > discriminating against some writing system or uncommon usage. It seems to me that Arabic is lacking two characters when using Eastern Arabic digits, not Western Arabic. Unicode allowing the m² and m³ unit notations, these should be implemented in any script using the same notation. Not the whole UCS, just these two, like Arabic per cent. Or do you have use cases in Arabic where superscript is used as an abbreviation indicator? I don't share the view according to which superscript is arbitrary in Latin. There is a medieval tradition of superscripting. If it is in Arabic, then it would be limited to these two missing digits. Many many symbols were encoded for Arabic, notably mirrored arrows, so adding these two is quite straightforward. Sad that Arabic ² and ³ are still missing. Best regards, Marcel From unicode at unicode.org Wed Oct 31 12:32:54 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Wed, 31 Oct 2018 18:32:54 +0100 Subject: second attempt In-Reply-To: (Julian Bradfield via Unicode's message of "Wed, 31 Oct 2018 09:38:25 +0000 (GMT)") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> Message-ID: <86in1im37d.fsf@mimuw.edu.pl> On Wed, Oct 31 2018 at 9:38 GMT, Julian Bradfield via Unicode wrote: > On 2018-10-31, Janusz S.
=?utf-8?Q?Bie=C5=84?= via Unicode wrote: >> On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote: > > [ as did I in private mail ] > >>> The abbreviation in the postcard, rendered in >>> plain text, is "Mr". >> >> The relevant fragment of the postcard in a loose translation is >> >> Use the following address: ... >> is the abbreviation of magister. >> >> I don't think your rendering >> >> Mr is the abbreviation of magister. >> >> has the same meaning. > > I do, for the reasons stated by many. How many? I'm aware only of you and Doug Ewell. > > If the topic were a study of the ways in which people indicate > abbreviations by typographic or manuscript styling, then it would be > important to know the exact form of the marks; but that is not plain > text. Let me remind what plain text is according to the Unicode glossary: Computer-encoded text that consists only of a sequence of code points from a given standard, with no other formatting or structural information. If you try to use this definition to decide what is and what is not a character, you get vicious circle. As mentioned already by others, there is no other generally accepted definition of plain text. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Oct 31 12:37:56 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Wed, 31 Oct 2018 18:37:56 +0100 Subject: use vs mention (was: second attempt) In-Reply-To: (Julian Bradfield via Unicode's message of "Wed, 31 Oct 2018 09:38:25 +0000 (GMT)") References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> Message-ID: <86efc6m2yz.fsf_-_@mimuw.edu.pl> On Wed, Oct 31 2018 at 9:38 GMT, Julian Bradfield via Unicode wrote: > On 2018-10-31, Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode wrote: [...] >> The relevant fragment of the postcard in a loose translation is >> >> Use the following address: ... >> is the abbreviation of magister. >> >> I don't think your rendering >> >> Mr is the abbreviation of magister. >> >> has the same meaning. > > I do The author of the postcard definitely *referred* to the abbreviation in the form *used* in the postcard. We don't know whether the abbreviation "Mr", spelled exactly this way, already existed in that time and in that geographical area. You still don't see the difference in the meaning? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Oct 31 13:10:16 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 19:10:16 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> Message-ID: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote: > > On 2018-10-31, Marcel Schneider via Unicode wrote: > > > Preformatted Unicode superscript small letters are meeting the French superscript > > requirement, that is found in: > > http://www.academie-francaise.fr/abreviations-des-adjectifs-numeraux > > (in French). This brief article focuses on the spelling of the indicators, > > without questioning the fact that they are superscript. > > When one does question the Acad?mie about the fact, this is their > reply: > > Le fait de placer en exposant ces mentions est de convention > typographique ; il convient donc de le faire. 
Les seules exceptions sont pour Mme et Mlle. Translation: “Superscripting these mentions is typographical convention; consequently it is convenient to do so. The only exceptions are for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].” > > which, if my understanding of "convient" is correct, carefully does not > quite say that it is *wrong* not to superscript, but that one should > superscript when one can because that is the convention in typography. Draft style may differ from mail style, and this, from typography, only due to the limitations imposed by input interfaces. These limitations are artificial and mainly the consequence of insufficient development of said interfaces. If the computer is anything good for, then that should also include the transition from typewriter fallbacks to the true digital representation of all natural languages. Latin not excluded. > > My original question was: > > Dans les imprimés ou dans le manuscrit on écrit "1er, 45e" > etc. (J'utilise l'indication HTML pour les lettres supérieures.) > > La question est: est-ce que les lettres supérieures sont > *obligatoires*, ou sont-ils simplement une question de style? C'est à > dire, si on écrit "1er, 45e" etc., est-ce une erreur, ou un style > simple mais correct? Translation: “In print or handwriting one spells "1er, 45e", and so on. (I'm using HTML tags for the superscript letters.) The question is: Are the superscript letters *mandatory*, or are they simply a matter of style? I.e. when writing "1er, 45e", is that a mistake, or a simple but correct style?” > > I did not think that their Dictionary desk would understand the > concept of plain text, so I didn't ask explicitly for their opinions > on encoding :) If you don't think that they would understand character encoding and the concept of plain text as described in the Unicode Standard, you may wish to explain it to them in detail prior to asking for their opinion on the subject. Thank you anyway for letting us know. > > Which takes us back to when typography is plain text... When the typographic rendering is congruent with the underlying plain text, that means that there is no formatting; but that is quite impossible given the minimal default settings include a font and a font-size. If the plain text is an interoperable representation of a natural language, and that language uses superscript as an abbreviation indicator, that superscript must be visible when the text string is displayed as-is. Else the string referred to as “plain text” is at risk of not being a legible representation of the intended content. If despite that risk it is, then you are lucky. Best regards, Marcel From unicode at unicode.org Wed Oct 31 13:27:00 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 11:27:00 -0700 Subject: second attempt In-Reply-To: <86in1im37d.fsf@mimuw.edu.pl> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> Message-ID: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Oct 31 13:35:19 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 11:35:19 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> Message-ID: <84fa3796-22f5-f206-cf3a-84ddc9ad85bc@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 31 14:14:36 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 31 Oct 2018 12:14:36 -0700 Subject: second attempt In-Reply-To: <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: On 10/31/2018 11:27 AM, Asmus Freytag via Unicode wrote: > but we don't have an agreement that reproducing all variations in > manuscripts is in scope. In fact, I would say that in the UTC, at least, we have an agreement that that clearly is out of scope! Trying to represent all aspects of text in manuscripts, including handwriting conventions, as plain text is hopeless. There is no principled line to draw there before you get into arbitrary calligraphic conventions. And while this list is happily deep-ending on handwritten lines under superscript Latin letters in Polish abbreviations, keep in mind that *Han* characters alone constitute over 64% of the encoded characters in Unicode -- and the handwriting, style, and calligraphic conventions for Han make Latin look simple. Here: Japanese Postcard NY Greeting That is a New Year's greeting snipped from a 1906 Japanese postcard. Oh, snap! What are we going to do to represent the *leaves* (or are they feathers?) being used for handwritten strokes in that text??? --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: JapaneseNY.PNG Type: image/png Size: 56011 bytes Desc: not available URL: From unicode at unicode.org Wed Oct 31 16:57:37 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 14:57:37 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> Message-ID: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 31 17:30:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 31 Oct 2018 22:30:27 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <9d1ab84c-6b1f-6e37-bafc-67cbf4df17ab@gmail.com> Message-ID: <0abcbf67-bd82-a761-f21a-eb8780223209@gmail.com> In my last post I used the word "transcription".? It should have been "transliteration".? Sorry for the mistake.? Three times. 
FWIW, here's a corrected re-post. --- Responding to Julian Bradfield, U+1D49 MODIFIER LETTER SMALL E General Category: Letter, Modifier Decomposition Type Mapping: U+0065 It's a spacing superscript Latin lower case "E". It's a letter. People spell with letters. "One of the goals of the Consortium is to preserve humanity's common linguistic heritage and provide universal access for the world's languages – past, present, and future." Superscripts and subscripts are part of the Latin writing system. If the source says "yᵉ" or "þᵉ", that's what I would enter into the database. Otherwise it's just transliteration, IMHO. If the goal is to preserve the past by transliterating it, we could've done that with ASCII. Having "yᵉ" or "þᵉ" in the database makes the database more human-readable than having mark-up such as "y<sup>e</sup>" and takes fewer bytes. DUCET allows for desired collation results. Searching for "yᵉ" or "þᵉ" could get only those files which included the specific string and not all the files which include strings "ye", "þe", or "the". The superscript lower case Latin "E" also has "grapheme base" listed as one of its binary properties, so it might be OK to add a line or two under one, if that's what's desired. If the superscript lower case Latin letter "E", ("ᵉ"), cannot be used in this instance because it is supposed to *modify* the preceding character, then is its usage in this question a "hack"? It isn't modifying that ASCII quote at all. Providing mark-up solutions isn't universal, but computer plain-text is. For the OP's question, PUA for perfect display and no guarantee of interoperability, "Mr" for transliteration, or (what Michael said initially) "Mʳ". I think it would be OK to add something like a combining equals sign below to Michael's suggested string and make it "Mʳ͇", but it wouldn't display well unless a font's OpenType tables provided for it. From unicode at unicode.org Wed Oct 31 17:32:09 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 15:32:09 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <20181031160318.GD16380@macbook.localdomain> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> Message-ID: <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 31 17:34:33 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 23:34:33 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> <1714769165.8076.1541006326684.JavaMail.www@wwinf1d36> <3a187870-027c-7f2f-7736-e2b0806eb885@ix.netcom.com> Message-ID: <1788983878.9257.1541025273955.JavaMail.www@wwinf2209> On 31/10/18 at 23:05, Asmus Freytag via Unicode wrote: […] > > Sad that Arabic ² and ³ are still missing. > > How about all the other sets of native digits? The missing ones are hopefully already on the roadmap. Or do you refer to the missing ² and ³ in all other native digits?
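The character properties cited above for U+1D49 are easy to check programmatically. What follows is a minimal sketch in Python (assuming only the standard unicodedata module; the fold() helper is an illustrative name, not anything from the thread) of the compatibility decomposition to U+0065 and of the kind of NFKD-based folding that search engines can use to build the equivalence classes discussed in this thread, whereas an exact-match search simply compares the raw strings:

    import unicodedata

    # U+1D49 MODIFIER LETTER SMALL E carries a compatibility (<super>) decomposition to U+0065.
    ch = "\u1d49"
    print(unicodedata.name(ch))           # MODIFIER LETTER SMALL E
    print(unicodedata.decomposition(ch))  # <super> 0065

    def fold(s: str) -> str:
        # NFKD maps superscript letters and digits to their base characters; casefold() ignores case.
        return unicodedata.normalize("NFKD", s).casefold()

    # Loose search treats "y\u1d49" and "ye" as equivalent; exact search does not.
    assert fold("y\u1d49") == fold("ye")
    assert "y\u1d49" != "ye"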
Obviously they need to be encoded if there is a demand like for Arabic. Thanks for the call. Best regards, Marcel From unicode at unicode.org Wed Oct 31 17:37:13 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Oct 2018 23:37:13 +0100 (CET) Subject: A sign/abbreviation for "magister" Message-ID: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209> On 31/10/2018 19:42, Asmus Freytag via Unicode wrote: > > On 10/31/2018 11:10 AM, Marcel Schneider via Unicode wrote: > > > > > which, if my understanding of "convient" is correct, carefully does > > > [not] quite say that it is *wrong* not to superscript, but that one should > > > superscript when one can because that is the convention in typography. > > > > Draft style may differ from mail style, and this, from typography, only > > due to the limitations imposed by input interfaces. These limitations are > > artificial and mainly the consequence of insufficient development of said > > interfaces. If the computer is anything good for, then that should also > > include the transition from typewriter fallbacks to the true digital > > representation of all natural languages. Latin not excluded. > > It is a fallacy that all text output on a computer should match the convention > of "fine typography". > > Much that is written on computers represents an (unedited) first draft. Giving > such texts the appearance of texts, which in the day of hot metal typography, > was reserved for texts that were fully edited and in many cases intended for > posterity is doing a disservice to the reader. > The disconnect is in many people believing the user should be disabled to write his or her language without disfiguring it by lack of decent keyboarding, and that such input should be considered standard for user input. Making such text usable for publishing needs extra work, that today many users cannot afford, while the mass of publishing has increased exponentially over the past decades. The result is garbage, following the rule of ?garbage in, garbage out.? The real disservice to the reader is not to enable the inputting user to write his or her language correctly. A draft whose backbone is a string usable as-is for publishing is not a disservice, but a service to the reader, paying the reader due respect. Such a draft is also a service to the user, enabling him or her to streamline the workflow. Such streamlining brings monetary and reputational benefit to the user. That disconnect seems to originate from the time where the computer became a tool empowering the user to write in all of the world?s languages thanks to Unicode. The concept of ?fine typography? was then used to draw a borderline between what the user is supposed to input, and what he or she needs to get for publication. In the same move, that concept was extended in a way that it should include the quality of the string, additionally to what _fine typography_ really is: fine tuning of the page layout, such as vertical justification, slight variations in the width of non-breakable spaces, and of course, discretionary ligatures. Producing a plain text string usable for publishing was then put out of reach of most common mortals, by using the lever of deficient keyboarding, but also supposedly by an ?encoding error? 
(scare quotes) in the line break property of U+2008 PUNCTUATION SPACE, that should be non-breakable like its siblings U+2007 FIGURE SPACE (still – as per UAX #14 – recommended for use in numbers) and U+2012 FIGURE DASH to gain the narrow non-breaking space needed to space the triads in numbers using space as a group separator, and to space big punctuation in a Latin script using locale, where JTC1/SC2/WG2 had some meetings for the UCS: French. For everybody having beneath his or her hands a keyboard whose layout driver is programmed in a fully usable way, the disconnect implodes. At encoding and input levels (the only ones that are really on-topic in this thread) the sorcery called fine typography sums then up to nothing else than having the keyboard inserting fully diacriticized letters, right punctuation, accurate space characters, and superscript letters as ordinal indicators and abbreviation endings, depending on the requirements. Now was I talking about “all text output on a computer”? No, I wasn't. The computer is able to accept input of publishing-ready strings, since we have Unicode. Precluding the user from using the needed characters by setting up caveats and prohibitions in the Unicode Standard seems to me nothing else than an outdated operating mode. U+202F NARROW NO-BREAK SPACE, encoded in 1999 for Mongolian [1][2], has been readily ripped off by the French graphic industry. In 2014, TUS started mentioning its use in French [3]; in 2018, it put it on top [4]. That seems to me a striking example of how things encoded for other purposes are reused (or following a certain usage, “abused”, “hacked”, “hijacked”) in locales like French. If it wasn't an insult to minority languages, that language could be called, too, “digitally disfavored” in a certain sense. > On the other hand, I'm a firm believer in applying certain styling attributes > to things like e-mail or discussion papers. Well-placed emphasis can make such > texts more readable (without requiring that they pay attention to all other > facets of "fine typography".) The parenthesized sidenote (that is probably the intended main content?) makes this paragraph wrong. I'd buy it if either the parenthesis is removed or if it comes after the following. With due respect, I need to add that the disconnect in that is visible only to French readers. Without NNBSP, punctuation à la française in e-mails is messed up because even NBSP is ignored (I don't know what exactly happens at backend; anyway at frontend it's like a normal space in at least one e-mail client and in several if not all browsers, and if pasted in plain text from MS Word, it's truly replaced with SP). All that makes e-mails harder to read. Correct spacing with punctuation in French is often considered “fine-tuning”, but only if that punctuation spacing is not supported by the keyboard driver, and that's still almost always the case, except on the updated version 1.1 of the bépo layout (and some personal prototypes not yet released). Not using angle quotation marks doesn't fix it, given four other punctuation marks still need spacing (and are almost forcibly spaced with SP by lack of anything better), and given not using angle quotation marks makes any French text harder to read when there is no means to distinguish citation quotes « » and scare quotes “ ” following a scheme that may not be well known yet. See already [5] (with the reader comments) for an overview of the problem. Thank you for your attention.
Best regards, Marcel [1] TUS version 3, chapter 6, page 150, table: https://www.unicode.org/versions/Unicode3.0.0/ch06.pdf#%5B%7B%22num%22%3A4%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2Cnull%2C 214%2Cnull%5D [2] TUS version 10 (the last one having detailed bookmarks), ch. 13, p. 534: https://www.unicode.org/versions/Unicode10.0.0/ch13.pdf#I1.27802 [3] TUS version 7, chapter 6, page 265: https://www.unicode.org/versions/Unicode7.0.0/ch06.pdf#G17097 [4] TUS version 11, chapter 6, page 265 (no direct link): https://www.unicode.org/versions/Unicode11.0.0/ch06.pdf#G1834 [5] « Les antiguillemets comme symboles de la postvérité », /Le Devoir/, 2016-12-30 (in French): https://www.ledevoir.com/societe/actualites-en-societe/488139/mises-aux-points-les-antiguillemets-comme-symboles-de-la-postverite From unicode at unicode.org Wed Oct 31 17:58:12 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 31 Oct 2018 22:58:12 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <86in1im37d.fsf@mimuw.edu.pl> <4e4fe547-432b-2fda-0075-89c60d38e0b0@ix.netcom.com> Message-ID: <51ead4ad-27c1-9e12-e5d8-f2c84da0b1c8@gmail.com> Ken Whistler wrote, > Trying to represent all aspects of text in manuscripts, > including handwriting conventions, as plain text is > hopeless. There is no principled line to draw there > before you get into arbitrary calligraphic conventions. Very much agree. The post card in question is in cursive, for one thing, and the "t" in the spelled out word "Magister" isn't crossed. It's all about where we draw the line. I'd draw it on the "t" in this case, and enter the word into the data accordingly. From unicode at unicode.org Wed Oct 31 17:35:06 2018 From: unicode at unicode.org (Piotr Karocki via Unicode) Date: Wed, 31 Oct 2018 23:35:06 +0100 Subject: use vs mention (was: second attempt) Message-ID: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> >We don't know whether the abbreviation "Mr", spelled exactly this way, >already existed in that time and in that geographical area. > >You still don't see the difference in the meaning? Maybe another example, from chemistry: 14C = isotope of carbon (carbon 14) 14C = 14 units of carbon (mole, atoms, molecule) C14 = 14 atoms of carbon CI = carbon on first oxidation CI = molecule of carbon and iodine CV = carbon on fifth oxidation CV = molecule of carbon and vanadium CVV = molecule of carbon and vanadium, with vanadium on fifth oxidation CVV = molecule of carbon and vanadium, with vanadium on fifth oxidation, with carbon on fifth oxidation Ca2+ = plus sign means cation (of calcium with electrical charge 2) Ca2+ = plus sign means adding something to molecule of two atoms of calcium etc. So, what means 'plaintext' 14C? Which of two possible meanings? So, what means 'plaintext' CVV? What means "Ca2+"? Letter, digit, etc., placed as <sup> has different meanings than <sub>, and different than no-sup and no-sub. These are only examples of changes in meaning with <sup> or <sub>, not all of these examples can really exist - but, then, another question: can we know what author means? And as carbon and iodine cannot exist, then of course CI should be interpreted as carbon on first oxidation? But maybe author is student, taking exam, and he/she thinks about molecule of carbon and iodine? ---8<--- Piotr Karocki
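A small sketch in Python (again assuming only the standard unicodedata module, and using the existing superscript code points U+00B9, U+2074, U+00B2 and U+207A) makes the chemistry examples above concrete: preformatted superscripts keep the isotope and the cation unambiguous in plain text, while any pipeline that applies compatibility normalization or strips styling collapses them back into the ambiguous forms just discussed:

    import unicodedata

    isotope = "\u00b9\u2074C"    # ¹⁴C : the isotope carbon-14
    count   = "14C"              # 14 units of carbon
    cation  = "Ca\u00b2\u207a"   # Ca²⁺ : calcium cation with charge 2

    # Compatibility normalization folds the superscripts away, recreating the ambiguity:
    print(unicodedata.normalize("NFKC", isotope))           # 14C
    print(unicodedata.normalize("NFKC", cation))            # Ca2+
    print(unicodedata.normalize("NFKC", isotope) == count)  # True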
From unicode at unicode.org Wed Oct 31 18:11:39 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Thu, 1 Nov 2018 01:11:39 +0200 Subject: A sign/abbreviation for "magister" In-Reply-To: <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com> References: <20181029122049.665a7a7059d7ee80bb4d670165c8327d.aa84bf7970.wbe@email03.godaddy.com> <86k1lypt3q.fsf@mimuw.edu.pl> <28fec136-9870-8b39-51be-ceace224318e@ix.netcom.com> <1019555922.6265.1540997841099.JavaMail.www@wwinf1d36> <20181031160318.GD16380@macbook.localdomain> <64d5ae9b-a40e-ed40-ad28-9ed7c2b4e131@ix.netcom.com> Message-ID: <20181031231055.GJ16380@macbook.localdomain> On Wed, Oct 31, 2018 at 03:32:09PM -0700, Asmus Freytag via Unicode wrote: > On 10/31/2018 9:03 AM, Khaled Hosny via Unicode wrote: > > A while ago I was localizing some application to Arabic and the developer > “helpfully” used m² for square meter, but that does not work for Arabic > because there is no superscript ٢ in Unicode, so I had to contact the > developer and ask for markup to be used for the superscript so that I > can use it as well. > > This just pushes the issue down one level. > > Because it assumes that the presence/absence of markup is locale-independent. > > For translation of general text I know this is not true.
There are instances >> where some words in certain languages are customarily italicized in a way that >> is not lexical, therefore not something where the source language would ever >> supply markup. > That was a while ago, but IIRC, the markup was enabled for that > particular widget unconditionally. The localizer is now free to use the > markup or not use it, the string was translatable as whole with the > embedded markup. It should be possible to enable markup for any widget, > it is just an option to tick off in the UI designer, but may experience > is that markup is seldom needed in computer UIs, but I may be biased > with the kind of UIs and locales I?m most familiar with. All makes sense now. A./ > > Regards, > Khaled > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Oct 31 18:35:14 2018 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Wed, 31 Oct 2018 23:35:14 +0000 Subject: A sign/abbreviation for "magister" In-Reply-To: <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> Message-ID: <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp> On 2018/11/01 03:10, Marcel Schneider via Unicode wrote: > On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote: >> When one does question the Acad?mie about the fact, this is their >> reply: >> >> Le fait de placer en exposant ces mentions est de convention >> typographique ; il convient donc de le faire. Les seules exceptions >> sont pour Mme et Mlle. > Translation: > ?Superscripting these mentions is typographical convention; > consequently it is convenient to do so. The only exceptions are > for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].? >> >> which, if my understanding of "convient" is correct, carefully does >> quite say that it is *wrong* not to superscript, but that one should >> superscript when one can because that is the convention in typography. As for translation of "il convient", I think Julian is closer to the intended meaning. The verb "convenir" has several meanings (see e.g. https://www.collinsdictionary.com/dictionary/french-english/convenir), but especially in this impersonal usage, the meaning "it is advisable, it is right to, it is proper to" seems to be most appropriate in this context. It may not at all be convenient (=practical) to use the superscripts, e.g. if they are not easily available on a keyboard. Regards, Martin. (French isn't my native language, and nor is English) From unicode at unicode.org Wed Oct 31 19:21:08 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 17:21:08 -0700 Subject: A sign/abbreviation for "magister" In-Reply-To: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209> References: <2139479861.9258.1541025433428.JavaMail.www@wwinf2209> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Oct 31 19:24:26 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 1 Nov 2018 01:24:26 +0100 (CET) Subject: A sign/abbreviation for "magister" In-Reply-To: <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp> References: <696655972.4496.1540989293055.JavaMail.www@wwinf1d36> <23350023.8867.1541009416477.JavaMail.www@wwinf1d36> <07eec040-2a63-7dd2-d396-965438f9104f@it.aoyama.ac.jp> Message-ID: <1579143918.9351.1541031866725.JavaMail.www@wwinf2209> On 01/11/2018 at 00:41, Martin J. D?rst wrote: > > On 2018/11/01 03:10, Marcel Schneider via Unicode wrote: > > On 31/10/2018 at 17:27, Julian Bradfield via Unicode wrote: > > >> When one does question the Acad?mie about the fact, this is their > >> reply: > >> > >> Le fait de placer en exposant ces mentions est de convention > >> typographique ; il convient donc de le faire. Les seules exceptions > >> sont pour Mme et Mlle. > > Translation: > > ?Superscripting these mentions is typographical convention; > > consequently it is convenient to do so. The only exceptions are > > for "Mme" [short for "Madame", Mrs] and "Mlle" [short for "Mademoiselle", Ms].? > >> > >> which, if my understanding of "convient" is correct, carefully does > >> quite say that it is *wrong* not to superscript, but that one should > >> superscript when one can because that is the convention in typography. > > As for translation of "il convient", I think Julian is closer to the > intended meaning. The verb "convenir" has several meanings (see e.g. > https://www.collinsdictionary.com/dictionary/french-english/convenir), > but especially in this impersonal usage, the meaning "it is advisable, > it is right to, it is proper to" seems to be most appropriate in this > context. > > It may not at all be convenient (=practical) to use the superscripts, > e.g. if they are not easily available on a keyboard. Very good, thank you. I forgot about the meaning of ?convenient?, and didn?t think at ?advisable? nor at ?right to, proper to?. The point about keyboarding is essential. As long as superscripts are considered exotic or at least very special and need to be grabbed off a character picker, there is no point in bothering users with inputting them. But since that is going to change, it would be fine that Unicode be ready to back the corresponding keyboard layouts so that they won?t get challenged by the sort of considerations prevailing among hardliners. Partly, i.e. for fr(-FR) ordinal indicators, Unicode is ready. Best regards, Marcel > > (French isn't my native language, and nor is English) (Neither is mine either, but I?m based in France since a long time.) From unicode at unicode.org Wed Oct 31 20:01:51 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 31 Oct 2018 18:01:51 -0700 Subject: use vs mention (was: second attempt) In-Reply-To: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> References: <8aa249cef0c646e4525c6ac532ea7089@mail.gmail.com> Message-ID: <9a9790f7-39ca-5ddb-58c0-50dfb8cca6b8@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Oct 31 21:51:24 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Nov 2018 03:51:24 +0100 Subject: A sign/abbreviation for "magister" In-Reply-To: References: <20181030105122.665a7a7059d7ee80bb4d670165c8327d.6605d392f3.wbe@email03.godaddy.com> Message-ID: As is "Mgr" for Monseigneur in French ("Mgr" without superscripts makes little sense, and if "Mr" is sometimes found as an abbreviation for "Monsieur", its standard abbreviation is "M.", and its plural "Messieurs" is noted "MM" without any abbreviation dot or superscript, but normally never as "Mrs" or "Mrs"). If someone finds "Mgr" without the superscript, one could think it is an English abbreviation for "Manager" (a term now frequently used in the modern "Frenglish" language used in French business)... Le mar. 30 oct. 2018 à 22:58, Ken Whistler via Unicode a écrit : > > On 10/30/2018 2:32 PM, James Kass via Unicode wrote: > > but we can't seem to agree on how to encode its abbreviation. > > For what it's worth, "mgr" seems to be the usual abbreviation in Polish > for it. > > --Ken > > -------------- next part -------------- An HTML attachment was scrubbed... URL: