From unicode at unicode.org Sat Sep 1 01:00:02 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 1 Sep 2018 08:00:02 +0200 (CEST) Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <20180831081953.68476d36@spixxi> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: <1680938489.236.1535781602870.JavaMail.www@wwinf1d31> On 31/08/18 08:25 Marius Spix via Unicode wrote: > > A good compromise between human readability, machine processability and > filesize would be using YAML. > > Unlike JSON, YAML supports comments, anchors and references, multiple > documents in a file and several other features. Thanks for advice. Already I do use YAML syntaxic highlighting to display XCompose files, that use the colon as a separator, too. Did you figure out how YAML would fit UCD data? It appears to heavily rely on line breaks, that may get lost as data turns around across environments. XML indentation is only a readability feature and irrelevant to content. The structure is independent of invisible characters and is stable if only graphics are not corrupted (while it may happen that they are). Linebreaks are odd in that they are inconsistent across OSes, because Unicode was denied the right to impose a unique standard in that matter. The result is mashed-up files, and I fear YAML might not hold out. Like XML, YAML needs to repeat attribute names in every instance. That is precisely what CSV gets around of, at the expense of readability in plain text. Personally I could use YAML as I do use XML for lookup in the text editor, but I?m afraid that there is no advantage over CSV with respect to file size. Regards, Marcel > > Regards, > > Marius Spix > > > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode > wrote: > [?] From unicode at unicode.org Sat Sep 1 02:12:12 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Sat, 1 Sep 2018 09:12:12 +0200 Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <1680938489.236.1535781602870.JavaMail.www@wwinf1d31> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> <1680938489.236.1535781602870.JavaMail.www@wwinf1d31> Message-ID: <20180901091212.03841b71@spixxi> Hello Marcel, YAML supports references, so you can refer to another character?s properties. 
Example: repertoire: char: - name_alias: - [NUL,abbreviation] - ["NULL",control] cp: 0000 na1: "NULL" props: &0000 age: "1.1" na: "" JSN: "" gc: Cc ccc: 0 dt: none dm: "#" nt: None nv: NaN bc: BN bpt: n bpb: "#" Bidi_M: N bmg: "" suc: "#" slc: "#" stc: "#" uc: "#" lc: "#" tc: "#" scf: "#" cf: "#" jt: U jg: No_Joining_Group ea: N lb: CM sc: Zyyy scx: Zyyy Dash: N WSpace: N Hyphen: N QMark: N Radical: N Ideo: N UIdeo: N IDSB: N IDST: N hst: NA DI: N ODI: N Alpha: N OAlpha: N Upper: N OUpper: N Lower: N OLower: N Math: N OMath: N Hex: N AHex: N NChar: N VS: N Bidi_C: N Join_C: N Gr_Base: N Gr_Ext: N OGr_Ext: N Gr_Link: N STerm: N Ext: N Term: N Dia: N Dep: N IDS: N OIDS: N XIDS: N IDC: N OIDC: N XIDC: N SD: N LOE: N Pat_WS: N Pat_Syn: N GCB: CN WB: XX SB: XX CE: N Comp_Ex: N NFC_QC: Y NFD_QC: Y NFKC_QC: Y NFKD_QC: Y XO_NFC: N XO_NFD: N XO_NFKC: N XO_NFKD: N FC_NFKC: "#" CI: N Cased: N CWCF: N CWCM: N CWKCF: N CWL: N CWT: N CWU: N NFKC_CF: "#" InSC: Other InPC: NA PCM: N blk: ASCII isc: "" - cp: 0001 na1: "START OF HEADING" name_alias: - [SOH,abbreviation] - [START OF HEADING,control] props: *0000 Regards, Marius Spix On Sat, 1 Sep 2018 08:00:02 +0200 (CEST) schrieb Marcel Schneider wrote: > On 31/08/18 08:25 Marius Spix via Unicode wrote: > > > > A good compromise between human readability, machine processability > > and filesize would be using YAML. > > > > Unlike JSON, YAML supports comments, anchors and references, > > multiple documents in a file and several other features. > > Thanks for advice. Already I do use YAML syntaxic highlighting to > display XCompose files, that use the colon as a separator, too. > > Did you figure out how YAML would fit UCD data? It appears to heavily > rely on line breaks, that may get lost as data turns around across > environments. XML indentation is only a readability feature and > irrelevant to content. The structure is independent of invisible > characters and is stable if only graphics are not corrupted (while it > may happen that they are). Linebreaks are odd in that they are > inconsistent across OSes, because Unicode was denied the right to > impose a unique standard in that matter. The result is mashed-up > files, and I fear YAML might not hold out. > > Like XML, YAML needs to repeat attribute names in every instance. > That is precisely what CSV gets around of, at the expense of > readability in plain text. Personally I could use YAML as I do use > XML for lookup in the text editor, but I?m afraid that there is no > advantage over CSV with respect to file size. > > Regards, > > Marcel > > > > Regards, > > > > Marius Spix > > > > > > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via > > Unicode wrote: > > > [?] -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 488 bytes Desc: Digitale Signatur von OpenPGP URL: From unicode at unicode.org Sat Sep 1 06:35:32 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 1 Sep 2018 12:35:32 +0100 Subject: UCD in XML or in CSV? In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: <20180901123532.011f10e6@JRWUBU2> On Fri, 31 Aug 2018 10:36:45 +0200 Manuel Strehl via Unicode wrote: > For me it's currently much easier to have all the data in a single > place, e.g. a large XML file, than spread over a multitude of files > _with different ad-hoc syntaxes_. 
> > The situation would possibly be different, though, if the UCD data > would be split in several files of the same format. (Be it JSON, CSV, > YAML, XML, TOML, whatever. Just be consistent.) Most properties are stored in pretty much the same format in the UCD files. UnicodeData.txt is the major exception; it seems to date from when the set of properties was expected to be stable. The big exception is set-valued properties. PropList.txt can be viewed as having an odd syntax for storing the set of miscellaneous Boolean properties for which the codepoint has the value of 'true'. Richard. From unicode at unicode.org Sat Sep 1 07:16:03 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 1 Sep 2018 14:16:03 +0200 (CEST) Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: <20180901091212.03841b71@spixxi> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> <1680938489.236.1535781602870.JavaMail.www@wwinf1d31> <20180901091212.03841b71@spixxi> Message-ID: <290536618.2898.1535804163592.JavaMail.www@wwinf1d33> Thank you Marius for the example. Indeed I now see that YAML is a powerful means for a file to have an intuitive readability while drastically reducing file size. BTW what I conjectured about the role of line breaks is true for CSV too, and any file downloaded from UCD on a semicolon separator basis becomes unusable when displayed straight in the built-in text editor of Windows, given Unicode uses Unix EOL. ?Still for use in spreadsheets, YAML needs to be converted to CSV, although that might not crash the browser as large XML does. Regards, Marcel On 01/09/18 09:18 Marius Spix via Unicode wrote: > > Hello Marcel, > > YAML supports references, so you can refer to another character?s > properties. > > Example: > > repertoire: > char: > - > name_alias: > - [NUL,abbreviation] > - ["NULL",control] > cp: 0000 > na1: "NULL" > props: &0000 > age: "1.1" > na: "" > JSN: "" > gc: Cc > ccc: 0 > dt: none > dm: "#" > nt: None > nv: NaN > bc: BN > bpt: n > bpb: "#" > Bidi_M: N > bmg: "" > suc: "#" > slc: "#" > stc: "#" > uc: "#" > lc: "#" > tc: "#" > scf: "#" > cf: "#" > jt: U > jg: No_Joining_Group > ea: N > lb: CM > sc: Zyyy > scx: Zyyy > Dash: N > WSpace: N > Hyphen: N > QMark: N > Radical: N > Ideo: N > UIdeo: N > IDSB: N > IDST: N > hst: NA > DI: N > ODI: N > Alpha: N > OAlpha: N > Upper: N > OUpper: N > Lower: N > OLower: N > Math: N > OMath: N > Hex: N > AHex: N > NChar: N > VS: N > Bidi_C: N > Join_C: N > Gr_Base: N > Gr_Ext: N > OGr_Ext: N > Gr_Link: N > STerm: N > Ext: N > Term: N > Dia: N > Dep: N > IDS: N > OIDS: N > XIDS: N > IDC: N > OIDC: N > XIDC: N > SD: N > LOE: N > Pat_WS: N > Pat_Syn: N > GCB: CN > WB: XX > SB: XX > CE: N > Comp_Ex: N > NFC_QC: Y > NFD_QC: Y > NFKC_QC: Y > NFKD_QC: Y > XO_NFC: N > XO_NFD: N > XO_NFKC: N > XO_NFKD: N > FC_NFKC: "#" > CI: N > Cased: N > CWCF: N > CWCM: N > CWKCF: N > CWL: N > CWT: N > CWU: N > NFKC_CF: "#" > InSC: Other > InPC: NA > PCM: N > blk: ASCII > isc: "" > > - > cp: 0001 > na1: "START OF HEADING" > name_alias: > - [SOH,abbreviation] > - [START OF HEADING,control] > props: *0000 > > > > > > Regards, > > Marius Spix > > > On Sat, 1 Sep 2018 08:00:02 +0200 (CEST) > schrieb Marcel Schneider wrote: > [?] From unicode at unicode.org Sat Sep 1 08:15:56 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 1 Sep 2018 15:15:56 +0200 (CEST) Subject: UCD in XML or in CSV? 
(is: Parsing UCD in XML) In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: <1728293477.3292.1535807756905.JavaMail.www@wwinf1d33> On 31/08/18 10:47 Manuel Strehl via Unicode wrote: > > To handle the UCD XML file a streaming parser like Expat is necessary. Thanks for the tip. However for my needs, Expat looks like overkill, and I?m looking out for a much simpler standalone tool, just converting XML to CSV. > > For codepoints.net I use that data [?] Very good site IMO, as it compiles a lot of useful information trying to maximize human readability. Nice to have added the Adopt-a-character button, too. Thanks, Marcel From unicode at unicode.org Sat Sep 1 21:16:07 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 2 Sep 2018 04:16:07 +0200 (CEST) Subject: UCD in XML or in CSV? (is: UCD data consumption) Message-ID: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> I?m not responding without thinking, as I was blamed of when I did, but it is painful for me to dig into what Ken explained about how we should be consuming UCD data. I?ll now try to get some more clarity into the topic. > On 31/08/18 19:59 Ken Whistler via Unicode wrote: > [?] > > > > Third, please remember that folks who come here complaining about the > > complications of parsing the UCD are a very small percentage of a very > > small percentage of a very small percentage of interested parties. OK, among avg. 700 list subscribers, relatively few are ever complaining about anything, let alone about this particular topic. But we should always keep in mind that many folks out there complaining about Unicode don?t come here to do so. > > Nearly everybody who needs UCD data should be consuming it as a > > secondary source (e.g. for reference via codepoints.net), or as a > > tertiary source (behind specialized API's, regex, etc.), Like already suggested, ?as? should probably read ?via? in that part. > > or as an end > > user (just getting behavior they expect for characters in applications). That is more than a simple statement about who is consuming UCD data which way, as you say ?should.? There seem to be assumptions that it is discouraged to dive into the raw data; that folks reading file headers are not doing well; that the data should be assembled only in certain ways; and that ignorant people shouldn?t open the UCD cupboard to pick a file they deem useful. If so, then it might be surprising to know that when submitting a proposal about Bidi-mirroring mathematical symbols issues feedback http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html I?d started as a quasi-end-user not getting behavior I expected for characters in browsers, as I was spotting characters bidi-mirrored by glyph exchange, like it is implemented in web browsers, because I wanted that end-users could experience bidi-mirroring as it works. Unexpectedly a number of math symbols did not mirror, despite many of them being even scalar neighbors. > > Programmers who actually *need* to consume the raw UCD data files and > > write parsers for them directly should actually be able to deal with the > > format complexity -- and, if anything, slowing them down to make them > > think about the reasons for the format complexity might be a good thing, I can see one main reason for the format complexity, and that is that data from various propeties don?t necessarily telescope the same way to make for small files. 
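For instance, a short Python sketch shows what it takes to fold one of the range-based property files back into the one-value-per-code-point shape of UnicodeData.txt (a rough illustration only; it assumes a local copy of PropList.txt, and the code point printed at the end is arbitrary):

    # Expand the "range ; property" layout of PropList.txt into a
    # one-entry-per-code-point mapping, closer to the UnicodeData.txt shape.
    props = {}  # code point -> set of binary property names
    with open("PropList.txt", encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop trailing comments
            if not line:
                continue
            cps, prop = (field.strip() for field in line.split(";"))
            lo, _, hi = cps.partition("..")
            for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                props.setdefault(cp, set()).add(prop)

    print(sorted(props.get(0x0009, set())))   # the binary properties of U+0009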
The complexity of UCD would then mainly be self-induced, by packing data into one small file per property rather than adding each value to every relevant code point in one large list, as UnicodeData.txt does. While I'm taking the time to write this up because I'm committed to processing that information, we can think of the many, many people who don't like being slowed down trying to find out why Unicode changed the UCD design, when following the original idea of one large CSV list would have been straightforward, possibly by setting up a new one if the first got stuck. What I can figure out is that each time a new property was added, that property was thought of as being the last one. (At some point the many files were then dumped into the known XML files.) If the UCD is to be made of small files, it is necessarily complex, and the conclusion is that there should be another large CSV grid to make things simple and lightweight again, so far as they can be.

> > as it tends to put the lie to the easy initial assumption that the UCD
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with "simple" taken out? It no longer appears to me to be a lie then. One attribute comes to mind that is so complex that its design even changed over time, despite Unicode's commitment to stability. The Bidi_Mirrored_Glyph property was originally designed to include "best-fit" pairs for least-worse display in applications not supporting RTL glyphs (i.e. without OpenType support), with the legibility of math formulae in mind. Later (probably due to a poorly written OpenType spec), no more best-fit pairs were added to BidiMirroring.txt, as if OpenType implementers were not going to remove the best-fit pairs anyway before using the file (while the spec says to use it as-is). That then led to the display problem pointed out above. I'm leaving aside the particular problem related to 3 pairs of symbols with tilde, as well as the missing Bidi_Mirroring_Type property, given UTC was not interested.

So you can understand that I'm not unaware of the complexity of UCD. Though I don't think that this could be an argument for not publishing a medium-size CSV file with scalar values listed as in UnicodeData.txt.

> > [...]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

I.e. not every spreadsheet application seems to do the job as I need it.

Regards,

Marcel

From unicode at unicode.org Mon Sep 3 01:24:06 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Mon, 03 Sep 2018 08:24:06 +0200 Subject: UCD in XML or in CSV? (is: UCD data consumption) In-Reply-To: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> (Marcel Schneider via Unicode's message of "Sun, 2 Sep 2018 04:16:07 +0200 (CEST)") References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> Message-ID: <86ftyr3xq1.fsf@mimuw.edu.pl>

On Sun, Sep 02 2018 at 4:16 +0200, [...]

> So you can understand that I'm not unaware of the complexity of UCD. Though
> I don't think that this could be an argument for not publishing a medium-size
> CSV file with scalar values listed as in UnicodeData.txt.

For a non-programmer like me CSV is a much more convenient form than XML - I can use it not only with a spreadsheet, but also import it directly into a database and analyse it with various queries. XML is politically correct, but practically almost unusable without a specialised parser.
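For example, a few lines of Python are enough to pull the raw semicolon-separated UnicodeData.txt into SQLite and query it like any other table (a rough sketch; the database name and column names are invented for the illustration, and only the first three of the fifteen fields are kept):

    import csv
    import sqlite3

    con = sqlite3.connect("ucd.sqlite")
    con.execute("CREATE TABLE IF NOT EXISTS ucd (cp TEXT, name TEXT, gc TEXT)")

    with open("UnicodeData.txt", encoding="utf-8") as f:
        # UnicodeData.txt has no header line and fifteen ;-separated fields:
        # field 0 is the code point, field 1 the name, field 2 the general category.
        rows = ((r[0], r[1], r[2]) for r in csv.reader(f, delimiter=";"))
        con.executemany("INSERT INTO ucd VALUES (?, ?, ?)", rows)
    con.commit()

    # e.g. how many code points there are per general category
    for gc, n in con.execute(
            "SELECT gc, COUNT(*) AS n FROM ucd GROUP BY gc ORDER BY n DESC"):
        print(gc, n)

No spreadsheet, no XML tooling, nothing beyond what ships with Python is needed for that kind of analysis.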
On Sat, Sep 01 2018 at 15:15 +0200, unicode at unicode.org writes: > On 31/08/18 10:47 Manuel Strehl via Unicode wrote: >> >> To handle the UCD XML file a streaming parser like Expat is necessary. > > Thanks for the tip. However for my needs, Expat looks like overkill, and I?m > looking out for a much simpler standalone tool, just converting XML to CSV. I think CSV and XML can coexist peacefully, we just need an open source round-trip converter. Last but not least, let me remind that the thread was started by a question what is the most convenient way to describe the properties of PUA characters. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Sep 3 02:45:38 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Mon, 03 Sep 2018 09:45:38 +0200 Subject: CLDR In-Reply-To: <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> (Marcel Schneider via Unicode's message of "Fri, 31 Aug 2018 12:17:41 +0200 (CEST)") References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> Message-ID: <86wos32fdp.fsf@mimuw.edu.pl> On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote: > The XML files in these folders: > > https://unicode.org/repos/cldr/tags/latest/common/ Thanks for the link. In the meantime I rediscovered Locale Explorer http://demo.icu-project.org/icu-bin/locexp which I used some time ago. On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote: > On 31/08/18 07:27 Janusz S. Bie? via Unicode wrote: > [?] >> > Given NamesList.txt / Code Charts comments are kept minimal by design, >> > one couldn?t simply pop them into XML or whatever, as the result would be >> > disappointing and call for completion in the aftermath. Yet another task >> > competing with CLDR survey. >> >> Please elaborate. It's not clear for me what do you mean. > > These comments are designed for the Code Charts and as such must not be > disproportionate in exhaustivity. Eg we have lists of related languages ending > in an ellipsis. Looks like we have different comments in mind. [...] >> > Reviewing CLDR data is IMO top priority. >> > There are many flaws to be fixed in many languages including in English. >> > A lot of useful digest charts are extracted from XML there, >> >> Which XML? where? > > More precisely it is LDML, the CLDR-specific XML. > What I called ?digest charts? are the charts found here: > > http://www.unicode.org/cldr/charts/34/ > > The access is via this page: > > http://cldr.unicode.org/index/downloads > > where the charts are in the Charts column, while the raw data is under > SVN Tag. Thanks for the link. I found especially interesting the Polish section in https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html Looks like a complete rubbish, e.g. plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of Pomorze) transliterated into the Greek alphabet (and something in Arabic). The header of the page says "The coverage depends on the availability of data in wikidata for these names" but I was unable to find this rubbish in Wikidata (but I was not looking very hard). > >> >> > and we really >> > need to go through the data and correct the many many errors, please. But who is the right person or institution to do it? >> >> Some time ago I tried to have a close look at the Polish locale and >> found the CLDR site prohibitively confusing. 
> > I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive > for the access to the XML data (except when knowing about SubVersioN). > Polish data is found here: > > https://www.unicode.org/cldr/charts/34/summary/pl.html > > The access is via the top of the "Summary" index page (showing root data): > > https://www.unicode.org/cldr/charts/34/summary/root.html > > You may wish to particularly check the By-Type charts: > > https://www.unicode.org/cldr/charts/34/by_type/index.html > > Here I?d suggest to first focus on alphabetic information and on punctuation. > > https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html > > Under Latin (table caption, without anchor) we find out what punctuation > Polish has compared to other locales using the same script. > The exact character appears when hovering the header row. > Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is > an error in almost every locale using hyphen. TC is about to correct that. > > Further you will see that while Polish is using apostrophe > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish > CLDR does not have the correct apostrophe for Polish, as opposed eg to French. I understand that by "the correct apostrophe" you mean U+2019 RIGHT SINGLE QUOTATION MARK. > You may wish to note that from now on, both U+0027 APOSTROPHE and > U+0022 QUOTATION MARK are ruled out in almost all locales, given the > preferred characters in publishing are U+2019 and, for Polish, the U+201E and > U+201D that are already found in CLDR pl. The situation seems more complicated because the chart https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html contains different list of punctuation characters than https://www.unicode.org/cldr/charts/34/summary/pl.html. I guess the latter is the primary one, and it contains U+2019 RIGHT SINGLE QUOTATION MARK (and U+0x2018 LEFT SINGLE QUOTATION MARK, too). > > Note however that according to the information provided by English Wikipedia: > https://en.wikipedia.org/wiki/Quotation_mark#Polish > Polish also uses single quotes, that by contrast are still missing in CLDR. You are right, but who cares? Looks like this has no practical importance. Nobody complains about the wrong use of quotation marks in Polish by Word or OpenOffice, so looks like the software doesn't use this information. So this is rather a matter of aesthetics... Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Mon Sep 3 04:03:31 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 3 Sep 2018 01:03:31 -0800 Subject: CLDR In-Reply-To: <86wos32fdp.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> <86wos32fdp.fsf@mimuw.edu.pl> Message-ID: Janusz S. Bie? wrote, > Thanks for the link. I found especially interesting the Polish section > in > > https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html > > Looks like a complete rubbish, e.g. > > plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of > Pomorze) transliterated into the Greek alphabet (and something in > Arabic). And nothing in Armenian, Albanian, or Pashto. If you click on the link at "plpm", it takes you right back to that same entry on that same page, which doesn't seem very helpful. 
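To see what Wikidata itself currently holds for that entity, a quick Python sketch against the public Special:EntityData endpoint can dump a few of its labels (the choice of Q54180 and of the languages printed is only for illustration):

    import json
    import urllib.request

    # Fetch the raw Wikidata record for Q54180 (Pomeranian Voivodeship).
    url = "https://www.wikidata.org/wiki/Special:EntityData/Q54180.json"
    with urllib.request.urlopen(url) as resp:
        entity = json.load(resp)["entities"]["Q54180"]

    # Print a handful of labels to compare with what ended up in the CLDR chart.
    for lang in ("en", "pl", "el", "fa"):
        label = entity["labels"].get(lang, {}).get("value", "(no label)")
        print(lang, label)

That gets the names straight from the source, without going through the chart.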
> The header of the page says "The coverage depends on the availability of > data in wikidata for these names" but I was unable to find this rubbish > in Wikidata (but I was not looking very hard). I tried both "plpm" and "?????????" in the Wikidata search box. On the latter, there were some pages which looked to translate place names into various languages, for both Germany and Poland. I couldn't find the exact page, but it would be something like this page: https://www.wikidata.org/wiki/Q54180 (Clicking "All Entered Languages" on that page gives a lengthy list.) >>> > and we really >>> > need to go through the data and correct the many many errors, please. > > But who is the right person or institution to do it? If the CLDR information is driven by Wikidata as the file header indicates, then Wikidata. From unicode at unicode.org Mon Sep 3 04:37:12 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 3 Sep 2018 01:37:12 -0800 Subject: CLDR In-Reply-To: References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> <86wos32fdp.fsf@mimuw.edu.pl> Message-ID: I wrote, > ... I couldn't find the exact page, but it would > be something like this page: > > https://www.wikidata.org/wiki/Q54180 Hmmm, maybe that is the exact page. That page does show the ISO 3166-2 code as "PL-PM". So, if that's the correct page and the English is given as "Pomeranian Voivodeship", why is CLDR giving the English as "Federal Capital Territory"? The Wikidata page was last edited/updated on 2018-08-25. The CLDR page doesn't include last updated information. Perhaps it hasn't been updated in a while. From unicode at unicode.org Mon Sep 3 05:03:36 2018 From: unicode at unicode.org (Arthur Reutenauer via Unicode) Date: Mon, 3 Sep 2018 12:03:36 +0200 Subject: CLDR In-Reply-To: <201809030954.w839sXrQ031569@nef2.ens.fr> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> <201809030954.w839sXrQ031569@nef2.ens.fr> Message-ID: <20180903100336.GA4175881@phare.normalesup.org> On Mon, Sep 03, 2018 at 09:45:38AM +0200, Janusz S. Bie? via Unicode wrote: > plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of > Pomorze) transliterated into the Greek alphabet (and something in > Arabic). This must be a mistake (a strange copy-paste side effect?). Federal Capital Territory is a subdivision of Nigeria. The Persian name seems correct. Best, Arthur From unicode at unicode.org Mon Sep 3 05:07:39 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Mon, 3 Sep 2018 12:07:39 +0200 Subject: UCD in XML or in CSV? (is: UCD data consumption) In-Reply-To: <86ftyr3xq1.fsf@mimuw.edu.pl> References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> <86ftyr3xq1.fsf@mimuw.edu.pl> Message-ID: <20180903100739.pe5w23ybcpvw5rrx@angband.pl> On Mon, Sep 03, 2018 at 08:24:06AM +0200, Janusz S. Bie? via Unicode wrote: > For a non-programmer like me CVS is much more convenient form than XML - > I can use it not only with a spreadsheet, but also import directly into > a database and analyse with various queries. XML is politically correct, > but practically almost unusable without a specialised parser. And for a programmer, XML is outright insane. You need a complex library to do so, and those fail KISS so badly that you have a CVE roughly yearly. 
On the other hand, writing a parser for current headerless ;-separated data completely from scratch is just: cut -d';' -f 1,6 On 03/09/18 09:53 Janusz S. Bie? via Unicode wrote: > > On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote: > > The XML files in these folders: > > > > https://unicode.org/repos/cldr/tags/latest/common/ > > Thanks for the link. > > In the meantime I rediscovered Locale Explorer > > http://demo.icu-project.org/icu-bin/locexp > > which I used some time ago. Nice. Actually based on CLDR v31.0.1. > > On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote: > > On 31/08/18 07:27 Janusz S. Bie? via Unicode wrote: > > [?] > >> > Given NamesList.txt / Code Charts comments are kept minimal by design, > >> > one couldn?t simply pop them into XML or whatever, as the result would be > >> > disappointing and call for completion in the aftermath. Yet another task > >> > competing with CLDR survey. > >> > >> Please elaborate. It's not clear for me what do you mean. > > > > These comments are designed for the Code Charts and as such must not be > > disproportionate in exhaustivity. Eg we have lists of related languages ending > > in an ellipsis. > > Looks like we have different comments in mind. Then I?m sorry to be off-topic. [?] > >> > and we really > >> > need to go through the data and correct the many many errors, please. > > But who is the right person or institution to do it? Software vendors are committed to care for the data, and may delegate survey to service providers specialized in localization. Then I think that public language offices should be among the reviewers. Beyond, and especially by lack of the latter, anybody is welcome to contribute as a guest. (Guest votes are 1 and don?t add one to another.) That is consistent with the fact that Unicode relies on volunteers, too. I?m volunteering to personally welcome you to contribute to CLDR. [?] > > Further you will see that while Polish is using apostrophe > > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish > > CLDR does not have the correct apostrophe for Polish, as opposed eg to French. > > I understand that by "the correct apostrophe" you mean U+2019 RIGHT > SINGLE QUOTATION MARK. Yes. > > > You may wish to note that from now on, both U+0027 APOSTROPHE and > > U+0022 QUOTATION MARK are ruled out in almost all locales, given the > > preferred characters in publishing are U+2019 and, for Polish, the U+201E and > > U+201D that are already found in CLDR pl. > > The situation seems more complicated because the chart > > https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html > > contains different list of punctuation characters than > > https://www.unicode.org/cldr/charts/34/summary/pl.html. > > I guess the latter is the primary one, and it contains U+2019 RIGHT > SINGLE QUOTATION MARK (and U+0x2018 LEFT SINGLE QUOTATION MARK, too). It?s a bit confusing because there is a column for English and a column for Polish. The characters you retrieved are actually in the English column, while Polish has consistently with By-Type, these quotation marks: ' " ? ? ? ? Hence the set is incomplete. > > > > > Note however that according to the information provided by English Wikipedia: > > https://en.wikipedia.org/wiki/Quotation_mark#Polish > > Polish also uses single quotes, that by contrast are still missing in CLDR. > > You are right, but who cares? Looks like this has no practical > importance. 
Nobody complains about the wrong use of quotation marks in > Polish by Word or OpenOffice, so looks like the software doesn't use > this information. So this is rather a matter of aesthetics... I?ve come to the position that to let a word processor ?use? quotation marks is to miss the point. Quotation marks are definitely used by the user typing in his or her text, and are expected to be on the keyboard layout he or she is using. So-called smart quotes guessed algorithmically from ASCII simple and double quote are but a hazardous workaround when not installing the appropriate keyboard layout. At least that is my position :) Best regards, Marcel From unicode at unicode.org Mon Sep 3 04:26:38 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 3 Sep 2018 10:26:38 +0100 (BST) Subject: Encoding character information for characters of a Private Use Area use (from Re: UCD in XML or in CSV?) In-Reply-To: <86ftyr3xq1.fsf@mimuw.edu.pl> References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> <86ftyr3xq1.fsf@mimuw.edu.pl> Message-ID: <1843880.12150.1535966798390.JavaMail.defaultUser@defaultHost> Janusz S. Bien wrote: > Last but not least, let me remind that the thread was started by a question what is the most convenient way to describe the properties of PUA characters. >From what I have learned during the time period of the discussion it seems to me that using JSON would be a good idea. http://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0144.html http://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0145.html It appears that all that is needed is to define an object named PUAINFO and then put the name PUAINFO inside quotation marks and then define the object in whatever JSON way one chooses to do it. For example, one could have an array of values, one or more of which could be a string listing a PUA (Private Use Area) code point or a range of PUA code points. For examples, "$E001" and "$E100..$E17F", together with strings containing other information. One such string, maybe the first after the colon, whether or not within an array, could be a description of the particular Private Use Area use that the particular file supports. Using JSON would mean that the format would be independent of any particular programming language and could be designed to be straightforwardly read by humans as well. >From reading the documents I think that the structure may start as follows, though I am not congruently sure of the matter at this time. {"PUAINFO": There are then various ways to proceed, such as for example having everything in one array, or for example having many names each of which has data. Having many names each of which has data may well look more elegant in a print out and be more easily read by humans, yet having everything in one array in a known order may mean that getting the format implemented in software applications might be easier and thus more likely to happen. Whichever way it is done, then provided it is done rigorously, a format which becomes implemented widely in applications would be a contribution of lasting value. William Overington Monday 3 September 2018 From unicode at unicode.org Mon Sep 3 14:40:01 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 3 Sep 2018 21:40:01 +0200 Subject: UCD in XML or in CSV? 
(is: UCD data consumption) In-Reply-To: <20180903100739.pe5w23ybcpvw5rrx@angband.pl> References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> <86ftyr3xq1.fsf@mimuw.edu.pl> <20180903100739.pe5w23ybcpvw5rrx@angband.pl> Message-ID: But CSV is only fine for pure tabular data, and the UCD or CDLR data is has a more complex structure than a simple 2D table. In addition, the schema is evolving, with new kind of datas added everytime; you cannot keep that compatibility by adding more empty columns to a single table; adding new semicolons or other separators to a CSV makes the formaty much less readable, and in fact it will then contain lot of redundancy. Like traditional relational databases, these project need a schema and structure. But if we have to use a RDBMS API, we'll loose the possibility for using various tools. So these Unicode databases are using collections of tables and in some cases you need to split a value into multiple ones with different scoping rules: for that job JSON or XML is fine. But nothing prevents you to load the existing UCD/CLDR database files into a relational database and expose the data in different views. But most applications are in fact built by first laoding this data with a parser specific to the application, that will convert it to its application-defined schema, and data can be recompiled in a new form that will then be exposed by an application API. XML if then fine ! It has no cost for final users that just use the generated applications. It's only up to application compiler projects to parse the data, generate their code, and integrate the data to their API (there are more useful tools than just "grep'ing the UCD/CLDR datafiles. Also the UCD and CLDR files are checked by other automated tools that already parse them, and load them to perform consistency checks and generate multiple presentations: the important ICU project is built and maintained for that, it has all the tools needed, plus a reduced API that can be used directly by final applications. Even some UCD files are now automatically generated from other source files, they contain automatically generated reports, Only the initial main UCD file has kept its initial pure CSV form: it was no longer possible to continue extending this single file, but compatibility has been preserved and it's a good thing. All others contain comment lines, and basic report lines. Le lun. 3 sept. 2018 ? 12:16, Adam Borowski via Unicode a ?crit : > On Mon, Sep 03, 2018 at 08:24:06AM +0200, Janusz S. Bie? via Unicode wrote: > > For a non-programmer like me CVS is much more convenient form than XML - > > I can use it not only with a spreadsheet, but also import directly into > > a database and analyse with various queries. XML is politically correct, > > but practically almost unusable without a specialised parser. > > And for a programmer, XML is outright insane. You need a complex library > to > do so, and those fail KISS so badly that you have a CVE roughly yearly. > On the other hand, writing a parser for current headerless ;-separated data > completely from scratch is just: > > cut -d';' -f 1,6 or: > (split/;/)[0,5] > > JSON is somewhat better, but still needs drastically more effort. > CSV (especially with no escapes) is trivial to handle. > > > ????! > -- > ??????? What Would Jesus Do, MUD/MMORPG edition: > ??????? ? multiplay with an admin char to benefit your mortal [Mt3:16-17] > ??????? ? abuse item cloning bugs [Mt14:17-20, Mt15:34-37] > ??????? ? 
use glitches to walk on water [Mt14:25-26] > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Sep 4 04:02:50 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 4 Sep 2018 01:02:50 -0800 Subject: CLDR In-Reply-To: <86zhwxy8iq.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> <86wos32fdp.fsf@mimuw.edu.pl> <86zhwxy8iq.fsf@mimuw.edu.pl> Message-ID: (This is the response from Janusz S. Bie? which was sent to the public list.) On Mon, Sep 03 2018 at 1:03 -0800, James Kass wrote: > Janusz S. Bie? wrote, > >> Thanks for the link. I found especially interesting the Polish section >> in >> >> https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html >> >> Looks like a complete rubbish, e.g. >> >> plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of >> Pomorze) transliterated into the Greek alphabet (and something in >> Arabic). > > And nothing in Armenian, Albanian, or Pashto. > > If you click on the link at "plpm", it takes you right back to that > same entry on that same page, which doesn't seem very helpful. > >> The header of the page says "The coverage depends on the availability of >> data in wikidata for these names" but I was unable to find this rubbish >> in Wikidata (but I was not looking very hard). > > I tried both "plpm" and "?????????" in the Wikidata search box. On > the latter, there were some pages which looked to translate place > names into various languages, for both Germany and Poland. I couldn't > find the exact page, but it would be something like this page: > > https://www.wikidata.org/wiki/Q54180 > > (Clicking "All Entered Languages" on that page gives a lengthy list.) Thanks! Most data about Poland at https://www.wikidata.org/wiki/Q36 seem to make sense, but I don't think anybody is using abbreviation like "plpm" (for Pomorze/Pomerania). > >>>> > and we really >>>> > need to go through the data and correct the many many errors, please. >> >> But who is the right person or institution to do it? > > If the CLDR information is driven by Wikidata as the file header > indicates, then Wikidata. I hope not all CLDR data are driven by Wikidata... On Mon, Sep 03 2018 at 12:28 +0200, Marcel Schneider wrote: > On 03/09/18 09:53 Janusz S. Bie? via Unicode wrote: [...] >> > These comments are designed for the Code Charts and as such must not be >> > disproportionate in exhaustivity. Eg we have lists of related languages ending >> > in an ellipsis. >> >> Looks like we have different comments in mind. > > Then I?m sorry to be off-topic. Let's say off the original topic. My primary concern is to preserve somehow such comments as e.g. the one on the bottom of page 14 of https://folk.uib.no/hnooh/mufi/specs/MUFI-CodeChart-4-0.pdf > > [?] >> >> > and we really >> >> > need to go through the data and correct the many many errors, please. >> >> But who is the right person or institution to do it? > > Software vendors are committed to care for the data, and may delegate survey > to service providers specialized in localization. Then I think that public language > offices should be among the reviewers. Beyond, and especially by lack of the > latter, anybody is welcome to contribute as a guest. (Guest votes are 1 and don?t > add one to another.) That is consistent with the fact that Unicode relies on > volunteers, too. 
> > I?m volunteering to personally welcome you to contribute to CLDR. Thanks. The interesting question is who is/was already contributing from Poland or about Polish language. I vaguely remember a post with this information, but at that time I was not interested enough to take a note. > > [?] >> > Further you will see that while Polish is using apostrophe >> > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish >> > CLDR does not have the correct apostrophe for Polish, as opposed eg to French. >> >> I understand that by "the correct apostrophe" you mean U+2019 RIGHT >> SINGLE QUOTATION MARK. > > Yes. > >> >> > You may wish to note that from now on, both U+0027 APOSTROPHE and >> > U+0022 QUOTATION MARK are ruled out in almost all locales, given the >> > preferred characters in publishing are U+2019 and, for Polish, the U+201E and >> > U+201D that are already found in CLDR pl. [...] > It?s a bit confusing because there is a column for English and a column for Polish. > The characters you retrieved are actually in the English column, while Polish has > consistently with By-Type, these quotation marks: > ' " ? ? ? ? > Hence the set is incomplete. You are right, thanks. But was is the practical importance of it? I noticed that sometimes in Emacs 'forward-word" behaves strangely on a text with unusual characters, but had no motivation to investigate how this is related to the current locale. >> >> > >> > Note however that according to the information provided by English Wikipedia: >> > https://en.wikipedia.org/wiki/Quotation_mark#Polish >> > Polish also uses single quotes, that by contrast are still missing in CLDR. >> >> You are right, but who cares? Looks like this has no practical >> importance. Nobody complains about the wrong use of quotation marks in >> Polish by Word or OpenOffice, so looks like the software doesn't use >> this information. So this is rather a matter of aesthetics... > > I?ve come to the position that to let a word processor ?use? quotation marks > is to miss the point. Quotation marks are definitely used by the user typing > in his or her text, and are expected to be on the keyboard layout he or she > is using. So-called smart quotes guessed algorithmically from ASCII simple > and double quote are but a hazardous workaround when not installing the > appropriate keyboard layout. At least that is my position :) The standard keyboard has a limiting number of keys, so you have to make compromises. It is generally accepted that Polish keyboard layouts (there are primarily two of them) does not contain apostrophe or single quotations marks. There is a proposal by Marcin Woli?ski http://marcinwolinski.pl/keyboard/ which is available in most Linux distributions but it does not seem popular. From unicode at unicode.org Tue Sep 4 21:08:40 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 5 Sep 2018 04:08:40 +0200 (CEST) Subject: CLDR [terminating] Message-ID: <1219205358.11203.1536113320424.JavaMail.www@wwinf1h11> Sorry for not noticing that this thread belongs to CLDR-users, not to Unicode Public. 
Hence I?m taking it off this list, welcoming participants to follow up there: https://unicode.org/pipermail/cldr-users/2018-September/000833.html From unicode at unicode.org Thu Sep 6 11:58:22 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 06 Sep 2018 09:58:22 -0700 Subject: UCD in XML or in =?UTF-8?Q?CSV=3F=20=28is=3A=20UCD=20in=20YAML=29?= Message-ID: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Marcel Schneider wrote: > BTW what I conjectured about the role of line breaks is true for CSV > too, and any file downloaded from UCD on a semicolon separator basis > becomes unusable when displayed straight in the built-in text editor > of Windows, given Unicode uses Unix EOL. It's been well known for decades that Windows Notepad doesn't display LF-terminated text files correctly. The solution is to use almost any other editor. Notepad++ is free and a great alternative, but there are plenty of others (no editor wars, please). The RFC Editor site explains why it provides PDF versions of every RFC, nearly all of which are plain text: "The primary version of every RFC is encoded as an ASCII text file, which was once the lingua franca of the computer world. However, users of Microsoft Windows often have difficulty displaying vanilla ASCII text files with the correct pagination." which similarly assumes that "users of Microsoft Windows" have only Notepad at their disposal. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Sep 6 19:22:46 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Fri, 7 Sep 2018 05:52:46 +0530 Subject: Shortcuts question Message-ID: Hello. This may be slightly OT for this list but I'm asking it here as it concerns computer usage with multiple scripts and i18n: 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" io Ctrl+A for "all"? 2) How about when the shortcuts are the Alt+ combinations referring to underlined letters in actual user visible strings? 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the other XCV shortcuts) Z key or the Y key which is in the physical position of the QWERTY Z key (and close to the other XCV shortcuts)? 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic or Japanese? 4a) I mean how are they displayed on screen? 4b) Like #1 above, are they changed per language? 4c) Like #2 above, how about for user visible shortcuts? (In India since English is an associate official language, most computer users are at least conversant with basic English so we use the English/QWERTY shortcuts even if the keyboard physically shows an Indic script.) Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Sep 6 22:27:08 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 7 Sep 2018 05:27:08 +0200 (CEST) Subject: Shortcuts question In-Reply-To: References: Message-ID: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> On 07/09/18 02:32 Shriramana Sharma via Unicode wrote: > > Hello. This may be slightly OT for this list but I'm asking it here as it concerns computer usage with multiple scripts and i18n: It actually belongs on CLDR-users list. But coming from you, it shall remain here while I?m posting a quick answer below. > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" io Ctrl+A for "all"? No, Ctrl+A remains Ctrl+A on a French keyboard. 
> 2) How about when the shortcuts are the Alt+ combinations referring to underlined letters in actual user visible strings? I don?t know, but the accelerator shortcuts usually process text input, so it would be up to the vendor to keep them in sync. > 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the other XCV shortcuts) Z key or the Y key > which is in the physical position of the QWERTY Z key (and close to the other XCV shortcuts)? On Windows, that this question refers to, virtual keys move around with graphics on Latin keyboards. While Ctrl+Z on QWERTZ is not handy, I can tell that it is Ctrl+Z on AZERTY with the key having the Z on it and typing "z". The latter is most relevant on Linux where graphics are used even to process the Ctrl+ shortcuts. > 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic or Japanese? On Windows as they depend on Virtual Keys, they may be laid out on an underlying QWERTY basis. The same may apply on macOS, where distinct levels are present in the XML keylayout (and likewise in system-shipped layouts) to map the letters associated with shortcuts, regardless of the script. On Linux, shortcuts are reported not to work on some non-Latin keyboard layouts (because key names are based on ISO key positions, and XKB doesn?t appear to use a "Group0" level to map the shortcut letters; needs to be investigated). > 4a) I mean how are they displayed on screen?? My short answer is: I?ve got no experience; maybe using Latin letters and locale labels. > 4b) Like #1 above, are they changed per language? Non-Latin scripts typically use QWERTY for ASCII input, so shortcuts may not be changed per language. > 4c) Like #2 above, how about for user visible shortcuts? Again I?m leaving this over to non-Latin script experts. > (In India since English is an associate official language, most computer users are at least conversant with basic English > so we use the English/QWERTY shortcuts even if the keyboard physically shows an Indic script.) The same applies to virtually any non-Latin locale. Michael Kaplan reported that only on Latin keyboards VKs move around. > Thanks! You are welcome. Marcel From unicode at unicode.org Thu Sep 6 22:50:56 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 7 Sep 2018 05:50:56 +0200 (CEST) Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: <884912406.176.1536292256610.JavaMail.www@wwinf1m09> On 06/09/18 19:09 Doug Ewell via Unicode wrote: > > Marcel Schneider wrote: > > > BTW what I conjectured about the role of line breaks is true for CSV > > too, and any file downloaded from UCD on a semicolon separator basis > > becomes unusable when displayed straight in the built-in text editor > > of Windows, given Unicode uses Unix EOL. > > It's been well known for decades that Windows Notepad doesn't display > LF-terminated text files correctly. The solution is to use almost any > other editor. Notepad++ is free and a great alternative, but there are > plenty of others (no editor wars, please). > > The RFC Editor site explains why it provides PDF versions of every RFC, > nearly all of which are plain text: > > "The primary version of every RFC is encoded as an ASCII text file, > which was once the lingua franca of the computer world. 
However, users > of Microsoft Windows often have difficulty displaying vanilla ASCII text > files with the correct pagination." > > which similarly assumes that "users of Microsoft Windows" have only > Notepad at their disposal. Thank you, I?ve got the point. I?m taking this opportunity to apologize and disclaim for this post of mine: https://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0134.html where I was not joking, but completely out of matter, unable to make sense of the "Unicode Digest" subject line, that refers to a mail engine feature and remained unchanged due to limited editing capabilities in a cellphone mailer. Likewise "unicode-request at unicode.org" is used by the engine for that purpose. My apologies to Doug Ewell, and thanks for your kind reply taking the pain while having limited access to e-mail. Best regards, Marcel From unicode at unicode.org Fri Sep 7 08:03:46 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Fri, 7 Sep 2018 15:03:46 +0200 (CEST) Subject: Shortcuts question In-Reply-To: References: Message-ID: <534252510.112927.1536325426517@ox.hosteurope.de> Shriramana Sharma: > > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for > "tout" io Ctrl+A for "all"? Some are, many are not. For instance, some text editors use a modifier key with F and K instead of B and I for bold ("fett") and italic ("kursiv"). > 2) How about when the shortcuts are the Alt+ combinations referring to > underlined letters in actual user visible strings? Those depend much more language dependent than Ctrl/Cmd shortcuts. > 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt > the other XCV shortcuts) Z key or the Y key which is in the physical > position of the QWERTY Z key (and close to the other XCV shortcuts)? For some shortcuts the key position is more important (e.g. the one left from the 1 key), for others it's the initial / conventional letter of the command. Most QWERTZ users are not used to expect the undo shortcut (Z) next to the keys for cut (X), copy (C) and paste (V). By the way, accompanying redo is notoriously inconsistent, sometimes Y, sometimes Shift+Z. More serious problems arise with non-letter keys. For instance, square brackets [ and ] are readily available on the US / English keyboard layout, but require modifier keys like Shift or Alt on many other keyboard layouts, which may be the same ones as for the curly braces { and }. This means, some seemingly simple and intuitive shortcuts on an English keyboard become cumbersome on international ones. From unicode at unicode.org Fri Sep 7 12:55:43 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 7 Sep 2018 19:55:43 +0200 Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode a ?crit : > Marcel Schneider wrote: > > > BTW what I conjectured about the role of line breaks is true for CSV > > too, and any file downloaded from UCD on a semicolon separator basis > > becomes unusable when displayed straight in the built-in text editor > > of Windows, given Unicode uses Unix EOL. > > It's been well known for decades that Windows Notepad doesn't display > LF-terminated text files correctly. The solution is to use almost any > other editor. 
Notepad++ is free and a great alternative, but there are > plenty of others (no editor wars, please). > This has changed recently in Windows 10, where the builtin Notepad app now parses text files using LF only correctly (you can edit and save using the same convention for newlines, which is now autodetected; Notepad still creates new files using CRLF and saves them after edit using CRLF). Notepad now displays the newline convention in the status bar as "Windows (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column counters. There's still no preference interface to specify the default convention: CRLF is still the the default for new files. And no way to switch the convention before saving. In Notepad++ you do that with menu "Edit" > "Convert newlines" and select one of "Convert to Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)" -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Sep 7 13:04:05 2018 From: unicode at unicode.org (J Decker via Unicode) Date: Fri, 7 Sep 2018 11:04:05 -0700 Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > > > Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode > a ?crit : > >> Marcel Schneider wrote: >> >> > BTW what I conjectured about the role of line breaks is true for CSV >> > too, and any file downloaded from UCD on a semicolon separator basis >> > becomes unusable when displayed straight in the built-in text editor >> > of Windows, given Unicode uses Unix EOL. >> >> It's been well known for decades that Windows Notepad doesn't display >> LF-terminated text files correctly. The solution is to use almost any >> other editor. Notepad++ is free and a great alternative, but there are >> plenty of others (no editor wars, please). >> > > This has changed recently in Windows 10, where the builtin Notepad app now > parses text files using LF only correctly (you can edit and save using the > same convention for newlines, which is now autodetected; Notepad still > creates new files using CRLF and saves them after edit using CRLF). > > I would love to have a notepad that handled \n. My system is up to date. What update must I get to have notepad handle newline only files? (and I dare say notepad is the ONLY program that doesn't handle either convention, command line `edit` and `wordpad`(write) even handled them) I'm sure there exists other programs that do it wrong; but none I've ever used or found, or written. Notepad now displays the newline convention in the status bar as "Windows > (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column > counters. There's still no preference interface to specify the default > convention: CRLF is still the the default for new files. > > And no way to switch the convention before saving. In Notepad++ you do > that with menu "Edit" > "Convert newlines" and select one of "Convert to > Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)" > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Sep 7 13:18:09 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 7 Sep 2018 20:18:09 +0200 Subject: UCD in XML or in CSV? 
(is: UCD in YAML) In-Reply-To: References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: That version has been announced in the Windows 10 Hub several weeks ago. I think it is part of the 1809 version (for now RS5 prerelease for Insiders) that may be deployed in the final release coming soon. I hope you'll have also the option to switch the newline convention after loading and before saving to convert these newlines. and may be define the new default preference, so we will finally forget the CRLF convention. I have it working quite well inthe Insider fast ring. In all IDE editors however (including Developer Studio), the 2 or 3 conventions were still available since long. Le ven. 7 sept. 2018 ? 20:04, J Decker a ?crit : > > > On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> >> >> Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode >> a ?crit : >> >>> Marcel Schneider wrote: >>> >>> > BTW what I conjectured about the role of line breaks is true for CSV >>> > too, and any file downloaded from UCD on a semicolon separator basis >>> > becomes unusable when displayed straight in the built-in text editor >>> > of Windows, given Unicode uses Unix EOL. >>> >>> It's been well known for decades that Windows Notepad doesn't display >>> LF-terminated text files correctly. The solution is to use almost any >>> other editor. Notepad++ is free and a great alternative, but there are >>> plenty of others (no editor wars, please). >>> >> >> This has changed recently in Windows 10, where the builtin Notepad app >> now parses text files using LF only correctly (you can edit and save using >> the same convention for newlines, which is now autodetected; Notepad still >> creates new files using CRLF and saves them after edit using CRLF). >> >> I would love to have a notepad that handled \n. > My system is up to date. > What update must I get to have notepad handle newline only files? > (and I dare say notepad is the ONLY program that doesn't handle either > convention, command line `edit` and `wordpad`(write) even handled them) > I'm sure there exists other programs that do it wrong; but none I've ever > used or found, or written. > > Notepad now displays the newline convention in the status bar as "Windows >> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column >> counters. There's still no preference interface to specify the default >> convention: CRLF is still the the default for new files. >> >> And no way to switch the convention before saving. In Notepad++ you do >> that with menu "Edit" > "Convert newlines" and select one of "Convert to >> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)" >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Sep 7 13:19:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 7 Sep 2018 20:19:58 +0200 Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: See also this page: https://blogs.windows.com/windowsexperience/2018/05/09/announcing-windows-10-insider-preview-build-17666/ Le ven. 7 sept. 2018 ? 20:18, Philippe Verdy a ?crit : > That version has been announced in the Windows 10 Hub several weeks ago. I > think it is part of the 1809 version (for now RS5 prerelease for Insiders) > that may be deployed in the final release coming soon. 
> I hope you'll have also the option to switch the newline convention after > loading and before saving to convert these newlines. and may be define the > new default preference, so we will finally forget the CRLF convention. > > I have it working quite well inthe Insider fast ring. > > In all IDE editors however (including Developer Studio), the 2 or 3 > conventions were still available since long. > > Le ven. 7 sept. 2018 ? 20:04, J Decker a ?crit : > >> >> >> On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode < >> unicode at unicode.org> wrote: >> >>> >>> >>> Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode < >>> unicode at unicode.org> a ?crit : >>> >>>> Marcel Schneider wrote: >>>> >>>> > BTW what I conjectured about the role of line breaks is true for CSV >>>> > too, and any file downloaded from UCD on a semicolon separator basis >>>> > becomes unusable when displayed straight in the built-in text editor >>>> > of Windows, given Unicode uses Unix EOL. >>>> >>>> It's been well known for decades that Windows Notepad doesn't display >>>> LF-terminated text files correctly. The solution is to use almost any >>>> other editor. Notepad++ is free and a great alternative, but there are >>>> plenty of others (no editor wars, please). >>>> >>> >>> This has changed recently in Windows 10, where the builtin Notepad app >>> now parses text files using LF only correctly (you can edit and save using >>> the same convention for newlines, which is now autodetected; Notepad still >>> creates new files using CRLF and saves them after edit using CRLF). >>> >>> I would love to have a notepad that handled \n. >> My system is up to date. >> What update must I get to have notepad handle newline only files? >> (and I dare say notepad is the ONLY program that doesn't handle either >> convention, command line `edit` and `wordpad`(write) even handled them) >> I'm sure there exists other programs that do it wrong; but none I've >> ever used or found, or written. >> >> Notepad now displays the newline convention in the status bar as "Windows >>> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column >>> counters. There's still no preference interface to specify the default >>> convention: CRLF is still the the default for new files. >>> >>> And no way to switch the convention before saving. In Notepad++ you do >>> that with menu "Edit" > "Convert newlines" and select one of "Convert to >>> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)" >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Sep 7 14:47:44 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Fri, 7 Sep 2018 12:47:44 -0700 Subject: UCD in XML or in CSV? (is: UCD in YAML) In-Reply-To: References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode < unicode at unicode.org> wrote: > That version has been announced in the Windows 10 Hub several weeks ago. > And it only took them 33 years. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Sep 7 15:00:40 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 07 Sep 2018 23:00:40 +0300 Subject: UCD in XML or in CSV? 
(is: UCD in YAML) In-Reply-To: (message from Rebecca Bettencourt via Unicode on Fri, 7 Sep 2018 12:47:44 -0700) References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: <83k1nxt6vr.fsf@gnu.org> > Date: Fri, 7 Sep 2018 12:47:44 -0700 > Cc: d3ck0r at gmail.com, Doug Ewell , > unicode > From: Rebecca Bettencourt via Unicode > > On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode wrote: > > That version has been announced in the Windows 10 Hub several weeks ago. > > And it only took them 33 years. :) That's OK, because Unix tools cannot handle Windows end-of-line format to this very day. About the only one I know of is Emacs (which handles all 3 known EOL formats independently of the platform on which it runs, since 20 years ago). From unicode at unicode.org Fri Sep 7 19:29:12 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 8 Sep 2018 02:29:12 +0200 (CEST) Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD in YAML)) Message-ID: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21> On 07/09/18 22:07 Eli Zaretskii via Unicode wrote: > > > Date: Fri, 7 Sep 2018 12:47:44 -0700 > > Cc: d3ck0r at gmail.com, Doug Ewell , > > unicode > > From: Rebecca Bettencourt via Unicode > > > > On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode wrote: > > > > That version has been announced in the Windows 10 Hub several weeks ago. > > > > And it only took them 33 years. :) > > That's OK, because Unix tools cannot handle Windows end-of-line format > to this very day. About the only one I know of is Emacs (which > handles all 3 known EOL formats independently of the platform on which > it runs, since 20 years ago). What are you referring to when you say ?Unix tools?? Another text editor?the built-in one of many Linux distributions?Gedit allows to choose from ?Unix/Linux?, ?Mac OS Classic?, and ?Windows?, in the Save dialog. But in the preferences I cannot retrieve how to default it to any of the latter two. I?m referring to Ubuntu 16.04. When on Windows in Notepad++ I prefer LF over CRLF because it makes for simpler regexes, and the middle thing between these and plain search is more handy too. (I use \n in regexes rather than the $ convention.) Thanks to Philippe for the Windows 10 news! Best regards, Marcel From unicode at unicode.org Fri Sep 7 20:03:38 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 8 Sep 2018 03:03:38 +0200 (CEST) Subject: Shortcuts question (is: Thread transfer info) In-Reply-To: <534252510.112927.1536325426517@ox.hosteurope.de> References: <534252510.112927.1536325426517@ox.hosteurope.de> Message-ID: <1585982937.16442.1536368619015.JavaMail.www@wwinf1m21> Hello, I?ve followed up on CLDR-users: https://unicode.org/pipermail/cldr-users/2018-September/000837.html As a sidenote ? It might be hard to get a selection of discussions actually happen on CLDR-users instead of Unicode Public mail list, as long as subscribers of this list don?t necessarily subscribe to the other list, too, that still has way less subscribers than Unicode Public. Regards, Marcel From unicode at unicode.org Fri Sep 7 20:50:38 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Sat, 8 Sep 2018 10:50:38 +0900 Subject: UCD in XML or in CSV? 
(is: UCD in YAML) In-Reply-To: References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com> Message-ID: <67b2d03c-d565-8cae-908d-a3519eceb8eb@it.aoyama.ac.jp> On 2018/09/08 04:47, Rebecca Bettencourt via Unicode wrote: > On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode < > unicode at unicode.org> wrote: > >> That version has been announced in the Windows 10 Hub several weeks ago. >> > > And it only took them 33 years. :) I used to joke that Notepad would add one single feature for each new version of Windows. I think that was when the Save-As feature was added. For a long time, I have set up Notepad++ to come up when Notepad is invoked. Regards, Martin. From unicode at unicode.org Sat Sep 8 01:47:23 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sat, 08 Sep 2018 09:47:23 +0300 Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD in YAML)) In-Reply-To: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21> (message from Marcel Schneider on Sat, 8 Sep 2018 02:29:12 +0200 (CEST)) References: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21> Message-ID: <83d0totric.fsf@gnu.org> > Date: Sat, 8 Sep 2018 02:29:12 +0200 (CEST) > From: Marcel Schneider > Cc: RebeccaBettencourt , verdy_p at wanadoo.fr, > d3ck0r at gmail.com, doug at ewellic.org, unicode at unicode.org > > > > And it only took them 33 years. :) > > > > That's OK, because Unix tools cannot handle Windows end-of-line format > > to this very day. About the only one I know of is Emacs (which > > handles all 3 known EOL formats independently of the platform on which > > it runs, since 20 years ago). > > What are you referring to when you say ?Unix tools?? Sed and Grep don't consider CRLF as end of line, so regexps with $ fail to work as intended; the shell and/or the kernel don't recognize the shebang sequence if it ends in CRLF, system editors display those pesky "^M" at the end of each line, etc. And if you have bad luck of using a Mac-style file, where a single CR ends a line, all bets are off. > Another text editor?the built-in one of many Linux distributions?Gedit allows > to choose from ?Unix/Linux?, ?Mac OS Classic?, and ?Windows?, in the Save dialog. Gedit is not a valid example when you compare it with Notepad. Please compare with editors which come with the OS out of the box: ed, ex, vi, etc. Because Gedit and Emacs are also available on Windows, so they make the point moot. From unicode at unicode.org Sat Sep 8 11:36:00 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sat, 8 Sep 2018 18:36:00 +0200 Subject: Unicode String Models Message-ID: I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome. https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Sep 8 16:01:32 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 8 Sep 2018 15:01:32 -0600 Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD In-Reply-To: References: Message-ID: <1F9AC3158DC04AF2A626C60FAD00B877@DougEwell> To finish (I hope) this thread: 1. Glad to know that Notepad is getting some modern updates, even if belatedly. 2. Sorry that there are still tools out there, on different platforms, that can't handle each other's EOL conventions. 
(Of course, this is the problem Unicode was trying to solve by introducing LS and PS, but we know how that went.) 3. Unicode data files can be read and processed on any platform, but some careful choice of reading and processing tools might be advisable. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sun Sep 9 02:59:29 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 9 Sep 2018 08:59:29 +0100 Subject: Unicode String Models In-Reply-To: References: Message-ID: <20180909085929.2d4ff0d2@JRWUBU2> On Sat, 8 Sep 2018 18:36:00 +0200 Mark Davis ☕️ via Unicode wrote: > I recently did some extensive revisions of a paper on Unicode string > models (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# Theoretically at least, the cost of indexing a big string by codepoint is negligible. For example, the cost of accessing the middle character is O(1)*, not O(n), where n is the length of the string. The trick is to use a proportionately small amount of memory to store and maintain a partial conversion table from character index to byte index. For example, Emacs claims to offer O(1) access to a UTF-8 buffer by character number, and I can't significantly fault the claim. *There may be some creep, but it doesn't matter for strings that can be stored within a galaxy. Of course, the coefficients implied by big-oh notation also matter. For example, it can be very easy to forget that a bubble sort is often the quickest sorting algorithm. You keep muttering that a sequence of 8-bit code units can contain invalid sequences, but often forget that that is also true of sequences of 16-bit code units. Do emoji now ensure that confusion between codepoints and code units rapidly comes to light? You seem to keep forgetting that grapheme clusters are not how some people work. Does the English word 'café' contain the letter 'e'? Yes or no? I maintain that it does. I can't help thinking that one might want to look for the letter 'ă' in Vietnamese and find it whatever the associated tone mark is. You didn't discuss substrings. I'm interested in how subsequences of strings are defined, as the concept of 'substring' isn't really Unicode compliant. Again, expressing 'ă' as a subsequence of the Vietnamese word 'nặng' ought to be possible, whether one is using NFD (easier) or NFC. (And there are alternative normalisations that are compatible with canonical equivalence.) I'm most interested in subsequences X of a word W where W is the same as AXB for some strings A and B. Richard. From unicode at unicode.org Sun Sep 9 03:00:27 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 9 Sep 2018 10:00:27 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Thanks, excellent comments. While it is clear that some string models have more complicated structures (with their own pros and cons), my focus was on simple internal structures. The focus was also on immutable strings (the tradeoffs for mutable ones can be quite different), and that needs to be clearer. I'll add some material about those two areas (with pointers to sources where possible). Mark On Sat, Sep 8, 2018 at 9:20 PM John Cowan wrote: > This paper makes the default assumption that the internal storage of a > string is a featureless array. If this assumption is abandoned, it is > possible to get O(1) indexes with fairly low space overhead.
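To make that trick concrete, here is a minimal illustrative sketch in Rust (my own toy code, not what Emacs or SRFI 135 actually does): record the byte offset of every K-th code point once, and a later lookup by code point index only scans forward at most K-1 code points from the nearest recorded offset, so the side table stays proportionately small while random access stays cheap.

    // Sparse code point -> byte offset index over a UTF-8 string.
    // Memory: one usize per K code points. Lookup: O(K) scan from the
    // nearest checkpoint instead of O(n) from the start of the string.
    struct IndexedStr<'a> {
        text: &'a str,
        every: usize,            // K: checkpoint spacing, in code points
        checkpoints: Vec<usize>, // byte offsets of code points 0, K, 2K, ...
    }

    impl<'a> IndexedStr<'a> {
        fn new(text: &'a str, every: usize) -> Self {
            let checkpoints: Vec<usize> = text
                .char_indices()
                .enumerate()
                .filter(|&(i, _)| i % every == 0)
                .map(|(_, (byte, _))| byte)
                .collect();
            IndexedStr { text, every, checkpoints }
        }

        // Byte offset of the n-th code point, if the string has that many.
        fn byte_of_char(&self, n: usize) -> Option<usize> {
            let start = *self.checkpoints.get(n / self.every)?;
            self.text[start..]
                .char_indices()
                .nth(n % self.every)
                .map(|(off, _)| start + off)
        }

        fn char_at(&self, n: usize) -> Option<char> {
            let byte = self.byte_of_char(n)?;
            self.text[byte..].chars().next()
        }
    }

    fn main() {
        let s = "café nặng";             // 9 code points, 12 bytes
        let idx = IndexedStr::new(s, 4); // checkpoints at code points 0, 4, 8
        assert_eq!(idx.char_at(3), Some('é'));
        assert_eq!(idx.char_at(6), Some('ặ'));
        assert_eq!(idx.char_at(9), None);
        println!("byte offset of code point 6: {:?}", idx.byte_of_char(6));
    }

The same bookkeeping also has to be maintained across edits, which is where the real cost shows up in an editor rather than in a read-only string.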
The Scheme > language has recently adopted immutable strings called "texts" as a > supplement to its pre-existing mutable strings, and the sample > implementation for this feature uses a vector of either native strings or > bytevectors (char[] vectors in C/Java terms). I would urge anyone > interested in the question of storing and accessing mutable strings to read > the following parts of SRFI 135 at < > https://srfi.schemers.org/srfi-135/srfi-135.html>: Abstract, Rationale, > Specification / Basic concepts, and Implementation. In addition, the > design notes at , > though not up to date (in particular, UTF-16 internals are now allowed as > an alternative to UTF-8), are of interest: unfortunately, the link to the > span API has rotted. > > On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ?? via Unicore < > unicore at unicode.org> wrote: > >> I recently did some extensive revisions of a paper on Unicode string >> models (APIs). Comments are welcome. >> >> >> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# >> >> Mark >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Sep 9 03:56:15 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sun, 09 Sep 2018 10:56:15 +0200 Subject: Unicode String Models In-Reply-To: (Mark Davis's message of "Sat, 8 Sep 2018 18:36:00 +0200") References: Message-ID: <868t4b3v80.fsf@mimuw.edu.pl> On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ?? via Unicode wrote: > I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome. > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# It's a good opportunity to propose a better term for "extended grapheme cluster", which usually are neither extended nor clusters, it's also not obvious that they are always graphemes. Cf.the earlier threads https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Sun Sep 9 08:42:19 2018 From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode) Date: Sun, 9 Sep 2018 15:42:19 +0200 Subject: Unicode String Models In-Reply-To: References: Message-ID: Hello,? I find your notion of "model" and presentation a bit confusing since it conflates what I would call the internal representation and the API.? The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure. The API defines how the Unicode text is accessed, expressed by what is the result of an indexing operation on the string. The latter is really what matters for the end-user and what I would call the "model". I think the presentation would benefit from making a clear distinction between the internal representation and the API; you could then easily summarize them in a table which would make a nice summary of the design space. I also think you are missing one API which is the one with ECG I would favour: indexing returns Unicode scalar values,?internally be it whatever you wish UTF-{8,16,32} or a custom encoding. 
Maybe that's what you intended by the "Code Point Model: Internal 8/16/32", but that's not what it says; the distinction between code point and scalar value is an important one, and I think it would be good to insist on it in such documents to keep the concepts clear. Best, Daniel From unicode at unicode.org Sun Sep 9 09:10:26 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 9 Sep 2018 16:10:26 +0200 Subject: Unicode String Models In-Reply-To: <20180909085929.2d4ff0d2@JRWUBU2> References: <20180909085929.2d4ff0d2@JRWUBU2> Message-ID: Le dim. 9 sept. 2018 à 10:10, Richard Wordingham via Unicode < unicode at unicode.org> a écrit : > On Sat, 8 Sep 2018 18:36:00 +0200 > Mark Davis ☕️ via Unicode wrote: > > > I recently did some extensive revisions of a paper on Unicode string > > models (APIs). Comments are welcome. > > > > > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# > > > Theoretically at least, the cost of indexing a big string by codepoint > is negligible. For example, cost of accessing the middle character is > O(1)*, not O(n), where n is the length of the string. The trick is to > use a proportionately small amount of memory to store and maintain a > partial conversion table from character index to byte index. For > example, Emacs claims to offer O(1) access to a UTF-8 buffer by > character number, and I can't significantly fault the claim. > I fully agree, as long as the "middle" character is **approximated** by the middle of the **encoded** length. But if it has to be the exact middle (by code point number), you have to count the codepoints exactly by parsing the whole string as O(n), then compute the middle from it and parse again from the beginning to locate the encoded position of that code point index as O(n/2), so the final cost is O(n*3/2). The trick of using a "small amount" of memory is only there to avoid the second parse and get an O(n) result. You get O(1)* only if you keep that "small memory" to locate the indexes. But the claim that it is "small" is wrong if the string is large (big value of n), and it has no benefit if the string is indexed only once. In practice, we pay that memory by preparing the "small memory" while instantiating a new iterator that will process the whole string (which may not be fully loaded in memory, in which case that "small memory" will need reallocation as we also read the whole string (but not necessarily keep it in memory if it's a very long text file: the index buffer will still remain in memory even if we no longer need to come back to the start of the string)). That "small memory" is just a local helper, and its cost must be evaluated. In practice however, long texts come from I/O: the text will have its interface from files, in which case you'll benefit from the filesystem cache of the OS to save I/O, or from the network (in which case you'll need to store the network data in a local temporary file if you don't want to keep it fully in memory and allow some parts to be paged out of memory by the OS). But in Emacs, it only works with files: network texts are necessarily backed at least by a local temporary file. So that "small memory" for the index is not even needed (but Emacs maintains an index in memory only to locate line numbers.
It has no need to do that for column numbers, as it is just faster to rescan the line (and extremely long lines of text are exceptional, these files are rarely edited with Emacs, unless you use it to load a binary file, whose representation on screen will be very different, notably for controls, which are expanded into another cached form: the column index for display, which is different from the code point index and specific to the Emacs representation for display/editing, is built only line by line, separately from the line index kept for the whole edited file; it is also independant of the effective encoding: it would still be needed even if the encoding of the backing buffer was UTF-32 with only 1 codepoint per code unit, becase the actual display will still expand the code points to other forms using visible escaping mechanisms, and it is even needed when the file is pure 7-bit ASCII, and kept with one byte per code point: choosing the Unicode encoding forms has no impact at all to what is really needed for display in text editors). Text editors use various indexing caches always, to manage memory, I/O, and allow working on large texts even on systems with low memory available. As much as possible they attempt to use the OS-level caches of the filesystem. And in all cases, they don't work directly on their text buffer (whose internal represenation in their backing store is not just a single string, but a structured collection of buffers, built on top of an interface masking the details: the effective text will then be reencoded and saved from that object, using complex serialization schemes; the text buffer is "virtualized"). Only very basic text editors (such as Notepad) use a native single text buffer, but they are very slow when editing very large files as they constantly need to copy/move large blocks of memory to perform inserts/deletions, and they also use too much the memory reallocator. Even vi(m) or (s)ed in Unix/Linux now use another internal encoded form with a temporary backing store in temporary files, created automatically when needed as you start modifying the content. The final consolidation and serialization will occur only when saving the result. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Sep 9 10:53:12 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 09 Sep 2018 18:53:12 +0300 Subject: Unicode String Models In-Reply-To: (message from Philippe Verdy via Unicode on Sun, 9 Sep 2018 16:10:26 +0200) References: <20180909085929.2d4ff0d2@JRWUBU2> Message-ID: <838t4ar7kn.fsf@gnu.org> > Date: Sun, 9 Sep 2018 16:10:26 +0200 > Cc: unicode Unicode Discussion > From: Philippe Verdy via Unicode > > In practive, we use a memory by preparing the "small memory" while instantiating a new iterator that will > process the whole string (which may not be fully loaded in memory, in which case that "small memory" will > need reallocation as we also read the whole string (but not necessarily keep it in memory if it's a very long > text file: the index buffer will still remain in memory even if we no longer need to come back to the start of the > string). That "small memory" is just a local helper, its cost must be evaluated. 
In practice however, long texts > come from I/O: the text will have its interface from files, in which case you'll benefit from the filesystem cache > of the OS to save I/O, or from network (in which case you'll need to store the network data in a local > temporary file if you don't want to keep it fully in memory and allow some parts to be paged out of memory by > the OS. But in Emacs, it only works with files: network texts are necessarily backed at least by a local > temporary file. Emacs maintains caches for byte to character conversions for both strings and buffers. The cache holds data only for the last string and separately the last buffer where Emacs needed to convert character counts to byte counts or vice versa. For buffers, there are 4 places that are maintained for every buffer at all times, for which both the character and byte positions are known, and Emacs uses those whenever it needs to do conversions for a buffer that is not the cached one. > So that "small memory" for the index is not even needed (but Emacs maintains an index in memory only to > locate line numbers. That's a different cache, unrelated to what Richard was alluding to (and I think unrelated to the current discussion). > Text editors use various indexing caches always, to manage memory, I/O, and allow working on large texts > even on systems with low memory available. As much as possible they attempt to use the OS-level caches > of the filesystem. And in all cases, they don't work directly on their text buffer (whose internal represenation in > their backing store is not just a single string, but a structured collection of buffers, built on top of an interface > masking the details: the effective text will then be reencoded and saved from that object, using complex > serialization schemes; the text buffer is "virtualized"). In Emacs, buffer text is a character string with a gap, actually. From unicode at unicode.org Sun Sep 9 12:35:47 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 9 Sep 2018 19:35:47 +0200 Subject: Unicode String Models In-Reply-To: <838t4ar7kn.fsf@gnu.org> References: <20180909085929.2d4ff0d2@JRWUBU2> <838t4ar7kn.fsf@gnu.org> Message-ID: Le dim. 9 sept. 2018 ? 17:53, Eli Zaretskii a ?crit : > > Text editors use various indexing caches always, to manage memory, I/O, > and allow working on large texts > > even on systems with low memory available. As much as possible they > attempt to use the OS-level caches > > of the filesystem. And in all cases, they don't work directly on their > text buffer (whose internal represenation in > > their backing store is not just a single string, but a structured > collection of buffers, built on top of an interface > > masking the details: the effective text will then be reencoded and saved > from that object, using complex > > serialization schemes; the text buffer is "virtualized"). > > In Emacs, buffer text is a character string with a gap, actually. > A text buffer with gaps is a complex structure, not just a plain string. Gaps are one way to manage memory more efficiently and get reasonnable performance when editing, without having to constantly move large blocks: these "strings" with gaps may then actually be just a byte buffer using as a backing store, but that buffer alone does not represent only the currently represented text. A process will still serialize and perform cleanup befire this buffer can be used to save the edited text. 
Emacs may not necessarily deallocate the end of the buffer, but I doubt it constantly uses a single gap at the end (insertions and deletions in the middle would constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users do not want to see what they type appearing on the screen at one keystroke every few seconds because each typed key causes massive block moves and excessive memory paging from/to disk while this move is being performed). All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have small gaps), which are occasionally merged or split when needed (merging does not cause any reallocation but may free one of the buffers), some of them being paged out to temporary files when memory is stressed. There are some heuristics in the editor's code about when maintenance of the collection is really needed and useful for performance. But besides this, the performance cost of UTF indexing of the codepoints is invisible: each buffer will only need to avoid breaking text between codepoint boundaries, if the current encoding of the edited text is a UTF. An editor may also avoid breaking buffers in the middle of clusters if it renders clusters (including ligatures if they are supported): clusters are still small in size in every encoding, and reasonable buffer sizes can hold at least hundreds of clusters (even the largest ones, which occur rarely). How editors manage clusters to make them editable is dependent on the implementation, but even the UTF or codepoint boundaries are not enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where parts may be paged out (and will also include more than just the current text, notably parts of the indexes, possibly in another temporary working file). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Sep 9 14:20:16 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Sun, 09 Sep 2018 22:20:16 +0300 Subject: Unicode String Models In-Reply-To: (message from Philippe Verdy on Sun, 9 Sep 2018 19:35:47 +0200) References: <20180909085929.2d4ff0d2@JRWUBU2> <838t4ar7kn.fsf@gnu.org> Message-ID: <834leyqxzj.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 9 Sep 2018 19:35:47 +0200 > Cc: Richard Wordingham , > unicode Unicode Discussion > > In Emacs, buffer text is a character string with a gap, actually. > > A text buffer with gaps is a complex structure, not just a plain string. The difference is very small, and a couple of macros allow you to almost forget about the gap. > I doubt it constantly uses a single gap at the end (insertions and deletions in the middle would > constantly move large blocks and use excessive CPU and memory bandwidth, with very slow response: users > do not want to see what they type appearing on the screen at one keystroke every few seconds because each > typed key causes massive block moves and excessive memory paging from/to disk while this move is being > performed). In Emacs, the gap is always where the text is inserted or deleted, be it in the middle of text or at its end.
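For readers who have not run into the structure being discussed, here is a toy gap buffer sketch in Rust (purely illustrative, not Emacs's actual C implementation): the text lives in one array with a hole at the edit point, so an insertion only writes into the hole, and bytes move only when the cursor, and therefore the gap, moves.

    // Toy gap buffer: one byte array with a "gap" at the edit point.
    // Insertions fill the gap; moving the cursor moves the gap, not the text.
    struct GapBuffer {
        buf: Vec<u8>,
        gap_start: usize, // first byte of the gap
        gap_end: usize,   // one past the last byte of the gap
    }

    impl GapBuffer {
        fn new(capacity: usize) -> Self {
            GapBuffer { buf: vec![0; capacity], gap_start: 0, gap_end: capacity }
        }

        // Move the gap so that it starts at byte position `pos` of the text.
        fn move_gap(&mut self, pos: usize) {
            while self.gap_start > pos {
                // shift one byte from before the gap to after it
                self.gap_start -= 1;
                self.gap_end -= 1;
                self.buf[self.gap_end] = self.buf[self.gap_start];
            }
            while self.gap_start < pos {
                // shift one byte from after the gap to before it
                self.buf[self.gap_start] = self.buf[self.gap_end];
                self.gap_start += 1;
                self.gap_end += 1;
            }
        }

        // Insert text at byte position `pos` (must be a UTF-8 boundary).
        fn insert(&mut self, pos: usize, s: &str) {
            self.move_gap(pos);
            assert!(s.len() <= self.gap_end - self.gap_start, "gap full: a real buffer would grow here");
            self.buf[self.gap_start..self.gap_start + s.len()].copy_from_slice(s.as_bytes());
            self.gap_start += s.len();
        }

        fn text(&self) -> String {
            let mut out = Vec::with_capacity(self.buf.len());
            out.extend_from_slice(&self.buf[..self.gap_start]);
            out.extend_from_slice(&self.buf[self.gap_end..]);
            String::from_utf8(out).unwrap()
        }
    }

    fn main() {
        let mut gb = GapBuffer::new(64);
        gb.insert(0, "Hello world");
        gb.insert(5, ","); // the gap moves to byte 5; only the tail shifts once
        assert_eq!(gb.text(), "Hello, world");
    }

Growing the gap when it fills up, and keeping it off the middle of a multi-byte character, is where the real complexity starts; the point is only that typing at a fixed spot never moves the rest of the text.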
> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have > small gaps), which are occasionnally merged or splitted when needed (merging does not cause any > reallocation but may free one of the buffers), some of them being paged out to tempoary files when memory is > stressed. There are some heuristics in the editor's code to when mainatenance of the collection is really > needed and useful for the performance. My point was to say that Emacs is not one of these editors you describe. > But beside this the performance cost of UTF indexing of the codepoints is invisible: each buffer will only need > to avoid breaking text between codepoint boundaries, if the current encoding of the edited text is an UTF. An > editor may also avoid breaking buffers in the middle of clusters if they render clusters (including ligatures if > they are supported): clusters are still small in size in every encoding and reasonnable buffer sizes can hold at > least hundreds of clusters (even the largest ones which occur rarely). How editors will manage clusters to > make them editable is dependant of the implementation, buyt even the UTF or codepoints boundaries are not > enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where > parts may be paged out (and will also include more than just the current text, notably it will include parts of the > indexes, possibly in another temporary working file). You ignore or disregard the need to represent raw bytes in editor buffers. That is when the encoding stops being "invisible". From unicode at unicode.org Mon Sep 10 11:05:48 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Mon, 10 Sep 2018 18:05:48 +0200 Subject: Unicode String Models In-Reply-To: <834leyqxzj.fsf@gnu.org> References: <20180909085929.2d4ff0d2@JRWUBU2> <838t4ar7kn.fsf@gnu.org> <834leyqxzj.fsf@gnu.org> Message-ID: > On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode wrote: > > In Emacs, the gap is always where the text is inserted or deleted, be > it in the middle of text or at its end. > >> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have >> small gaps), which are occasionnally merged or splitted when needed (merging does not cause any >> reallocation but may free one of the buffers), some of them being paged out to tempoary files when memory is >> stressed. There are some heuristics in the editor's code to when mainatenance of the collection is really >> needed and useful for the performance. > > My point was to say that Emacs is not one of these editors you > describe. FYI, gap and rope buffers are described at [1-2]; also see the Emacs manual [3]. 1. https://en.wikipedia.org/wiki/Gap_buffer 2. https://en.wikipedia.org/wiki/Rope_(data_structure) 3. https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html From unicode at unicode.org Tue Sep 11 05:12:40 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Tue, 11 Sep 2018 13:12:40 +0300 Subject: Unicode String Models In-Reply-To: References: Message-ID: On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ?? via Unicode wrote: > > I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome. 
> > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit# * The Grapheme Cluster Model seems to have a couple of disadvantages that are not mentioned: 1) The subunit of string is also a string (a short string conforming to particular constraints). There's a need for *another* more atomic mechanism for examining the internals of the grapheme cluster string. 2) The way an arbitrary string is divided into units when iterating over it changes when the program is executed on a newer version of the language runtime that is aware of newly-assigned codepoints from a newer version of Unicode. * The Python 3.3 model mentions the disadvantages of memory usage cliffs but doesn't mention the associated perfomance cliffs. It would be good to also mention that when a string manipulation causes the storage to expand or contract, there's a performance impact that's not apparent from the nature of the operation if the programmer's intuition works on the assumption that the programmer is dealing with UTF-32. * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM text node storage in Gecko, (I believe but am not 100% sure) V8 and, optionally, HotSpot (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A). That is, text has UTF-16 semantics, but if the high half of every code unit in a string is zero, only the lower half is stored. This has properties analogous to the Python 3.3 model, except non-BMP doesn't expand to UTF-32 but uses UTF-16 surrogate pairs. * I think the fact that systems that chose UTF-16 or UTF-32 have implemented models that try to save storage by omitting leading zeros and gaining complexity and performance cliffs as a result is a strong indication that UTF-8 should be recommended for newly-designed systems that don't suffer from a forceful legacy need to expose UTF-16 or UTF-32 semantics. * I suggest splitting the "UTF-8 model" into three substantially different models: 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No UTF-8-related operations are performed when ingesting byte-oriented data. Byte buffers and text buffers are type-wise ambiguous. Only iterating over byte data by code point gives the data the UTF-8 interpretation. Unless the data is cleaned up as a side effect of such iteration, malformed sequences in input survive into output. 2) UTF-8 without full trust in ability to retain validity (the model of the UTF-8-using C++ parts of Gecko; I believe this to be the most common UTF-8 model for C and C++, but I don't have evidence to back this up): When data is ingested with text semantics, it is converted to UTF-8. For data that's supposed to already be in UTF-8, this means replacing malformed sequences with the REPLACEMENT CHARACTER, so the data is valid UTF-8 right after input. However, iteration by code point doesn't trust ability of other code to retain UTF-8 validity perfectly and has "else" branches in order not to blow up if invalid UTF-8 creeps into the system. 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers have a different type in the type system than byte buffers. To go from a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data has been tagged as valid UTF-8, the validity is trusted completely so that iteration by code point does not have "else" branches for malformed sequences. If data that the type system indicates to be valid UTF-8 wasn't actually valid, it would be nasal demon time. 
The language has a default "safe" side and an opt-in "unsafe" side. The unsafe side is for performing low-level operations in a way where the responsibility of upholding invariants is moved from the compiler to the programmer. It's impossible to violate the UTF-8 validity invariant using the safe part of the language. * After working with different string models, I'd recommend the Rust model for newly-designed programming languages. (Not because I work for Mozilla but because I believe Rust's way of dealing with Unicode is the best I've seen.) Rust's standard library provides Unicode version-independent iterations over strings: by code unit and by code point. Iteration by extended grapheme cluster is provided by a library that's easy to include due to the nature of Rust package management (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8 buffer as a read-only byte buffer has zero run-time cost and allows for maximally fast guaranteed-valid-UTF-8 output. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Tue Sep 11 06:13:03 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 11 Sep 2018 14:13:03 +0300 Subject: Unicode String Models In-Reply-To: (message from Henri Sivonen via Unicode on Tue, 11 Sep 2018 13:12:40 +0300) References: Message-ID: <83va7cmgn4.fsf@gnu.org> > Date: Tue, 11 Sep 2018 13:12:40 +0300 > From: Henri Sivonen via Unicode > > * I suggest splitting the "UTF-8 model" into three substantially > different models: > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > UTF-8-related operations are performed when ingesting byte-oriented > data. Byte buffers and text buffers are type-wise ambiguous. Only > iterating over byte data by code point gives the data the UTF-8 > interpretation. Unless the data is cleaned up as a side effect of such > iteration, malformed sequences in input survive into output. > > 2) UTF-8 without full trust in ability to retain validity (the model > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > common UTF-8 model for C and C++, but I don't have evidence to back > this up): When data is ingested with text semantics, it is converted > to UTF-8. For data that's supposed to already be in UTF-8, this means > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > data is valid UTF-8 right after input. However, iteration by code > point doesn't trust ability of other code to retain UTF-8 validity > perfectly and has "else" branches in order not to blow up if invalid > UTF-8 creeps into the system. > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > have a different type in the type system than byte buffers. To go from > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > has been tagged as valid UTF-8, the validity is trusted completely so > that iteration by code point does not have "else" branches for > malformed sequences. If data that the type system indicates to be > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > language has a default "safe" side and an opt-in "unsafe" side. The > unsafe side is for performing low-level operations in a way where the > responsibility of upholding invariants is moved from the compiler to > the programmer. It's impossible to violate the UTF-8 validity > invariant using the safe part of the language. There's another model, the one used by Emacs. AFAIU, it is different from all the 3 you describe above. 
In Emacs, each raw byte belonging to a byte sequence which is invalid under UTF-8 is represented as a special multibyte sequence. IOW, Emacs's internal representation extends UTF-8 with multibyte sequences it uses to represent raw bytes. This allows mixing stray bytes and valid text in the same buffer, without risking lossy conversions (such as those one gets under model 2 above). From unicode at unicode.org Tue Sep 11 09:19:58 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 11 Sep 2018 07:19:58 -0700 Subject: Unicode String Models In-Reply-To: <83va7cmgn4.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> Message-ID: These are all interesting and useful comments. I'll be responding once I get a bit of free time, probably Friday or Saturday. Mark On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode < unicode at unicode.org> wrote: > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > > UTF-8-related operations are performed when ingesting byte-oriented > > data. Byte buffers and text buffers are type-wise ambiguous. Only > > iterating over byte data by code point gives the data the UTF-8 > > interpretation. Unless the data is cleaned up as a side effect of such > > iteration, malformed sequences in input survive into output. > > > > 2) UTF-8 without full trust in ability to retain validity (the model > > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > > common UTF-8 model for C and C++, but I don't have evidence to back > > this up): When data is ingested with text semantics, it is converted > > to UTF-8. For data that's supposed to already be in UTF-8, this means > > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > > data is valid UTF-8 right after input. However, iteration by code > > point doesn't trust ability of other code to retain UTF-8 validity > > perfectly and has "else" branches in order not to blow up if invalid > > UTF-8 creeps into the system. > > > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > > have a different type in the type system than byte buffers. To go from > > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > > has been tagged as valid UTF-8, the validity is trusted completely so > > that iteration by code point does not have "else" branches for > > malformed sequences. If data that the type system indicates to be > > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > > language has a default "safe" side and an opt-in "unsafe" side. The > > unsafe side is for performing low-level operations in a way where the > > responsibility of upholding invariants is moved from the compiler to > > the programmer. It's impossible to violate the UTF-8 validity > > invariant using the safe part of the language. > > There's another model, the one used by Emacs. AFAIU, it is different > from all the 3 you describe above. In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). 
> -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Sep 11 12:13:28 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 11 Sep 2018 19:13:28 +0200 Subject: Unicode String Models In-Reply-To: <83va7cmgn4.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> Message-ID: > On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode wrote: > > In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). Can you give a reference detailing this format? From unicode at unicode.org Tue Sep 11 12:21:07 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 11 Sep 2018 20:21:07 +0300 Subject: Unicode String Models In-Reply-To: (message from Hans =?utf-8?Q?=C3=85berg?= on Tue, 11 Sep 2018 19:13:28 +0200) References: <83va7cmgn4.fsf@gnu.org> Message-ID: <83h8iwlzlo.fsf@gnu.org> > From: Hans ?berg > Date: Tue, 11 Sep 2018 19:13:28 +0200 > Cc: Henri Sivonen , > unicode at unicode.org > > > In Emacs, each raw byte belonging > > to a byte sequence which is invalid under UTF-8 is represented as a > > special multibyte sequence. IOW, Emacs's internal representation > > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > > This allows mixing stray bytes and valid text in the same buffer, > > without risking lossy conversions (such as those one gets under model > > 2 above). > > Can you give a reference detailing this format? There's no formal description as English text, if that's what you meant. The comments, macros and functions in the files src/character.[ch] in the Emacs source tree tell most of that story, albeit indirectly, and some additional info can be found in the section "Text Representation" of the Emacs Lisp Reference manual. From unicode at unicode.org Tue Sep 11 13:14:30 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 11 Sep 2018 20:14:30 +0200 Subject: Unicode String Models In-Reply-To: <83h8iwlzlo.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> Message-ID: <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> > On 11 Sep 2018, at 19:21, Eli Zaretskii wrote: > >> From: Hans ?berg >> Date: Tue, 11 Sep 2018 19:13:28 +0200 >> Cc: Henri Sivonen , >> unicode at unicode.org >> >>> In Emacs, each raw byte belonging >>> to a byte sequence which is invalid under UTF-8 is represented as a >>> special multibyte sequence. IOW, Emacs's internal representation >>> extends UTF-8 with multibyte sequences it uses to represent raw bytes. >>> This allows mixing stray bytes and valid text in the same buffer, >>> without risking lossy conversions (such as those one gets under model >>> 2 above). >> >> Can you give a reference detailing this format? > > There's no formal description as English text, if that's what you > meant. The comments, macros and functions in the files > src/character.[ch] in the Emacs source tree tell most of that story, > albeit indirectly, and some additional info can be found in the > section "Text Representation" of the Emacs Lisp Reference manual. OK. 
If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs. From unicode at unicode.org Tue Sep 11 13:40:54 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 11 Sep 2018 21:40:54 +0300 Subject: Unicode String Models In-Reply-To: <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> (message from Hans =?utf-8?Q?=C3=85berg?= on Tue, 11 Sep 2018 20:14:30 +0200) References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> Message-ID: <83efdznah5.fsf@gnu.org> > From: Hans ?berg > Date: Tue, 11 Sep 2018 20:14:30 +0200 > Cc: hsivonen at hsivonen.fi, > unicode at unicode.org > > If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs. Yes. And mixed encodings is not the only use case: it may well happen that the initial attempt to decode the file uses incorrect assumption about the encoding, for some reason. In addition, it is important that changing some portion of the file, then saving the modified text will never change any part that the user didn't touch, as will happen if invalid sequences are rejected at input time and replaced with something else. From unicode at unicode.org Tue Sep 11 14:10:03 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Tue, 11 Sep 2018 21:10:03 +0200 Subject: Unicode String Models In-Reply-To: <83efdznah5.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> Message-ID: <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> > On 11 Sep 2018, at 20:40, Eli Zaretskii wrote: > >> From: Hans ?berg >> Date: Tue, 11 Sep 2018 20:14:30 +0200 >> Cc: hsivonen at hsivonen.fi, >> unicode at unicode.org >> >> If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs. > > Yes. And mixed encodings is not the only use case: it may well happen > that the initial attempt to decode the file uses incorrect assumption > about the encoding, for some reason. > > In addition, it is important that changing some portion of the file, > then saving the modified text will never change any part that the user > didn't touch, as will happen if invalid sequences are rejected at > input time and replaced with something else. Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files with sections in different Cyrillic and Latin encodings, changing the editor encoding while typing. From unicode at unicode.org Tue Sep 11 16:48:48 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 11 Sep 2018 22:48:48 +0100 Subject: Unicode String Models In-Reply-To: <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> Message-ID: <20180911224848.3aa17406@JRWUBU2> On Tue, 11 Sep 2018 21:10:03 +0200 Hans ?berg via Unicode wrote: > Indeed, before UTF-8, in the 1990s, I recall some Russians using > LaTeX files with sections in different Cyrillic and Latin encodings, > changing the editor encoding while typing. Rather like some of the old Unicode list archives, which are just concatenations of a month's emails, with all sorts of 8-bit encodings and stretches of base64. 
Richard. From unicode at unicode.org Tue Sep 11 17:13:52 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 12 Sep 2018 00:13:52 +0200 Subject: Unicode String Models In-Reply-To: <20180911224848.3aa17406@JRWUBU2> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> Message-ID: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode wrote: > > On Tue, 11 Sep 2018 21:10:03 +0200 > Hans ?berg via Unicode wrote: > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> LaTeX files with sections in different Cyrillic and Latin encodings, >> changing the editor encoding while typing. > > Rather like some of the old Unicode list archives, which are just > concatenations of a month's emails, with all sorts of 8-bit encodings > and stretches of base64. It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this. From unicode at unicode.org Tue Sep 11 17:40:17 2018 From: unicode at unicode.org (J Decker via Unicode) Date: Tue, 11 Sep 2018 15:40:17 -0700 Subject: Unicode String Models In-Reply-To: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> Message-ID: On Tue, Sep 11, 2018 at 3:15 PM Hans ?berg via Unicode wrote: > > > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > > > On Tue, 11 Sep 2018 21:10:03 +0200 > > Hans ?berg via Unicode wrote: > > > >> Indeed, before UTF-8, in the 1990s, I recall some Russians using > >> LaTeX files with sections in different Cyrillic and Latin encodings, > >> changing the editor encoding while typing. > > > > Rather like some of the old Unicode list archives, which are just > > concatenations of a month's emails, with all sorts of 8-bit encodings > > and stretches of base64. > > It might be useful to represent non-UTF-8 bytes as Unicode code points. > One way might be to use a codepoint to indicate high bit set followed by > the byte value with its high bit set to 0, that is, truncated into the > ASCII range. For example, U+0080 looks like it is not in use, though I > could not verify this. > > it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80 (I'm probably off a bit in the leading byte) UTF-8 can represent from 0 to 0x200000 every value; (which is all defined codepoints) early varients can support up to U+7FFFFFFF... and there's enough bits to carry the pattern forward to support 36 bits or 42 bits... (the last one breaking the standard a bit by allowing a byte wihout one bit off... 0xFF would be the leadin) 0xF8-FF are unused byte values; but those can all be encoded into utf-8. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Sep 11 18:26:42 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 12 Sep 2018 00:26:42 +0100 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: References: <20180721020131.4b22887b@JRWUBU2> <20180721085026.6aa07876@JRWUBU2> Message-ID: <20180912002642.5e3c64a8@JRWUBU2> On Wed, 29 Aug 2018 21:42:57 +0000 Andrew Glass via Unicode wrote: > Thank you Richard and Shriramana for bringing up this interesting > problem. > > I agree we need to fix this. I don?t want to fix this with a font > hack or change to USE cluster rules or properties. I think the right > place to fix this is in the encoding. This might be either a new > character for Tamil Brahmi Pu??i ? as Shriramana has proposed > (L2/12-226) > ? or separate characters for Tamil Brahmi Short E and Tamil Brahmi > Short O in independent and dependent forms (4 characters total). I?m > inclined to think that a visible virama, Tamil Brahmi Pu??i, is the > right approach. While this would work, please remember that refusing to allow a virama after a vowel also makes USE inappropriate for Khmer and Tai Tham, which use H+consonant rather than consonant+H for subscript final consonants. Richard. From unicode at unicode.org Tue Sep 11 18:41:03 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 12 Sep 2018 01:41:03 +0200 Subject: Unicode String Models In-Reply-To: References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> Message-ID: No 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really **do** have UTF-8 encodings (using two bytes). The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid UTF-8 sequences, i.e by using a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!) This is what Java does for representing U+0000 by (0xC0,0x80) in the compiled Bytecode or via the C/C++ interface for JNI when converting the java string buffer into a C/C++ string terminated by a NULL byte (not part of the Java string content itself). That special sequence however is really exposed in the Java API as a true unsigned 16-bit code unit (char) with value 0x0000, and a valid single code point. The same can be done for reencoding each invalid byte in non-UTF-8 conforming texts using sequences with a "UTF-8-like" scheme (still compatible with plain UTF-8 for every valid UTF-8 texts): you may either: * (a) encode each invalid byte separately (using two bytes for each), or by encoding them by groups of 3-bits (represented using bytes 0xF8..0FF) and then needing 3 bytes in the encoding. * (b) encode a private starter (e.g. 0xFF), followed by a byte for the length of the raw bytes sequence that follows, and then the raw bytes sequence of that length without any reencoding: this will never be confused with other valid codepoints (however this scheme may no longer be directly indexable from arbitrary random positions, unlike scheme a which may be marginally longer longer) But both schemes (a) or (b) would be useful in editors allowing to edit arbitrary binary files as if they were plain-text, even if they contain null bytes, or invalid UTF-8 sequences (it's up to these editors to find a way to distinctively represent these bytes, and a way to enter/change them reliably. 
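As a minimal sketch of one possible concrete form of scheme (a) — the exact byte layout here (a lead byte 0xFA or 0xFB, which never occurs in valid UTF-8, followed by one continuation-style byte carrying the low six bits of the raw byte) is only an assumption for illustration, not any standard; decoding would invert it with ((lead & 0x07) << 6) | (cont & 0x3F):

    // Escape every byte that is not part of valid UTF-8 into a private
    // two-byte sequence, so the result round-trips exactly and never
    // collides with real UTF-8 (the output is, deliberately, not UTF-8).
    fn escape_invalid(mut input: &[u8]) -> Vec<u8> {
        let mut out = Vec::with_capacity(input.len());
        loop {
            match std::str::from_utf8(input) {
                Ok(s) => {
                    out.extend_from_slice(s.as_bytes());
                    return out;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    out.extend_from_slice(&input[..valid]);
                    // error_len() is None only for a truncated sequence at the
                    // very end of the buffer; escape it one byte at a time too.
                    let bad = e.error_len().unwrap_or(1);
                    for &b in &input[valid..valid + bad] {
                        out.push(0xF8 | (b >> 6));   // 0xFA or 0xFB: never a valid UTF-8 lead
                        out.push(0x80 | (b & 0x3F)); // low six bits of the raw byte
                    }
                    input = &input[valid + bad..];
                }
            }
        }
    }

    fn main() {
        // "abc", then a stray 0xFF, then a truncated three-byte sequence E2 82.
        let escaped = escape_invalid(b"abc\xFF\xE2\x82");
        assert_eq!(&escaped[..], &b"abc\xFB\xBF\xFB\xA2\xFA\x82"[..]);
    }

Such an escaping pass keeps valid text untouched, so an editor can show and re-save unrelated parts of the file byte for byte.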
There's also a possibility of extension if the backing store uses UTF-16, as all code units 0x0000.0xFFFF are used, but one scheme is possible by using unpaired surrogates (notably a low surrogate NOT prefixed by a high surrogate: the low surrogate already has 10 useful bits that can store any raw byte value in its lowest bits): this scheme allows indexing from random position and reliable sequencial traversal in both directions (backward or forward)... ... But the presence of such extension of UTF-16 means that all the implementation code handling standard text has to detect unpaired surrogates, and can no longer assume that a low surrogate necessarily has a high surrogate encoded just before it: it must be tested and that previous position may be before the buffer start, causing a possibly buffer overrun in backward direction (so the code will need to also know the start position of the buffer and check it, or know the index which cannot be negative), possibly exposing unrelated data and causing some security risks, unless the backing store always adds a leading "guard" code unit set arbitrarily to 0x0000. Le mer. 12 sept. 2018 ? 00:48, J Decker via Unicode a ?crit : > > > On Tue, Sep 11, 2018 at 3:15 PM Hans ?berg via Unicode < > unicode at unicode.org> wrote: > >> >> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode < >> unicode at unicode.org> wrote: >> > >> > On Tue, 11 Sep 2018 21:10:03 +0200 >> > Hans ?berg via Unicode wrote: >> > >> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using >> >> LaTeX files with sections in different Cyrillic and Latin encodings, >> >> changing the editor encoding while typing. >> > >> > Rather like some of the old Unicode list archives, which are just >> > concatenations of a month's emails, with all sorts of 8-bit encodings >> > and stretches of base64. >> >> It might be useful to represent non-UTF-8 bytes as Unicode code points. >> One way might be to use a codepoint to indicate high bit set followed by >> the byte value with its high bit set to 0, that is, truncated into the >> ASCII range. For example, U+0080 looks like it is not in use, though I >> could not verify this. >> >> > it's used for character 0x400. 0xD0 0x80 or 0x8000 0xE8 0x80 0x80 > (I'm probably off a bit in the leading byte) > UTF-8 can represent from 0 to 0x200000 every value; (which is all defined > codepoints) early varients can support up to U+7FFFFFFF... > and there's enough bits to carry the pattern forward to support 36 bits or > 42 bits... (the last one breaking the standard a bit by allowing a byte > wihout one bit off... 0xFF would be the leadin) > > 0xF8-FF are unused byte values; but those can all be encoded into utf-8. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Sep 11 19:02:44 2018 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Wed, 12 Sep 2018 00:02:44 +0000 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: <20180912002642.5e3c64a8@JRWUBU2> References: <20180721020131.4b22887b@JRWUBU2> <20180721085026.6aa07876@JRWUBU2> <20180912002642.5e3c64a8@JRWUBU2> Message-ID: On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE is a different matter. We need to work through the solution for Tai Tham. 
I'm opposed to a generic and broad relaxation of virama constraints in USE as that would have impact on many scripts that currently have no requirement for virama after vowels. I'm not opposed to a new Indic Syllabic Category that has virama-like features and is allowed to follow a vowel. If we establish such a property for Tai Tham, we can consider on a case-by-case basis if any virama characters would be better served by the new property?including Brahmi. Cheers, Andrew -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Tuesday, September 11, 2018 4:27 PM To: unicode at unicode.org Subject: Re: Tamil Brahmi Short Mid Vowels On Wed, 29 Aug 2018 21:42:57 +0000 Andrew Glass via Unicode wrote: > Thank you Richard and Shriramana for bringing up this interesting > problem. > > I agree we need to fix this. I don?t want to fix this with a font hack > or change to USE cluster rules or properties. I think the right place > to fix this is in the encoding. This might be either a new character > for Tamil Brahmi Pu??i ? as Shriramana has proposed > (L2/12-226 2F%2Fwww.unicode.org%2FL2%2FL2012%2F12226-brahmi-two-tamil-char.pdf&am > p;data=02%7C01%7CAndrew.Glass%40microsoft.com%7Cc8b7042add6043b2d79608 > d6183f443b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63672305734730 > 4813&sdata=raIc6m1AqKNg8WMpAployLZpkk9BthumjMx%2BPUlFVNE%3D&re > served=0>) ? or separate characters for Tamil Brahmi Short E and Tamil > Brahmi Short O in independent and dependent forms (4 characters > total). I?m inclined to think that a visible virama, Tamil Brahmi > Pu??i, is the right approach. While this would work, please remember that refusing to allow a virama after a vowel also makes USE inappropriate for Khmer and Tai Tham, which use H+consonant rather than consonant+H for subscript final consonants. Richard. From unicode at unicode.org Tue Sep 11 21:34:21 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Wed, 12 Sep 2018 05:34:21 +0300 Subject: Unicode String Models In-Reply-To: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> (message from Hans =?utf-8?Q?=C3=85berg?= via Unicode on Wed, 12 Sep 2018 00:13:52 +0200) References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> Message-ID: <83bm93mok2.fsf@gnu.org> > Date: Wed, 12 Sep 2018 00:13:52 +0200 > Cc: unicode at unicode.org > From: Hans ?berg via Unicode > > It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this. You must use a codepoint that is not defined by Unicode, and never will. That is what Emacs does: it extends the Unicode codepoint space beyond 0x10FFFF. From unicode at unicode.org Tue Sep 11 21:47:06 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 11 Sep 2018 19:47:06 -0700 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: References: <20180721020131.4b22887b@JRWUBU2> <20180721085026.6aa07876@JRWUBU2> <20180912002642.5e3c64a8@JRWUBU2> Message-ID: <991013af-87cf-1dee-c7ee-10b6a58b4422@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Sep 12 00:38:21 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Wed, 12 Sep 2018 08:38:21 +0300 Subject: Unicode String Models In-Reply-To: <83va7cmgn4.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> Message-ID: On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii wrote: > > > Date: Tue, 11 Sep 2018 13:12:40 +0300 > > From: Henri Sivonen via Unicode > > > > * I suggest splitting the "UTF-8 model" into three substantially > > different models: > > > > 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No > > UTF-8-related operations are performed when ingesting byte-oriented > > data. Byte buffers and text buffers are type-wise ambiguous. Only > > iterating over byte data by code point gives the data the UTF-8 > > interpretation. Unless the data is cleaned up as a side effect of such > > iteration, malformed sequences in input survive into output. > > > > 2) UTF-8 without full trust in ability to retain validity (the model > > of the UTF-8-using C++ parts of Gecko; I believe this to be the most > > common UTF-8 model for C and C++, but I don't have evidence to back > > this up): When data is ingested with text semantics, it is converted > > to UTF-8. For data that's supposed to already be in UTF-8, this means > > replacing malformed sequences with the REPLACEMENT CHARACTER, so the > > data is valid UTF-8 right after input. However, iteration by code > > point doesn't trust ability of other code to retain UTF-8 validity > > perfectly and has "else" branches in order not to blow up if invalid > > UTF-8 creeps into the system. > > > > 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers > > have a different type in the type system than byte buffers. To go from > > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data > > has been tagged as valid UTF-8, the validity is trusted completely so > > that iteration by code point does not have "else" branches for > > malformed sequences. If data that the type system indicates to be > > valid UTF-8 wasn't actually valid, it would be nasal demon time. The > > language has a default "safe" side and an opt-in "unsafe" side. The > > unsafe side is for performing low-level operations in a way where the > > responsibility of upholding invariants is moved from the compiler to > > the programmer. It's impossible to violate the UTF-8 validity > > invariant using the safe part of the language. > > There's another model, the one used by Emacs. AFAIU, it is different > from all the 3 you describe above. In Emacs, each raw byte belonging > to a byte sequence which is invalid under UTF-8 is represented as a > special multibyte sequence. IOW, Emacs's internal representation > extends UTF-8 with multibyte sequences it uses to represent raw bytes. > This allows mixing stray bytes and valid text in the same buffer, > without risking lossy conversions (such as those one gets under model > 2 above). I think extensions of UTF-8 that expand the value space beyond Unicode scalar values and the problems these extensions are designed to solve is a worthwhile topic to cover, but I think it's not the same topic as in the document but a slightly adjacent topic. On that topic, these two are relevant: https://simonsapin.github.io/wtf-8/ https://github.com/kennytm/omgwtf8 The former is used in the Rust standard library in order to provide a Unix-like view to Windows file paths in a way that can represent all Windows file paths. 
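As a rough illustration of how that surfaces in standard Rust (shown here only as a sketch of the design, not as part of the WTF-8 specification itself): paths are exposed as OsStr/Path rather than str, and the caller chooses between a strict view and a lossy one:

    use std::path::Path;

    // A path need not be valid Unicode: on Unix it is raw bytes, on Windows
    // it is potentially ill-formed UTF-16 (stored internally via WTF-8).
    fn describe(path: &Path) {
        match path.to_str() {
            // The path happens to be valid Unicode, so a &str view exists.
            Some(s) => println!("valid Unicode path: {s}"),
            // It does not; to_string_lossy() substitutes U+FFFD for the
            // ill-formed parts: fine for display, not round-trippable.
            None => println!("non-Unicode path, shown lossily: {}", path.to_string_lossy()),
        }
    }

    fn main() {
        describe(Path::new("/tmp/café.txt"));
        #[cfg(unix)]
        {
            use std::ffi::OsStr;
            use std::os::unix::ffi::OsStrExt;
            // A Unix path containing a byte (0xE9) that is not valid UTF-8.
            describe(Path::new(OsStr::from_bytes(b"/tmp/caf\xE9.txt")));
        }
    }

The strict/lossy split keeps the "valid UTF-8" guarantee on str intact while still letting every possible path be carried around and displayed.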
File paths on Unix-like systems are sequences of bytes whose presentable-to-humans interpretation (these days) is UTF-8, but there's no guarantee of UTF-8 validity. File paths on Windows are are sequences of unsigned 16-bit numbers whose presentable-to-humans interpretation is UTF-16, but there's no guarantee of UTF-16 validity. WTF-8 can represent all Windows file paths as sequences of bytes such that the paths that are valid UTF-16 as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit representation. This allows application-visible file paths in the Rust standard library to be sequences of bytes both on Windows and non-Windows platforms and to be presentable to humans by decoding as UTF-8 in both cases. To my knowledge, the latter isn't in use yet. The implementation is tracked in https://github.com/rust-lang/rust/issues/49802 -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Wed Sep 12 03:37:00 2018 From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode) Date: Wed, 12 Sep 2018 10:37:00 +0200 Subject: Unicode String Models In-Reply-To: <83bm93mok2.fsf@gnu.org> References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> <83bm93mok2.fsf@gnu.org> Message-ID: > On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode wrote: > >> Date: Wed, 12 Sep 2018 00:13:52 +0200 >> Cc: unicode at unicode.org >> From: Hans ?berg via Unicode >> >> It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this. > > You must use a codepoint that is not defined by Unicode, and never > will. That is what Emacs does: it extends the Unicode codepoint space > beyond 0x10FFFF. The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. Then U+0080 has had some use in other encodings, but it looks like not in Unicode itself. But one could use some other value or values, and mark it for this special purpose. There are a number of other byte sequences that are in use, too, like overlong UTF-8. Also original UTF-8 can be extended to handle all 32-bit words, also those with the high bit set, then. From unicode at unicode.org Wed Sep 12 09:03:44 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Wed, 12 Sep 2018 17:03:44 +0300 Subject: Unicode String Models In-Reply-To: (message from Philippe Verdy via Unicode on Wed, 12 Sep 2018 01:41:03 +0200) References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> Message-ID: <838t46n77j.fsf@gnu.org> > Date: Wed, 12 Sep 2018 01:41:03 +0200 > Cc: unicode Unicode Discussion , > Richard Wordingham , > Hans Aberg > From: Philippe Verdy via Unicode > > The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid > UTF-8 sequences, i.e by using a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!) 
> > This is what Java does for representing U+0000 by (0xC0,0x80) in the compiled Bytecode or via the C/C++ > interface for JNI when converting the java string buffer into a C/C++ string terminated by a NULL byte (not part > of the Java string content itself). That special sequence however is really exposed in the Java API as a true > unsigned 16-bit code unit (char) with value 0x0000, and a valid single code point. That's more or less what Emacs does. > But both schemes (a) or (b) would be useful in editors allowing to edit arbitrary binary files as if they were > plain-text, even if they contain null bytes, or invalid UTF-8 sequences (it's up to these editors to find a way to > distinctively represent these bytes, and a way to enter/change them reliably. The experience in Emacs is that no serious text editor can decide that it doesn't support these use cases. Even if editing binary files is out of scope, there will always be text files whose encoding is unknowable and/or guessed/decided wrong, files with mixed encodings, etc. From unicode at unicode.org Thu Sep 13 00:08:19 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Thu, 13 Sep 2018 08:08:19 +0300 Subject: Unicode String Models In-Reply-To: References: <83va7cmgn4.fsf@gnu.org> <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com> <20180911224848.3aa17406@JRWUBU2> <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> <83bm93mok2.fsf@gnu.org> Message-ID: On Wed, Sep 12, 2018 at 11:37 AM Hans ?berg via Unicode wrote: > The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. Extending Unicode itself would likely create more problems that it would solve. Extending the value space of Unicode scalar values would be extremely disruptive for systems whose design is deeply committed to the current definitions of UTF-16 and UTF-8 staying unchanged. Assigning a scalar value within the current Unicode scalar value space to currently malformed bytes would have the problem of those scalar values losing information whether they came from malformed bytes or the well-formed encoding of those scalar values. It seems better to let applications that have use cases that involve representing non-Unicode values to use a special-purpose extension on their own. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Sat Sep 15 08:36:37 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 15 Sep 2018 15:36:37 +0200 Subject: Shortcuts question In-Reply-To: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> Message-ID: Le ven. 7 sept. 2018 ? 05:43, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 07/09/18 02:32 Shriramana Sharma via Unicode wrote: > > > > Hello. This may be slightly OT for this list but I'm asking it here as > it concerns computer usage with multiple scripts and i18n: > > It actually belongs on CLDR-users list. But coming from you, it shall > remain here while I?m posting a quick answer below. > > > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for > "tout" io Ctrl+A for "all"? > > No, Ctrl+A remains Ctrl+A on a French keyboard. > Yes but the location on the keyboard maps to the same as CTRL+Q on a Qwerty layout: CTRL+ASCII letter are mapped according to the layout of the letter (without pressing CTRL) on the localized keyboard. 
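A toy sketch of that distinction (everything below is hypothetical and self-contained, no real OS keyboard API): the same physical key yields different characters on different layouts, so an application can bind a shortcut either to the character or to the physical position:

    #[derive(Clone, Copy)]
    enum Layout { Qwerty, Azerty }

    // Character produced (without Ctrl) by the physical key that carries "Q"
    // on a QWERTY keyboard.
    fn char_on_qwerty_q_key(layout: Layout) -> char {
        match layout {
            Layout::Qwerty => 'q',
            Layout::Azerty => 'a', // AZERTY swaps the A and Q positions
        }
    }

    fn main() {
        // Binding "select all" to the character 'a' keeps Ctrl+A mnemonic on
        // both layouts, but the physical key pressed differs: on AZERTY it is
        // the key sitting where QWERTY has Q.
        assert_eq!(char_on_qwerty_q_key(Layout::Azerty), 'a');
        assert_eq!(char_on_qwerty_q_key(Layout::Qwerty), 'q');
    }
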
Some keyboard layouts don't have all the basic Latin letters becaues their language don't need it (e.g. it may only have one of Q or K, but no C, or it may have no W, or some letters may be holding combined diacritics or could be ligatures, but usuall the basic Latin letter is still accessible by pressing another control key or by switching the layout mode. On non Latin keyboard layouts there's much more freedom, and CTRL+A may be localized according to the main base letter assigned to the key (the position of Latin letter is not always visible). On tactile layouts you cannot guess where CTRL+Latin letter is located, actually it may be accessible very differently on a separate layout for controls, where they will be translated: the CTRL key is not necessarily present, replaced usually by a single key for input mode selection (which may be switching languages, or to emojis, or to symbols/punctuations/digits)... The problematic control keys are those like "CTRL+[" (assuming ASCII as the base layout) where "[" is not present or mapped very differently. As well "CTRL+1"..."CTRL+0" may conflict with the assignment of ASCII controls like "CTRL+[". So yes all control keys are potentially localisable to work best with the base layout anre remaining mnemonic; but the physical key position may be very different. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Sep 16 07:08:55 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 16 Sep 2018 14:08:55 +0200 (CEST) Subject: Shortcuts question In-Reply-To: References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> Message-ID: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> On 15/09/18 15:36, Philippe Verdy wrote: [?] > So yes all control keys are potentially localisable to work best with the base layout anre remaining mnemonic; > but the physical key position may be very different. An additional level of complexity is induced by ergonomics. so that most non-Latin layouts may wish to stick with QWERTY, and even ergonomic layouts in the footprints of August Dvorak rather than Shai Coleman are likely to offer variants with legacy Virtual Key mapping instead of staying in congruency with graphics optimized for text input. But again that is easier on Windows, where VKs are remapped separately, than on Linux that appears to use graphics throughout to process application shortcuts, and only modifiers can be "preserved" for further processing, no underlying letter map that AFAIU appears not to exist on Linux. However, about keyboarding, that may be technically too detailed for this List, so that I?ll step out of this thread here. Please follow up in parallel thread on CLDR-users instead. https://unicode.org/pipermail/cldr-users/2018-September/000837.html Thanks, Marcel From unicode at unicode.org Sun Sep 16 08:28:31 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 16 Sep 2018 15:28:31 +0200 Subject: Shortcuts question In-Reply-To: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> Message-ID: For games, the mnemonic meaning of keys are unlikely to be used because gamers prefer an ergonomic placement of their fingers according to the physical position for essential commands. 
But this won't apply to control keys, as these commands should be single keystrokes and pressing two keys instead of one would be unpractical and would be a disavantage when playing. That's why the four most common 4 direction keys A/D/S/W on a QWERTY layout will become Q/D/S/Z on a French AZERTY layout. Games that use logical key layouts based on QWERTY are almost unplayable if there's no interface to customize these 4 keys. So games preferably use the virtual keys instead for these commands, or will include builtin layouts adapted for AZERTY and QWERTZ-based layouts and still display the correct keycaps in the UI: games normally don't force the switch to another US layout, so they still need to use the logical layout, simply because they also need to allow users to input real text and not jsut gaming commands (for messaging, or for inputing custom players/objects created in the game itself, or to fill-in user profiles, or input a registration email or to perform online logon with the correct password), in which case they will also need to support characters entered with control keys (AltGr, Shift, Control...), or with a standard tactile panel on screen which will still display the common localized layouts. There are difficulties in games when some of their commands are mapped to something else than just basic Latin letters (including decimal digits : on a French AZERTY keyboard, the digits are composed by pressing Shift, or in ShiftLock mode (there's no CapsLock mode as this ShiftLock is also released when pressing Shift: just like on old French mechanical typewriters, pressing ShiftLock again did not release it, and this ShiftLock applied to all keys on the keyboard, including punctuation keys. On PC keyboards, ShiftLock does not apply to the numeric pad which has its separate NumLock, now largely redundant and that most users would like to disable completely each time there's a numeric pad separated from the directional pad, on these extended keyboards, NumLock is just a nuisance, notably on OS logon screen when Windows turns it off by default unless the BIOS locks it at boot time, and lot of BIOS don't do that or don't have the option to set it permanently). Le dim. 16 sept. 2018 ? 14:18, Marcel Schneider via Unicode < unicode at unicode.org> a ?crit : > On 15/09/18 15:36, Philippe Verdy wrote: > [?] > > So yes all control keys are potentially localisable to work best with > the base layout anre remaining mnemonic; > > but the physical key position may be very different. > > An additional level of complexity is induced by ergonomics. so that most > non-Latin layouts may wish to stick > with QWERTY, and even ergonomic layouts in the footprints of August Dvorak > rather than Shai Coleman are > likely to offer variants with legacy Virtual Key mapping instead of > staying in congruency with graphics optimized > for text input. But again that is easier on Windows, where VKs are > remapped separately, than on Linux that > appears to use graphics throughout to process application shortcuts, and > only modifiers can be "preserved" for > further processing, no underlying letter map that AFAIU appears not to > exist on Linux. > > However, about keyboarding, that may be technically too detailed for this > List, so that I?ll step out of this thread > here. Please follow up in parallel thread on CLDR-users instead. > > https://unicode.org/pipermail/cldr-users/2018-September/000837.html > > Thanks, > > Marcel > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sun Sep 16 22:38:28 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Mon, 17 Sep 2018 12:38:28 +0900 Subject: Shortcuts question In-Reply-To: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> Message-ID: <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp> On 2018/09/16 21:08, Marcel Schneider via Unicode wrote: > An additional level of complexity is induced by ergonomics. so that most non-Latin layouts may wish to stick > with QWERTY, and even ergonomic layouts in the footprints of August Dvorak rather than Shai Coleman are > likely to offer variants with legacy Virtual Key mapping instead of staying in congruency with graphics optimized > for text input. From my personal experience: A few years ago, installing a Dvorak keyboard (which is what I use every day for typing) didn't remap the control keys, so that Ctrl-C was still on the bottom row of the left hand, and so on. For me, it was really terrible. It may not be the same for everybody, but my experience suggests that it may be similar for some others, and that therefore such a mapping should only be voluntary, not default. Regards, Martin. From unicode at unicode.org Mon Sep 17 09:34:52 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 17 Sep 2018 16:34:52 +0200 (CEST) Subject: Shortcuts question In-Reply-To: <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp> References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp> Message-ID: <813933389.12023.1537194892598.JavaMail.www@wwinf1m12> On 17/09/18 05:38 Martin J. D?rst wrote: [quote] > > From my personal experience: A few years ago, installing a Dvorak > keyboard (which is what I use every day for typing) didn't remap the > control keys, so that Ctrl-C was still on the bottom row of the left > hand, and so on. For me, it was really terrible. > > It may not be the same for everybody, but my experience suggests that it > may be similar for some others, and that therefore such a mapping should > only be voluntary, not default. Got it, thanks! Regards, Marcel From unicode at unicode.org Mon Sep 17 09:47:57 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 17 Sep 2018 16:47:57 +0200 Subject: Shortcuts question In-Reply-To: <813933389.12023.1537194892598.JavaMail.www@wwinf1m12> References: <407936671.61.1536290828336.JavaMail.www@wwinf1m09> <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12> <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp> <813933389.12023.1537194892598.JavaMail.www@wwinf1m12> Message-ID: Note: CLDR concentrates on keyboard layout for text input. Layouts for other functions (such as copy-pasting, gaming controls) are completely different (and not necessarily bound directly to layouts for text, as they may also have their own dedicated physical keys or users can reprogram their keyboard for this; for gaming, softwares should all have a way to customize the layout according to users need, and should provide reasonnable defaults for at least the 3 base layouts: QWERTY, AZERTY and QWERTZ, but I've never seen any game whose UI was tuned for Dvorak) Le lun. 17 sept. 2018 ? 16:42, Marcel Schneider a ?crit : > On 17/09/18 05:38 Martin J. 
D?rst wrote: > [quote] > > > > From my personal experience: A few years ago, installing a Dvorak > > keyboard (which is what I use every day for typing) didn't remap the > > control keys, so that Ctrl-C was still on the bottom row of the left > > hand, and so on. For me, it was really terrible. > > > > It may not be the same for everybody, but my experience suggests that it > > may be similar for some others, and that therefore such a mapping should > > only be voluntary, not default. > > Got it, thanks! > > Regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Sep 17 14:50:05 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 17 Sep 2018 21:50:05 +0200 (CEST) Subject: Group separator migration from U+00A0 to U+202F Message-ID: <1855270252.16702.1537213805521.JavaMail.www@wwinf1m04> For people monitoring this list but not CLDR-users: To be cost-effective, the migration from the wrong U+00A0 to the correct U+202F as group separator should be synched across all locales using space instead of comma or period. SI is international and specifies narrow fixed-width no-break space as mandatory in the role of a numbers group separator. That is the place to remember that Unicode would have had such a narrow fixed-width no-break space from its very beginning on, if U+2008 PUNCTUATION SPACE had beed treated equally like its relative, U+2007 FIGURE SPACE, both being designed for legacy-style hard-typeset tabular numbers representation. We can only ask why it was not, without any hope of ever getting an authorized response on this list (see a recent thread about non-responsiveness; subscribers knowing the facts are here but don?t post anymore). So this is definitely not the place to vent about that misdesign, but it is about the way of fixing it now. After having painstakingly catched up support of some narrow fixed-width no-break space (U+202F). the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is way more cost-effective than migrating one locale this time, another locale next time, a handful locales the time after, possibly splitting them up in sublocales with different migration schedules. I really believed that now Unicode proves ready to adopt the real group separator in French, all relevant locales would be consistently pushed for correcting that value in release 34. The v34 alpha overview makes clear they are not. http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration I aimed at correcting an error in CLDR, not at making French stand out. Having many locales and sublocales stick with the wrong value makes no sense any more. https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up, then doing it for all at once. There seems to be a misunderstanding: The *locale setting* is whether to use period, comma, space, apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic. Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is **not a locale setting**, but it?s all about Unicode *design* and Unicode *implementation.* I really thought that that was clear and that there?s no need to heavily insist on the ST "French" forum. When referring to the "French thousands separator" I only meant that unlike comma- or period-using locales, the French locale uses space and that the group separator space should be the correct one. 
That did **not** mean that French should use *another* space than the other locales using space. https://unicode.org/cldr/trac/ticket/11423 Regards, Marcel From unicode at unicode.org Tue Sep 18 00:23:49 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 18 Sep 2018 07:23:49 +0200 (CEST) Subject: Group separator migration from U+00A0 to U+202F Message-ID: <1530688898.269.1537248229668.JavaMail.www@wwinf2219> > I aimed at correcting an error in CLDR, not at making French stand out. So I've to confess that I did focus on French and only applied for fr-FR, but there was a lot of work, see http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth waiting for very few vetters. Nevertheless I also cared for English (see various tickets), and also posted on CLDR-users in a belated P.S. that fr-CA hadn?t caught up the group separator correction yet: https://unicode.org/pipermail/cldr-users/2018-August/000825.html Also I?m sorry for failing to provide appropriate feedback after beta release and to post upstream messages urging to make sure all locales using space for group separator be kept in synchrony. I?m posting here with respect to people not monitoring CLDR-users Mail List. where this post is expanded. For further details and CLDR ticket link, please look up: https://unicode.org/pipermail/cldr-users/2018-September/000843.html Regards, Marcel From unicode at unicode.org Sat Sep 29 10:07:31 2018 From: unicode at unicode.org (Andrew Swaine via Unicode) Date: Sat, 29 Sep 2018 16:07:31 +0100 Subject: Shameless plug: Keyferret keyboard input system Message-ID: It seems from reading back through the archives that efficient and intuitive entry of Unicode characters is a topic that comes up from time to time. I have built a new, free, Windows-based keyboard entry system for Unicode characters that at least some of the people on this list might find interesting. This is effectively a super-Latin keyboard layout with support for the majority of: Basic Latin (ASCII), Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended-C, Latin Extended-D, Latin Extended-E, Latin Extended Additional, IPA Extensions, Phonetic Extensions, Phonetic Extensions Supplement, Combining Diacritical Marks, Combining Diacritical Marks Supplement, Letterlike Symbols, Mathematical Alphanumeric Symbols, Enclosed Alphanumerics, Arrows, Mathematical Operators Plus additional layouts selectable using CapsLock give support for: Greek, Greek Extended Cyrillic, Cyrillic Supplement, Cyrillic Extended-A, Cyrillic Extended-B Characters are selected through a context-sensitive compose tree accessed using the Right Alt (AltGr) key, with context-sensitive help in a box that pops up when RAlt is held. Rather than using dead keys, keys are context-sensitive on the previously entered characters. So for example, typing "o" followed by RAlt+/ gives ?. Longer sequences give more complex characters, e.g. RAlt+sh+ for ?. Characters are converted into Normalization form C where possible, so "a" followed by RAlt+' gives \u00e1 (?), not a\u0301. More information on www.keyferret.com if you're interested. If anyone is interested in helping make it a better system, please get in touch. Kind regards, Andrew. -------------- next part -------------- An HTML attachment was scrubbed... URL: