From unicode at unicode.org  Sat Sep  1 01:00:02 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)
In-Reply-To: <20180831081953.68476d36@spixxi>
References: <mailman.5.1535648401.24359.unicode@unicode.org>
 <201808301827.w7UIRbqF028462@unicode.org>
 <CAGa7JC2r-NV-=76Uvgk6P8uAuZwi0C+-BJvv86Z4WRMVHi8L2A@mail.gmail.com>
 <39633706.52606.1535691517838.JavaMail.www@wwinf1d37>
 <20180831081953.68476d36@spixxi>
Message-ID: <1680938489.236.1535781602870.JavaMail.www@wwinf1d31>

On 31/08/18 08:25 Marius Spix via Unicode wrote:
> 
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
> 
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.

Thanks for advice. Already I do use YAML syntaxic highlighting to display 
XCompose files, that use the colon as a separator, too.

Did you figure out how YAML would fit UCD data? It appears to heavily rely
on line breaks, that may get lost as data turns around across environments.
XML indentation is only a readability feature and irrelevant to content. The 
structure is independent of invisible characters and is stable if only graphics
are not corrupted (while it may happen that they are). Linebreaks are odd in
that they are inconsistent across OSes, because Unicode was denied the 
right to impose a unique standard in that matter. The result is mashed-up 
files, and I fear YAML might not hold out.

Like XML, YAML needs to repeat attribute names in every instance. That 
is precisely what CSV gets around of, at the expense of readability in 
plain text. Personally I could use YAML as I do use XML for lookup in
the text editor, but I?m afraid that there is no advantage over CSV with
respect to file size.

Regards,

Marcel
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
> 
[?]


From unicode at unicode.org  Sat Sep  1 02:12:12 2018
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Sat, 1 Sep 2018 09:12:12 +0200
Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)
In-Reply-To: <1680938489.236.1535781602870.JavaMail.www@wwinf1d31>
References: <mailman.5.1535648401.24359.unicode@unicode.org>
 <201808301827.w7UIRbqF028462@unicode.org>
 <CAGa7JC2r-NV-=76Uvgk6P8uAuZwi0C+-BJvv86Z4WRMVHi8L2A@mail.gmail.com>
 <39633706.52606.1535691517838.JavaMail.www@wwinf1d37>
 <20180831081953.68476d36@spixxi>
 <1680938489.236.1535781602870.JavaMail.www@wwinf1d31>
Message-ID: <20180901091212.03841b71@spixxi>

Hello Marcel,

YAML supports references, so you can refer to another character?s
properties.

Example:

repertoire: 
 char:
  -
   name_alias: 
    - [NUL,abbreviation]
    - ["NULL",control]
   cp: 0000
   na1: "NULL"
   props: &0000
     age: "1.1"
     na: ""
     JSN: ""
     gc: Cc
     ccc: 0
     dt: none
     dm: "#"
     nt: None
     nv: NaN
     bc: BN
     bpt: n
     bpb: "#"
     Bidi_M: N
     bmg: ""
     suc: "#"
     slc: "#"
     stc: "#"
     uc: "#"
     lc: "#"
     tc: "#"
     scf: "#"
     cf: "#"
     jt: U
     jg: No_Joining_Group
     ea: N
     lb: CM
     sc: Zyyy
     scx: Zyyy
     Dash: N
     WSpace: N
     Hyphen: N
     QMark: N
     Radical: N
     Ideo: N
     UIdeo: N
     IDSB: N
     IDST: N
     hst: NA
     DI: N
     ODI: N
     Alpha: N
     OAlpha: N
     Upper: N
     OUpper: N
     Lower: N
     OLower: N
     Math: N
     OMath: N
     Hex: N
     AHex: N
     NChar: N
     VS: N
     Bidi_C: N
     Join_C: N
     Gr_Base: N
     Gr_Ext: N
     OGr_Ext: N
     Gr_Link: N
     STerm: N
     Ext: N
     Term: N
     Dia: N
     Dep: N
     IDS: N
     OIDS: N
     XIDS: N
     IDC: N
     OIDC: N
     XIDC: N
     SD: N
     LOE: N
     Pat_WS: N
     Pat_Syn: N
     GCB: CN
     WB: XX
     SB: XX
     CE: N
     Comp_Ex: N
     NFC_QC: Y
     NFD_QC: Y
     NFKC_QC: Y
     NFKD_QC: Y
     XO_NFC: N
     XO_NFD: N
     XO_NFKC: N
     XO_NFKD: N
     FC_NFKC: "#"
     CI: N
     Cased: N
     CWCF: N
     CWCM: N
     CWKCF: N
     CWL: N
     CWT: N
     CWU: N
     NFKC_CF: "#"
     InSC: Other
     InPC: NA
     PCM: N
     blk: ASCII
     isc: ""

  -
   cp: 0001
   na1: "START OF HEADING"
   name_alias: 
    - [SOH,abbreviation]
    - [START OF HEADING,control]
   props: *0000


Regards,

Marius Spix


On Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
schrieb Marcel Schneider wrote:

> On 31/08/18 08:25 Marius Spix via Unicode wrote:
> > 
> > A good compromise between human readability, machine processability
> > and filesize would be using YAML.
> > 
> > Unlike JSON, YAML supports comments, anchors and references,
> > multiple documents in a file and several other features.
> 
> Thanks for advice. Already I do use YAML syntaxic highlighting to
> display XCompose files, that use the colon as a separator, too.
> 
> Did you figure out how YAML would fit UCD data? It appears to heavily
> rely on line breaks, that may get lost as data turns around across
> environments. XML indentation is only a readability feature and
> irrelevant to content. The structure is independent of invisible
> characters and is stable if only graphics are not corrupted (while it
> may happen that they are). Linebreaks are odd in that they are
> inconsistent across OSes, because Unicode was denied the right to
> impose a unique standard in that matter. The result is mashed-up
> files, and I fear YAML might not hold out.
> 
> Like XML, YAML needs to repeat attribute names in every instance.
> That is precisely what CSV gets around of, at the expense of
> readability in plain text. Personally I could use YAML as I do use
> XML for lookup in the text editor, but I?m afraid that there is no
> advantage over CSV with respect to file size.
> 
> Regards,
> 
> Marcel
> > 
> > Regards,
> > 
> > Marius Spix
> > 
> > 
> > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via
> > Unicode wrote:
> > 
> [?]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 488 bytes
Desc: Digitale Signatur von OpenPGP
URL: <http://unicode.org/pipermail/unicode/attachments/20180901/359c88e5/attachment.pgp>

From unicode at unicode.org  Sat Sep  1 06:35:32 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 1 Sep 2018 12:35:32 +0100
Subject: UCD in XML or in CSV?
In-Reply-To: <CAEZUo2d9PVwEB2eF0Vr9yzsF3MBqBS+r79JfbJXP-h=MnTP5gA@mail.gmail.com>
References: <mailman.5.1535648401.24359.unicode@unicode.org>
 <201808301827.w7UIRbqF028462@unicode.org>
 <CAGa7JC2r-NV-=76Uvgk6P8uAuZwi0C+-BJvv86Z4WRMVHi8L2A@mail.gmail.com>
 <39633706.52606.1535691517838.JavaMail.www@wwinf1d37>
 <20180831081953.68476d36@spixxi>
 <CAEZUo2d9PVwEB2eF0Vr9yzsF3MBqBS+r79JfbJXP-h=MnTP5gA@mail.gmail.com>
Message-ID: <20180901123532.011f10e6@JRWUBU2>

On Fri, 31 Aug 2018 10:36:45 +0200
Manuel Strehl via Unicode <unicode at unicode.org> wrote:

> For me it's currently much easier to have all the data in a single
> place, e.g. a large XML file, than spread over a multitude of files
> _with different ad-hoc syntaxes_.
> 
> The situation would possibly be different, though, if the UCD data
> would be split in several files of the same format. (Be it JSON, CSV,
> YAML, XML, TOML, whatever. Just be consistent.)

Most properties are stored in pretty much the same format in the UCD
files. UnicodeData.txt is the major exception; it seems to date from
when the set of properties was expected to be stable.

The big exception is set-valued properties.  PropList.txt can be viewed
as having an odd syntax for storing the set of miscellaneous Boolean
properties for which the codepoint has the value of 'true'.

Richard.

From unicode at unicode.org  Sat Sep  1 07:16:03 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 1 Sep 2018 14:16:03 +0200 (CEST)
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <20180901091212.03841b71@spixxi>
References: <mailman.5.1535648401.24359.unicode@unicode.org>
 <201808301827.w7UIRbqF028462@unicode.org>
 <CAGa7JC2r-NV-=76Uvgk6P8uAuZwi0C+-BJvv86Z4WRMVHi8L2A@mail.gmail.com>
 <39633706.52606.1535691517838.JavaMail.www@wwinf1d37>
 <20180831081953.68476d36@spixxi>
 <1680938489.236.1535781602870.JavaMail.www@wwinf1d31>
 <20180901091212.03841b71@spixxi>
Message-ID: <290536618.2898.1535804163592.JavaMail.www@wwinf1d33>

Thank you Marius for the example. Indeed I now see that YAML is a powerful means
for a file to have an intuitive readability while drastically reducing file size.

BTW what I conjectured about the role of line breaks is true for CSV too, and any file
downloaded from UCD on a semicolon separator basis becomes unusable when 
displayed straight in the built-in text editor of Windows, given Unicode uses Unix EOL.

?Still for use in spreadsheets, YAML needs to be converted to CSV, although that 
might not crash the browser as large XML does.

Regards,

Marcel

On 01/09/18 09:18 Marius Spix via Unicode wrote:
> 
> Hello Marcel,
> 
> YAML supports references, so you can refer to another character?s
> properties.
> 
> Example:
> 
> repertoire: 
> char:
> -
> name_alias: 
> - [NUL,abbreviation]
> - ["NULL",control]
> cp: 0000
> na1: "NULL"
> props: &0000
> age: "1.1"
> na: ""
> JSN: ""
> gc: Cc
> ccc: 0
> dt: none
> dm: "#"
> nt: None
> nv: NaN
> bc: BN
> bpt: n
> bpb: "#"
> Bidi_M: N
> bmg: ""
> suc: "#"
> slc: "#"
> stc: "#"
> uc: "#"
> lc: "#"
> tc: "#"
> scf: "#"
> cf: "#"
> jt: U
> jg: No_Joining_Group
> ea: N
> lb: CM
> sc: Zyyy
> scx: Zyyy
> Dash: N
> WSpace: N
> Hyphen: N
> QMark: N
> Radical: N
> Ideo: N
> UIdeo: N
> IDSB: N
> IDST: N
> hst: NA
> DI: N
> ODI: N
> Alpha: N
> OAlpha: N
> Upper: N
> OUpper: N
> Lower: N
> OLower: N
> Math: N
> OMath: N
> Hex: N
> AHex: N
> NChar: N
> VS: N
> Bidi_C: N
> Join_C: N
> Gr_Base: N
> Gr_Ext: N
> OGr_Ext: N
> Gr_Link: N
> STerm: N
> Ext: N
> Term: N
> Dia: N
> Dep: N
> IDS: N
> OIDS: N
> XIDS: N
> IDC: N
> OIDC: N
> XIDC: N
> SD: N
> LOE: N
> Pat_WS: N
> Pat_Syn: N
> GCB: CN
> WB: XX
> SB: XX
> CE: N
> Comp_Ex: N
> NFC_QC: Y
> NFD_QC: Y
> NFKC_QC: Y
> NFKD_QC: Y
> XO_NFC: N
> XO_NFD: N
> XO_NFKC: N
> XO_NFKD: N
> FC_NFKC: "#"
> CI: N
> Cased: N
> CWCF: N
> CWCM: N
> CWKCF: N
> CWL: N
> CWT: N
> CWU: N
> NFKC_CF: "#"
> InSC: Other
> InPC: NA
> PCM: N
> blk: ASCII
> isc: ""
> 
> -
> cp: 0001
> na1: "START OF HEADING"
> name_alias: 
> - [SOH,abbreviation]
> - [START OF HEADING,control]
> props: *0000
> 
> 
> 
> 
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
> schrieb Marcel Schneider wrote:
> 
[?]


From unicode at unicode.org  Sat Sep  1 08:15:56 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 1 Sep 2018 15:15:56 +0200 (CEST)
Subject: UCD in XML or in CSV? (is: Parsing UCD in XML)
In-Reply-To: <CAEZUo2d9PVwEB2eF0Vr9yzsF3MBqBS+r79JfbJXP-h=MnTP5gA@mail.gmail.com>
References: <mailman.5.1535648401.24359.unicode@unicode.org>
 <201808301827.w7UIRbqF028462@unicode.org>
 <CAGa7JC2r-NV-=76Uvgk6P8uAuZwi0C+-BJvv86Z4WRMVHi8L2A@mail.gmail.com>
 <39633706.52606.1535691517838.JavaMail.www@wwinf1d37>
 <20180831081953.68476d36@spixxi>
 <CAEZUo2d9PVwEB2eF0Vr9yzsF3MBqBS+r79JfbJXP-h=MnTP5gA@mail.gmail.com>
Message-ID: <1728293477.3292.1535807756905.JavaMail.www@wwinf1d33>

On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
> 
> To handle the UCD XML file a streaming parser like Expat is necessary.

Thanks for the tip. However for my needs, Expat looks like overkill, and I?m 
looking out for a much simpler standalone tool, just converting XML to CSV.

> 
> For codepoints.net I use that data [?]

Very good site IMO, as it compiles a lot of useful information trying to maximize 
human readability. 

Nice to have added the Adopt-a-character button, too.

Thanks,

Marcel


From unicode at unicode.org  Sat Sep  1 21:16:07 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sun, 2 Sep 2018 04:16:07 +0200 (CEST)
Subject: UCD in XML or in CSV? (is: UCD data consumption)
Message-ID: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24>

I?m not responding without thinking, as I was blamed of when I did,
but it is painful for me to dig into what Ken explained about how 
we should be consuming UCD data. I?ll now try to get some more clarity
into the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> [?]
> > 
> > Third, please remember that folks who come here complaining about the 
> > complications of parsing the UCD are a very small percentage of a very 
> > small percentage of a very small percentage of interested parties. 

OK, among avg. 700 list subscribers, relatively few are ever complaining 
about anything, let alone about this particular topic. But we should always 
keep in mind that many folks out there complaining about Unicode don?t come 
here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a 
> > secondary source (e.g. for reference via codepoints.net), or as a 
> > tertiary source (behind specialized API's, regex, etc.),

Like already suggested, ?as? should probably read ?via? in that part.

> > or as an end 
> > user (just getting behavior they expect for characters in applications). 

That is more than a simple statement about who is consuming UCD data which 
way, as you say ?should.? There seem to be assumptions that it is discouraged 
to dive into the raw data; that folks reading file headers are not doing well;
that the data should be assembled only in certain ways; and that ignorant 
people shouldn?t open the UCD cupboard to pick a file they deem useful.

If so, then it might be surprising to know that when submitting a proposal
about Bidi-mirroring mathematical symbols issues feedback
http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html
I?d started as a quasi-end-user not getting behavior I expected for characters 
in browsers, as I was spotting characters bidi-mirrored by glyph exchange, like
it is implemented in web browsers, because I wanted that end-users could 
experience bidi-mirroring as it works. Unexpectedly a number of math symbols 
did not mirror, despite many of them being even scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and 
> > write parsers for them directly should actually be able to deal with the 
> > format complexity -- and, if anything, slowing them down to make them 
> > think about the reasons for the format complexity might be a good thing, 

I can see one main reason for the format complexity, and that is that data 
from various propeties don?t necessarily telescope the same way to make for 
small files. The complexity of UCD would then mainly be self-induced by the
way of packing data into one small file per property rather than adding the
value to each relevant code point in one large list as is UnicodeData.txt.

While I?m now taking the time to write this up because I?m committed to 
process that information, we can think of many many people who don?t like 
to be slowed down trying to find out why Unicode changed UCD design while 
following the original idea of a large CSV list would be straightforward, 
eventually by setting up a new one if the first one got stuck. What I can 
figure out is that while a new property was added, that particular property 
was always thought of as being the last one. 
(At some point the many files were then dumped into the known XML files.)

If UCD is to be made of small files, it is necessarily complex, and the 
conclusion is that there should be another large CSV grid to make things 
simple again and lightweight alike so far as they can.

> > as it tends to put the lie to the easy initial assumption that the UCD 
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence when taking off ?simple?? It appears to me as not 
being a lie then. One attribute comes to mind that is so complex that its 
design even changed over time, despite Unicode?s commitment to stability.
The Bidi_Mirrored_Glyph property was originally designed to include ?best-fit?
pairs for least-worse display in applications not supporting RTL glyphs 
(ie without OpenType support), with legibility of math formulae in mind.
Later (probably due to a poorly written OpenType spec), no more best-fit pairs 
were added to BidiMirroring.txt, as if OpenType implementers weren?t to remove
the best-fit pairs anyway prior to using the file (while the spec says to use 
it as-is). That led then to the display problem pointed above.

I?m sparing the particular problem related to 3 pairs of symbols with tilde, 
nor to the missing Bidi_Mirroring_Type property, given UTC was not interested. 

So you can understand that I?m not unaware of the complexity of UCD. Though
I don?t think that this could be an argument for not publishing a medium-size 
CSV file with scalar values listed as in UnicodeData.txt.

> 
> [?]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

Ie not every spreadsheet software seems to do the job as I need it.

Regards,

Marcel


From unicode at unicode.org  Mon Sep  3 01:24:06 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Mon, 03 Sep 2018 08:24:06 +0200
Subject: UCD in XML or in CSV? (is: UCD data consumption)
In-Reply-To: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24> (Marcel
 Schneider via Unicode's message of "Sun, 2 Sep 2018 04:16:07 +0200
 (CEST)")
References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24>
Message-ID: <86ftyr3xq1.fsf@mimuw.edu.pl>

On Sun, Sep 02 2018 at  4:16 +0200, 

[...]

> So you can understand that I?m not unaware of the complexity of UCD. Though
> I don?t think that this could be an argument for not publishing a medium-size 
> CSV file with scalar values listed as in UnicodeData.txt.

For a non-programmer like me CVS is much more convenient form than XML -
I can use it not only with a spreadsheet, but also import directly into
a database and analyse with various queries. XML is politically correct,
but practically almost unusable without a specialised parser.

On Sat, Sep 01 2018 at 15:15 +0200, unicode at unicode.org writes:
> On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
>> 
>> To handle the UCD XML file a streaming parser like Expat is necessary.
>
> Thanks for the tip. However for my needs, Expat looks like overkill, and I?m 
> looking out for a much simpler standalone tool, just converting XML to CSV.

I think CSV and XML can coexist peacefully, we just need an open source
round-trip converter.

Last but not least, let me remind that the thread was started by a
question what is the most convenient way to describe the properties of
PUA characters.

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Mon Sep  3 02:45:38 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Mon, 03 Sep 2018 09:45:38 +0200
Subject: CLDR
In-Reply-To: <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> (Marcel
 Schneider via Unicode's message of "Fri, 31 Aug 2018 12:17:41 +0200
 (CEST)")
References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25>
 <86sh2v3ye5.fsf@mimuw.edu.pl>
 <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37>
Message-ID: <86wos32fdp.fsf@mimuw.edu.pl>

On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote:
> The XML files in these folders:
>
> https://unicode.org/repos/cldr/tags/latest/common/

Thanks for the link.

In the meantime I rediscovered Locale Explorer

http://demo.icu-project.org/icu-bin/locexp

which I used some time ago.

On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote:
> On 31/08/18 07:27 Janusz S. Bie? via Unicode wrote:
> [?]
>> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
>> > one couldn?t simply pop them into XML or whatever, as the result would be 
>> > disappointing and call for completion in the aftermath. Yet another task 
>> > competing with CLDR survey.
>> 
>> Please elaborate. It's not clear for me what do you mean.
>
> These comments are designed for the Code Charts and as such must not be
> disproportionate in exhaustivity. Eg we have lists of related languages ending 
> in an ellipsis.

Looks like we have different comments in mind.

[...]

>> > Reviewing CLDR data is IMO top priority.
>> > There are many flaws to be fixed in many languages including in English.
>> > A lot of useful digest charts are extracted from XML there,
>> 
>> Which XML? where?
>
> More precisely it is LDML, the CLDR-specific XML.
> What I called ?digest charts? are the charts found here:
>
> http://www.unicode.org/cldr/charts/34/
>
> The access is via this page:
>
> http://cldr.unicode.org/index/downloads
>
> where the charts are in the Charts column, while the raw data is under
> SVN Tag.

Thanks for the link. I found especially interesting the Polish section
in

https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html

Looks like a complete rubbish, e.g.

plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
Pomorze) transliterated into the Greek alphabet (and something in
Arabic).

The header of the page says "The coverage depends on the availability of
data in wikidata for these names" but I was unable to find this rubbish
in Wikidata (but I was not looking very hard).

>
>> 
>> > and we really 
>> > need to go through the data and correct the many many errors, please.

But who is the right person or institution to do it?

>> 
>> Some time ago I tried to have a close look at the Polish locale and
>> found the CLDR site prohibitively confusing.
>
> I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive 
> for the access to the XML data (except when knowing about SubVersioN).
> Polish data is found here:
>
> https://www.unicode.org/cldr/charts/34/summary/pl.html
>
> The access is via the top of the "Summary" index page (showing root data):
>
> https://www.unicode.org/cldr/charts/34/summary/root.html
>
> You may wish to particularly check the By-Type charts:
>
> https://www.unicode.org/cldr/charts/34/by_type/index.html
>
> Here I?d suggest to first focus on alphabetic information and on punctuation.
>
> https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html
>
> Under Latin (table caption, without anchor) we find out what punctuation 
> Polish has compared to other locales using the same script.
> The exact character appears when hovering the header row.
> Eg U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
> an error in almost every locale using hyphen. TC is about to correct that.
>
> Further you will see that while Polish is using apostrophe
> https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
> CLDR does not have the correct apostrophe for Polish, as opposed eg to French.

I understand that by "the correct apostrophe" you mean U+2019 RIGHT
SINGLE QUOTATION MARK.

> You may wish to note that from now on, both U+0027 APOSTROPHE and 
> U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
> preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
> U+201D that are already found in CLDR pl.

The situation seems more complicated because the chart

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

contains different list of punctuation characters than

https://www.unicode.org/cldr/charts/34/summary/pl.html.

I guess the latter is the primary one, and it contains U+2019 RIGHT
SINGLE QUOTATION MARK (and U+0x2018 LEFT SINGLE QUOTATION MARK, too).
               
>
> Note however that according to the information provided by English Wikipedia:
> https://en.wikipedia.org/wiki/Quotation_mark#Polish
> Polish also uses single quotes, that by contrast are still missing in CLDR.

You are right, but who cares? Looks like this has no practical
importance. Nobody complains about the wrong use of quotation marks in
Polish by Word or OpenOffice, so looks like the software doesn't use
this information. So this is rather a matter of aesthetics...

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Mon Sep  3 04:03:31 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 3 Sep 2018 01:03:31 -0800
Subject: CLDR
In-Reply-To: <86wos32fdp.fsf@mimuw.edu.pl>
References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25>
 <86sh2v3ye5.fsf@mimuw.edu.pl>
 <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37>
 <86wos32fdp.fsf@mimuw.edu.pl>
Message-ID: <CABPY6Z1k5UQijcFbdiAOjJaZiLo3Vpf2yHgX0NBFsMKbB_GOvg@mail.gmail.com>

Janusz S. Bie? wrote,

> Thanks for the link. I found especially interesting the Polish section
> in
>
> https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html
>
> Looks like a complete rubbish, e.g.
>
> plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
> Pomorze) transliterated into the Greek alphabet (and something in
> Arabic).

And nothing in Armenian, Albanian, or Pashto.

If you click on the link at "plpm", it takes you right back to that
same entry on that same page, which doesn't seem very helpful.

> The header of the page says "The coverage depends on the availability of
> data in wikidata for these names" but I was unable to find this rubbish
> in Wikidata (but I was not looking very hard).

I tried both "plpm" and "?????????" in the Wikidata search box.  On
the latter, there were some pages which looked to translate place
names into various languages, for both Germany and Poland.  I couldn't
find the exact page, but it would be something like this page:

https://www.wikidata.org/wiki/Q54180

(Clicking "All Entered Languages" on that page gives a lengthy list.)

>>> > and we really
>>> > need to go through the data and correct the many many errors, please.
>
> But who is the right person or institution to do it?

If the CLDR information is driven by Wikidata as the file header
indicates, then Wikidata.


From unicode at unicode.org  Mon Sep  3 04:37:12 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Mon, 3 Sep 2018 01:37:12 -0800
Subject: CLDR
In-Reply-To: <CABPY6Z1k5UQijcFbdiAOjJaZiLo3Vpf2yHgX0NBFsMKbB_GOvg@mail.gmail.com>
References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25>
 <86sh2v3ye5.fsf@mimuw.edu.pl>
 <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37>
 <86wos32fdp.fsf@mimuw.edu.pl>
 <CABPY6Z1k5UQijcFbdiAOjJaZiLo3Vpf2yHgX0NBFsMKbB_GOvg@mail.gmail.com>
Message-ID: <CABPY6Z0fdaZzGM20wd61AGhimY6fBss+5XnLGaDU9t3aQWjmEg@mail.gmail.com>

I wrote,

> ...  I couldn't find the exact page, but it would
> be something like this page:
>
> https://www.wikidata.org/wiki/Q54180

Hmmm, maybe that is the exact page.  That page does show the ISO
3166-2 code as "PL-PM".

So, if that's the correct page and the English is given as "Pomeranian
Voivodeship", why is CLDR giving the English as "Federal Capital
Territory"?

The Wikidata page was last edited/updated on 2018-08-25.  The CLDR
page doesn't include last updated information.  Perhaps it hasn't been
updated in a while.

From unicode at unicode.org  Mon Sep  3 05:03:36 2018
From: unicode at unicode.org (Arthur Reutenauer via Unicode)
Date: Mon, 3 Sep 2018 12:03:36 +0200
Subject: CLDR
In-Reply-To: <201809030954.w839sXrQ031569@nef2.ens.fr>
References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25>
 <86sh2v3ye5.fsf@mimuw.edu.pl>
 <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37>
 <201809030954.w839sXrQ031569@nef2.ens.fr>
Message-ID: <20180903100336.GA4175881@phare.normalesup.org>

On Mon, Sep 03, 2018 at 09:45:38AM +0200, Janusz S. Bie? via Unicode wrote:
> plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
> Pomorze) transliterated into the Greek alphabet (and something in
> Arabic).

  This must be a mistake (a strange copy-paste side effect?).  Federal
Capital Territory is a subdivision of Nigeria.  The Persian name seems
correct.

	Best,

		Arthur

From unicode at unicode.org  Mon Sep  3 05:07:39 2018
From: unicode at unicode.org (Adam Borowski via Unicode)
Date: Mon, 3 Sep 2018 12:07:39 +0200
Subject: UCD in XML or in CSV? (is: UCD data consumption)
In-Reply-To: <86ftyr3xq1.fsf@mimuw.edu.pl>
References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24>
 <86ftyr3xq1.fsf@mimuw.edu.pl>
Message-ID: <20180903100739.pe5w23ybcpvw5rrx@angband.pl>

On Mon, Sep 03, 2018 at 08:24:06AM +0200, Janusz S. Bie? via Unicode wrote:
> For a non-programmer like me CVS is much more convenient form than XML -
> I can use it not only with a spreadsheet, but also import directly into
> a database and analyse with various queries. XML is politically correct,
> but practically almost unusable without a specialised parser.

And for a programmer, XML is outright insane.  You need a complex library to
do so, and those fail KISS so badly that you have a CVE roughly yearly.
On the other hand, writing a parser for current headerless ;-separated data
completely from scratch is just:

cut -d';' -f 1,6 </usr/share/unicode/UnicodeData.txt
or:
(split/;/)[0,5]

JSON is somewhat better, but still needs drastically more effort.
CSV (especially with no escapes) is trivial to handle.


????!
-- 
??????? What Would Jesus Do, MUD/MMORPG edition:
??????? ? multiplay with an admin char to benefit your mortal [Mt3:16-17]
??????? ? abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
??????? ? use glitches to walk on water [Mt14:25-26]

From unicode at unicode.org  Mon Sep  3 05:28:18 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 3 Sep 2018 12:28:18 +0200 (CEST)
Subject: CLDR
Message-ID: <970715049.6410.1535970498061.JavaMail.www@wwinf1m18>

On 03/09/18 09:53 Janusz S. Bie? via Unicode wrote:
> 
> On Fri, Aug 31 2018 at 10:27 +0200, Manuel Strehl via Unicode wrote:
> > The XML files in these folders:
> >
> > https://unicode.org/repos/cldr/tags/latest/common/
> 
> Thanks for the link.
> 
> In the meantime I rediscovered Locale Explorer
> 
> http://demo.icu-project.org/icu-bin/locexp
> 
> which I used some time ago.

Nice. Actually based on CLDR v31.0.1.

> 
> On Fri, Aug 31 2018 at 12:17 +0200, Marcel Schneider via Unicode wrote:
> > On 31/08/18 07:27 Janusz S. Bie? via Unicode wrote:
> > [?]
> >> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> >> > one couldn?t simply pop them into XML or whatever, as the result would be 
> >> > disappointing and call for completion in the aftermath. Yet another task 
> >> > competing with CLDR survey.
> >> 
> >> Please elaborate. It's not clear for me what do you mean.
> >
> > These comments are designed for the Code Charts and as such must not be
> > disproportionate in exhaustivity. Eg we have lists of related languages ending 
> > in an ellipsis.
> 
> Looks like we have different comments in mind.

Then I?m sorry to be off-topic.

[?]
> >> > and we really 
> >> > need to go through the data and correct the many many errors, please.
> 
> But who is the right person or institution to do it?

Software vendors are committed to care for the data, and may delegate survey 
to service providers specialized in localization. Then I think that public language 
offices should be among the reviewers. Beyond, and especially by lack of the
latter, anybody is welcome to contribute as a guest. (Guest votes are 1 and don?t
add one to another.) That is consistent with the fact that Unicode relies on 
volunteers, too.

I?m volunteering to personally welcome you to contribute to CLDR.

[?]
> > Further you will see that while Polish is using apostrophe
> > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
> > CLDR does not have the correct apostrophe for Polish, as opposed eg to French.
> 
> I understand that by "the correct apostrophe" you mean U+2019 RIGHT
> SINGLE QUOTATION MARK.

Yes.

> 
> > You may wish to note that from now on, both U+0027 APOSTROPHE and 
> > U+0022 QUOTATION MARK are ruled out in almost all locales, given the 
> > preferred characters in publishing are U+2019 and, for Polish, the U+201E and 
> > U+201D that are already found in CLDR pl.
> 
> The situation seems more complicated because the chart
> 
> https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html
> 
> contains different list of punctuation characters than
> 
> https://www.unicode.org/cldr/charts/34/summary/pl.html.
> 
> I guess the latter is the primary one, and it contains U+2019 RIGHT
> SINGLE QUOTATION MARK (and U+0x2018 LEFT SINGLE QUOTATION MARK, too).

It?s a bit confusing because there is a column for English and a column for Polish.
The characters you retrieved are actually in the English column, while Polish has 
consistently with By-Type, these quotation marks:
' " ? ? ? ? 
Hence the set is incomplete.

> 
> >
> > Note however that according to the information provided by English Wikipedia:
> > https://en.wikipedia.org/wiki/Quotation_mark#Polish
> > Polish also uses single quotes, that by contrast are still missing in CLDR.
> 
> You are right, but who cares? Looks like this has no practical
> importance. Nobody complains about the wrong use of quotation marks in
> Polish by Word or OpenOffice, so looks like the software doesn't use
> this information. So this is rather a matter of aesthetics...

I?ve come to the position that to let a word processor ?use? quotation marks
is to miss the point. Quotation marks are definitely used by the user typing
in his or her text, and are expected to be on the keyboard layout he or she
is using. So-called smart quotes guessed algorithmically from ASCII simple 
and double quote are but a hazardous workaround when not installing the 
appropriate keyboard layout. At least that is my position :)

Best regards,

Marcel


From unicode at unicode.org  Mon Sep  3 04:26:38 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 3 Sep 2018 10:26:38 +0100 (BST)
Subject: Encoding character information for characters of a Private Use Area
 use (from Re: UCD in XML or in CSV?)
In-Reply-To: <86ftyr3xq1.fsf@mimuw.edu.pl>
References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24>
 <86ftyr3xq1.fsf@mimuw.edu.pl>
Message-ID: <1843880.12150.1535966798390.JavaMail.defaultUser@defaultHost>

Janusz S. Bien wrote:

> Last but not least, let me remind that the thread was started by a question what is the most convenient way to describe the properties of PUA characters.

>From what I have learned during the time period of the discussion it seems to me that using JSON would be a good idea.

http://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0144.html

http://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0145.html

It appears that all that is needed is to define an object named PUAINFO and then put the name PUAINFO inside quotation marks and then define the object in whatever JSON way one chooses to do it.

For example, one could have an array of values, one or more of which could be a string listing a PUA (Private Use Area) code point or a range of PUA code points. For examples, "$E001" and "$E100..$E17F", together with strings containing other information.

One such string, maybe the first after the colon, whether or not within an array, could be a description of the particular Private Use Area use that the particular file supports. 

Using JSON would mean that the format would be independent of any particular programming language and could be designed to be straightforwardly read by humans as well.

>From reading the documents I think that the structure may start as follows, though I am not congruently sure of the matter at this time.

{"PUAINFO":

There are then various ways to proceed, such as for example having everything in one array, or for example having many names each of which has data.

Having many names each of which has data may well look more elegant in a print out and be more easily read by humans, yet having everything in one array in a known order may mean that getting the format implemented in software applications might be easier and thus more likely to happen.  

Whichever way it is done, then provided it is done rigorously, a format which becomes implemented widely in applications would be a contribution of lasting value.  

William Overington

Monday 3 September 2018


From unicode at unicode.org  Mon Sep  3 14:40:01 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 3 Sep 2018 21:40:01 +0200
Subject: UCD in XML or in CSV? (is: UCD data consumption)
In-Reply-To: <20180903100739.pe5w23ybcpvw5rrx@angband.pl>
References: <1994793107.5.1535854568011.JavaMail.www@wwinf1m24>
 <86ftyr3xq1.fsf@mimuw.edu.pl> <20180903100739.pe5w23ybcpvw5rrx@angband.pl>
Message-ID: <CAGa7JC1fCOwh35mX_awiAzENh8UFoNm+jWp-Ch-QM7RZOcw7YQ@mail.gmail.com>

But CSV is only fine for pure tabular data, and the UCD or CDLR data is has
a more complex structure than a simple 2D table. In addition, the schema is
evolving, with new kind of datas added everytime; you cannot keep that
compatibility by adding more empty columns to a single table; adding new
semicolons or other separators to a CSV makes the formaty much less
readable, and in fact it will then contain lot of redundancy.

Like traditional relational databases, these project need a schema and
structure. But if we have to use a RDBMS API, we'll loose the possibility
for using various tools. So these Unicode databases are using collections
of tables and in some cases you need to split a value into multiple ones
with different scoping rules: for that job JSON or XML is fine. But nothing
prevents you to load the existing UCD/CLDR database files into a relational
database and expose the data in different views. But most applications are
in fact built by first laoding this data with a parser specific to the
application, that will convert it to its application-defined schema, and
data can be recompiled in a new form that will then be exposed by an
application API.

XML if then fine ! It has no cost for final users that just use the
generated applications. It's only up to application compiler projects to
parse the data, generate their code, and integrate the data to their API
(there are more useful tools than just "grep'ing the UCD/CLDR datafiles.
Also the UCD and CLDR files are checked by other automated tools that
already parse them, and load them to perform consistency checks and
generate multiple presentations: the important ICU project is built and
maintained for that, it has all the tools needed, plus a reduced API that
can be used directly by final applications. Even some UCD files are now
automatically generated from other source files, they contain automatically
generated reports, Only the initial main UCD file has kept its initial pure
CSV form: it was no longer possible to continue extending this single file,
but compatibility has been preserved and it's a good thing. All others
contain comment lines, and basic report lines.


Le lun. 3 sept. 2018 ? 12:16, Adam Borowski via Unicode <unicode at unicode.org>
a ?crit :

> On Mon, Sep 03, 2018 at 08:24:06AM +0200, Janusz S. Bie? via Unicode wrote:
> > For a non-programmer like me CVS is much more convenient form than XML -
> > I can use it not only with a spreadsheet, but also import directly into
> > a database and analyse with various queries. XML is politically correct,
> > but practically almost unusable without a specialised parser.
>
> And for a programmer, XML is outright insane.  You need a complex library
> to
> do so, and those fail KISS so badly that you have a CVE roughly yearly.
> On the other hand, writing a parser for current headerless ;-separated data
> completely from scratch is just:
>
> cut -d';' -f 1,6 </usr/share/unicode/UnicodeData.txt
> or:
> (split/;/)[0,5]
>
> JSON is somewhat better, but still needs drastically more effort.
> CSV (especially with no escapes) is trivial to handle.
>
>
> ????!
> --
> ??????? What Would Jesus Do, MUD/MMORPG edition:
> ??????? ? multiplay with an admin char to benefit your mortal [Mt3:16-17]
> ??????? ? abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
> ??????? ? use glitches to walk on water [Mt14:25-26]
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180903/eddde9f3/attachment.html>

From unicode at unicode.org  Tue Sep  4 04:02:50 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Tue, 4 Sep 2018 01:02:50 -0800
Subject: CLDR
In-Reply-To: <86zhwxy8iq.fsf@mimuw.edu.pl>
References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25>
 <86sh2v3ye5.fsf@mimuw.edu.pl>
 <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37>
 <86wos32fdp.fsf@mimuw.edu.pl>
 <CABPY6Z1k5UQijcFbdiAOjJaZiLo3Vpf2yHgX0NBFsMKbB_GOvg@mail.gmail.com>
 <86zhwxy8iq.fsf@mimuw.edu.pl>
Message-ID: <CABPY6Z27fY3dfrj9sKCg6_uKBpVej-H_9KyhNwRMZPUYyChC6A@mail.gmail.com>

(This is the response from Janusz S. Bie? which was sent to the public list.)

On Mon, Sep 03 2018 at  1:03 -0800, James Kass wrote:

> Janusz S. Bie? wrote,
>
>> Thanks for the link. I found especially interesting the Polish section
>> in
>>
>> https://www.unicode.org/cldr/charts/34/subdivisionNames/other_indo_european.html
>>
>> Looks like a complete rubbish, e.g.
>>
>> plmp = Federal Capital Territory(???) = Pomerania (Latin/English name of
>> Pomorze) transliterated into the Greek alphabet (and something in
>> Arabic).
>
> And nothing in Armenian, Albanian, or Pashto.
>
> If you click on the link at "plpm", it takes you right back to that
> same entry on that same page, which doesn't seem very helpful.
>
>> The header of the page says "The coverage depends on the availability of
>> data in wikidata for these names" but I was unable to find this rubbish
>> in Wikidata (but I was not looking very hard).
>
> I tried both "plpm" and "?????????" in the Wikidata search box.  On
> the latter, there were some pages which looked to translate place
> names into various languages, for both Germany and Poland.  I couldn't
> find the exact page, but it would be something like this page:
>
> https://www.wikidata.org/wiki/Q54180
>
> (Clicking "All Entered Languages" on that page gives a lengthy list.)

Thanks! Most data about Poland at

https://www.wikidata.org/wiki/Q36

seem to make sense, but I don't think anybody is using abbreviation like
 "plpm" (for Pomorze/Pomerania).

>
>>>> > and we really
>>>> > need to go through the data and correct the many many errors, please.
>>
>> But who is the right person or institution to do it?
>
> If the CLDR information is driven by Wikidata as the file header
> indicates, then Wikidata.

I hope not all CLDR data are driven by Wikidata...

On Mon, Sep 03 2018 at 12:28 +0200, Marcel Schneider wrote:

> On 03/09/18 09:53 Janusz S. Bie? via Unicode wrote:
 [...]
>> > These comments are designed for the Code Charts and as such must not be
>> > disproportionate in exhaustivity. Eg we have lists of related languages ending
>> > in an ellipsis.
>>
>> Looks like we have different comments in mind.
>
> Then I?m sorry to be off-topic.

Let's say off the original topic. My primary concern is to preserve
 somehow such comments as e.g. the one on the bottom of page 14 of

https://folk.uib.no/hnooh/mufi/specs/MUFI-CodeChart-4-0.pdf

>
> [?]
>> >> > and we really
>> >> > need to go through the data and correct the many many errors, please.
>>
>> But who is the right person or institution to do it?
>
> Software vendors are committed to care for the data, and may delegate survey
> to service providers specialized in localization. Then I think that public language
> offices should be among the reviewers. Beyond, and especially by lack of the
> latter, anybody is welcome to contribute as a guest. (Guest votes are 1 and don?t
> add one to another.) That is consistent with the fact that Unicode relies on
> volunteers, too.
>
> I?m volunteering to personally welcome you to contribute to CLDR.

Thanks. The interesting question is who is/was already contributing from
 Poland or about Polish language. I vaguely remember a post with this
 information, but at that time I was not interested enough to take a
 note.

>
> [?]
>> > Further you will see that while Polish is using apostrophe
>> > https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
>> > CLDR does not have the correct apostrophe for Polish, as opposed eg to French.
>>
>> I understand that by "the correct apostrophe" you mean U+2019 RIGHT
>> SINGLE QUOTATION MARK.
>
> Yes.
>
>>
>> > You may wish to note that from now on, both U+0027 APOSTROPHE and
>> > U+0022 QUOTATION MARK are ruled out in almost all locales, given the
>> > preferred characters in publishing are U+2019 and, for Polish, the U+201E and
>> > U+201D that are already found in CLDR pl.
[...]
> It?s a bit confusing because there is a column for English and a column for Polish.
> The characters you retrieved are actually in the English column, while Polish has
> consistently with By-Type, these quotation marks:
> ' " ? ? ? ?
> Hence the set is incomplete.

You are right, thanks. But was is the practical importance of it?
 I noticed that sometimes in Emacs 'forward-word" behaves strangely on a
text with unusual characters, but had no motivation to investigate how
 this is related to the current locale.

>>
>> >
>> > Note however that according to the information provided by English Wikipedia:
>> > https://en.wikipedia.org/wiki/Quotation_mark#Polish
>> > Polish also uses single quotes, that by contrast are still missing in CLDR.
>>
>> You are right, but who cares? Looks like this has no practical
>> importance. Nobody complains about the wrong use of quotation marks in
>> Polish by Word or OpenOffice, so looks like the software doesn't use
>> this information. So this is rather a matter of aesthetics...
>
> I?ve come to the position that to let a word processor ?use? quotation marks
> is to miss the point. Quotation marks are definitely used by the user typing
> in his or her text, and are expected to be on the keyboard layout he or she
> is using. So-called smart quotes guessed algorithmically from ASCII simple
> and double quote are but a hazardous workaround when not installing the
> appropriate keyboard layout. At least that is my position :)

The standard keyboard has a limiting number of keys, so you have to make
 compromises. It is generally accepted that Polish keyboard layouts
 (there are primarily two of them) does not contain apostrophe or single
 quotations marks. There is a proposal by Marcin Woli?ski

http://marcinwolinski.pl/keyboard/

 which is available in most Linux distributions but it does not seem
 popular.


From unicode at unicode.org  Tue Sep  4 21:08:40 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Wed, 5 Sep 2018 04:08:40 +0200 (CEST)
Subject: CLDR [terminating]
Message-ID: <1219205358.11203.1536113320424.JavaMail.www@wwinf1h11>

Sorry for not noticing that this thread belongs to CLDR-users, not to Unicode Public.
Hence I?m taking it off this list, welcoming participants to follow up there:

https://unicode.org/pipermail/cldr-users/2018-September/000833.html


From unicode at unicode.org  Thu Sep  6 11:58:22 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 06 Sep 2018 09:58:22 -0700
Subject: UCD in XML or in
 =?UTF-8?Q?CSV=3F=20=28is=3A=20UCD=20in=20YAML=29?=
Message-ID: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>

Marcel Schneider wrote:
 
> BTW what I conjectured about the role of line breaks is true for CSV
> too, and any file downloaded from UCD on a semicolon separator basis
> becomes unusable when displayed straight in the built-in text editor
> of Windows, given Unicode uses Unix EOL.
 
It's been well known for decades that Windows Notepad doesn't display
LF-terminated text files correctly. The solution is to use almost any
other editor. Notepad++ is free and a great alternative, but there are
plenty of others (no editor wars, please).
 
The RFC Editor site explains why it provides PDF versions of every RFC,
nearly all of which are plain text:
 
"The primary version of every RFC is encoded as an ASCII text file,
which was once the lingua franca of the computer world. However, users
of Microsoft Windows often have difficulty displaying vanilla ASCII text
files with the correct pagination."
 
which similarly assumes that "users of Microsoft Windows" have only
Notepad at their disposal.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Thu Sep  6 19:22:46 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Fri, 7 Sep 2018 05:52:46 +0530
Subject: Shortcuts question
Message-ID: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>

Hello. This may be slightly OT for this list but I'm asking it here as it
concerns computer usage with multiple scripts and i18n:

1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
"tout" io Ctrl+A for "all"?

2) How about when the shortcuts are the Alt+ combinations referring to
underlined letters in actual user visible strings?

3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
the other XCV shortcuts) Z key or the Y key which is in the physical
position of the QWERTY Z key (and close to the other XCV shortcuts)?

4) How are shortcuts handled in the case of non Latin keyboards like
Cyrillic or Japanese?

4a) I mean how are they displayed on screen?

4b) Like #1 above, are they changed per language?

4c) Like #2 above, how about for user visible shortcuts?

(In India since English is an associate official language, most computer
users are at least conversant with basic English so we use the
English/QWERTY shortcuts even if the keyboard physically shows an Indic
script.)

Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/05938aae/attachment.html>

From unicode at unicode.org  Thu Sep  6 22:27:08 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 7 Sep 2018 05:27:08 +0200 (CEST)
Subject: Shortcuts question
In-Reply-To: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
Message-ID: <407936671.61.1536290828336.JavaMail.www@wwinf1m09>

On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> 
> Hello. This may be slightly OT for this list but I'm asking it here as it concerns computer usage with multiple scripts and i18n:

It actually belongs on CLDR-users list. But coming from you, it shall remain here while I?m posting a quick answer below.

> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" io Ctrl+A for "all"?

No, Ctrl+A remains Ctrl+A on a French keyboard.

> 2) How about when the shortcuts are the Alt+ combinations referring to underlined letters in actual user visible strings?

I don?t know, but the accelerator shortcuts usually process text input, so it would be up to the vendor to keep them in sync.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the other XCV shortcuts) Z key or the Y key
> which is in the physical position of the QWERTY Z key (and close to the other XCV shortcuts)?

On Windows, that this question refers to, virtual keys move around with graphics on Latin keyboards. While Ctrl+Z on QWERTZ is 
not handy, I can tell that it is Ctrl+Z on AZERTY with the key having the Z on it and typing "z". The latter is most relevant on Linux
where graphics are used even to process the Ctrl+ shortcuts.

> 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic or Japanese?

On Windows as they depend on Virtual Keys, they may be laid out on an underlying QWERTY basis. The same may apply on macOS, 
where distinct levels are present in the XML keylayout (and likewise in system-shipped layouts) to map the letters associated with
shortcuts, regardless of the script. On Linux, shortcuts are reported not to work on some non-Latin keyboard layouts (because key names
are based on ISO key positions, and XKB doesn?t appear to use a "Group0" level to map the shortcut letters; needs to be investigated).

> 4a) I mean how are they displayed on screen??

My short answer is: I?ve got no experience; maybe using Latin letters and locale labels.

> 4b) Like #1 above, are they changed per language?

Non-Latin scripts typically use QWERTY for ASCII input, so shortcuts may not be changed per language.

> 4c) Like #2 above, how about for user visible shortcuts?

Again I?m leaving this over to non-Latin script experts.

> (In India since English is an associate official language, most computer users are at least conversant with basic English
> so we use the English/QWERTY shortcuts even if the keyboard physically shows an Indic script.)

The same applies to virtually any non-Latin locale. Michael Kaplan reported that only on Latin keyboards VKs move around.

> Thanks!

You are welcome.

Marcel


From unicode at unicode.org  Thu Sep  6 22:50:56 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Fri, 7 Sep 2018 05:50:56 +0200 (CEST)
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
Message-ID: <884912406.176.1536292256610.JavaMail.www@wwinf1m09>

On 06/09/18 19:09 Doug Ewell via Unicode wrote:
> 
> Marcel Schneider wrote:
> 
> > BTW what I conjectured about the role of line breaks is true for CSV
> > too, and any file downloaded from UCD on a semicolon separator basis
> > becomes unusable when displayed straight in the built-in text editor
> > of Windows, given Unicode uses Unix EOL.
> 
> It's been well known for decades that Windows Notepad doesn't display
> LF-terminated text files correctly. The solution is to use almost any
> other editor. Notepad++ is free and a great alternative, but there are
> plenty of others (no editor wars, please).
> 
> The RFC Editor site explains why it provides PDF versions of every RFC,
> nearly all of which are plain text:
> 
> "The primary version of every RFC is encoded as an ASCII text file,
> which was once the lingua franca of the computer world. However, users
> of Microsoft Windows often have difficulty displaying vanilla ASCII text
> files with the correct pagination."
> 
> which similarly assumes that "users of Microsoft Windows" have only
> Notepad at their disposal.

Thank you, I?ve got the point.

I?m taking this opportunity to apologize and disclaim for this post of mine:

https://www.unicode.org/mail-arch/unicode-ml/y2018-m08/0134.html

where I was not joking, but completely out of matter, unable to make sense 
of the "Unicode Digest" subject line, that refers to a mail engine feature and 
remained unchanged due to limited editing capabilities in a cellphone mailer.
Likewise "unicode-request at unicode.org" is used by the engine for that purpose.

My apologies to Doug Ewell, and thanks for your kind reply taking the pain 
while having limited access to e-mail.

Best regards,

Marcel


From unicode at unicode.org  Fri Sep  7 08:03:46 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Fri, 7 Sep 2018 15:03:46 +0200 (CEST)
Subject: Shortcuts question
In-Reply-To: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
Message-ID: <534252510.112927.1536325426517@ox.hosteurope.de>

Shriramana Sharma:
> 
> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?

Some are, many are not. For instance, some text editors use a modifier key with F and K instead of B and I for bold ("fett") and italic ("kursiv").

> 2) How about when the shortcuts are the Alt+ combinations referring to
> underlined letters in actual user visible strings?

Those depend much more language dependent than Ctrl/Cmd shortcuts.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
> the other XCV shortcuts) Z key or the Y key which is in the physical
> position of the QWERTY Z key (and close to the other XCV shortcuts)?

For some shortcuts the key position is more important (e.g. the one left from the 1 key), for others it's the initial / conventional letter of the command. Most QWERTZ users are not used to expect the undo shortcut (Z) next to the keys for cut (X), copy (C) and paste (V). By the way, accompanying redo is notoriously inconsistent, sometimes Y, sometimes Shift+Z.

More serious problems arise with non-letter keys. For instance, square brackets [ and ] are readily available on the US / English keyboard layout, but require modifier keys like Shift or Alt on many other keyboard layouts, which may be the same ones as for the curly braces { and }. This means, some seemingly simple and intuitive shortcuts on an English keyboard become cumbersome on international ones.

From unicode at unicode.org  Fri Sep  7 12:55:43 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 7 Sep 2018 19:55:43 +0200
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
Message-ID: <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>

Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode <unicode at unicode.org>
a ?crit :

> Marcel Schneider wrote:
>
> > BTW what I conjectured about the role of line breaks is true for CSV
> > too, and any file downloaded from UCD on a semicolon separator basis
> > becomes unusable when displayed straight in the built-in text editor
> > of Windows, given Unicode uses Unix EOL.
>
> It's been well known for decades that Windows Notepad doesn't display
> LF-terminated text files correctly. The solution is to use almost any
> other editor. Notepad++ is free and a great alternative, but there are
> plenty of others (no editor wars, please).
>

This has changed recently in Windows 10, where the builtin Notepad app now
parses text files using LF only correctly (you can edit and save using the
same convention for newlines, which is now autodetected; Notepad still
creates new files using CRLF and saves them after edit using CRLF).

Notepad now displays the newline convention in the status bar as "Windows
(CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column
counters. There's still no preference interface to specify the default
convention: CRLF is still the the default for new files.

And no way to switch the convention before saving. In Notepad++ you do that
with menu "Edit" > "Convert newlines" and select one of "Convert to Windows
(CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)"
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/4976d6cf/attachment.html>

From unicode at unicode.org  Fri Sep  7 13:04:05 2018
From: unicode at unicode.org (J Decker via Unicode)
Date: Fri, 7 Sep 2018 11:04:05 -0700
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
Message-ID: <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>

On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode <
unicode at unicode.org> wrote:

>
>
> Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode <unicode at unicode.org>
> a ?crit :
>
>> Marcel Schneider wrote:
>>
>> > BTW what I conjectured about the role of line breaks is true for CSV
>> > too, and any file downloaded from UCD on a semicolon separator basis
>> > becomes unusable when displayed straight in the built-in text editor
>> > of Windows, given Unicode uses Unix EOL.
>>
>> It's been well known for decades that Windows Notepad doesn't display
>> LF-terminated text files correctly. The solution is to use almost any
>> other editor. Notepad++ is free and a great alternative, but there are
>> plenty of others (no editor wars, please).
>>
>
> This has changed recently in Windows 10, where the builtin Notepad app now
> parses text files using LF only correctly (you can edit and save using the
> same convention for newlines, which is now autodetected; Notepad still
> creates new files using CRLF and saves them after edit using CRLF).
>
> I would love to have a notepad that handled \n.
My system is up to date.
What update must I get to have notepad handle newline only files?
(and I dare say notepad is the ONLY program that doesn't handle either
convention, command line `edit` and `wordpad`(write) even handled them)
 I'm sure there exists other programs that do it wrong; but none I've ever
used or found, or written.

Notepad now displays the newline convention in the status bar as "Windows
> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column
> counters. There's still no preference interface to specify the default
> convention: CRLF is still the the default for new files.
>
> And no way to switch the convention before saving. In Notepad++ you do
> that with menu "Edit" > "Convert newlines" and select one of "Convert to
> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)"
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/b593e672/attachment.html>

From unicode at unicode.org  Fri Sep  7 13:18:09 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 7 Sep 2018 20:18:09 +0200
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
 <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
Message-ID: <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>

That version has been announced in the Windows 10 Hub several weeks ago. I
think it is part of the 1809 version (for now RS5 prerelease for Insiders)
that may be deployed in the final release coming soon.
I hope you'll have also the option to switch the newline convention after
loading and before saving to convert these newlines. and may be define the
new default preference, so we will finally forget the CRLF convention.

I have it working quite well inthe Insider fast ring.

In all IDE editors however (including Developer Studio), the 2 or 3
conventions were still available since long.

Le ven. 7 sept. 2018 ? 20:04, J Decker <d3ck0r at gmail.com> a ?crit :

>
>
> On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode <
> unicode at unicode.org> wrote:
>
>>
>>
>> Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode <unicode at unicode.org>
>> a ?crit :
>>
>>> Marcel Schneider wrote:
>>>
>>> > BTW what I conjectured about the role of line breaks is true for CSV
>>> > too, and any file downloaded from UCD on a semicolon separator basis
>>> > becomes unusable when displayed straight in the built-in text editor
>>> > of Windows, given Unicode uses Unix EOL.
>>>
>>> It's been well known for decades that Windows Notepad doesn't display
>>> LF-terminated text files correctly. The solution is to use almost any
>>> other editor. Notepad++ is free and a great alternative, but there are
>>> plenty of others (no editor wars, please).
>>>
>>
>> This has changed recently in Windows 10, where the builtin Notepad app
>> now parses text files using LF only correctly (you can edit and save using
>> the same convention for newlines, which is now autodetected; Notepad still
>> creates new files using CRLF and saves them after edit using CRLF).
>>
>> I would love to have a notepad that handled \n.
> My system is up to date.
> What update must I get to have notepad handle newline only files?
> (and I dare say notepad is the ONLY program that doesn't handle either
> convention, command line `edit` and `wordpad`(write) even handled them)
>  I'm sure there exists other programs that do it wrong; but none I've ever
> used or found, or written.
>
> Notepad now displays the newline convention in the status bar as "Windows
>> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column
>> counters. There's still no preference interface to specify the default
>> convention: CRLF is still the the default for new files.
>>
>> And no way to switch the convention before saving. In Notepad++ you do
>> that with menu "Edit" > "Convert newlines" and select one of "Convert to
>> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)"
>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/d2a909df/attachment.html>

From unicode at unicode.org  Fri Sep  7 13:19:58 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 7 Sep 2018 20:19:58 +0200
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
 <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
 <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
Message-ID: <CAGa7JC0rJ9byryrJS2afZd99CENRsX_MOsycxKohJ4Vr_2sOmw@mail.gmail.com>

See also this page:

https://blogs.windows.com/windowsexperience/2018/05/09/announcing-windows-10-insider-preview-build-17666/

Le ven. 7 sept. 2018 ? 20:18, Philippe Verdy <verdy_p at wanadoo.fr> a ?crit :

> That version has been announced in the Windows 10 Hub several weeks ago. I
> think it is part of the 1809 version (for now RS5 prerelease for Insiders)
> that may be deployed in the final release coming soon.
> I hope you'll have also the option to switch the newline convention after
> loading and before saving to convert these newlines. and may be define the
> new default preference, so we will finally forget the CRLF convention.
>
> I have it working quite well inthe Insider fast ring.
>
> In all IDE editors however (including Developer Studio), the 2 or 3
> conventions were still available since long.
>
> Le ven. 7 sept. 2018 ? 20:04, J Decker <d3ck0r at gmail.com> a ?crit :
>
>>
>>
>> On Fri, Sep 7, 2018 at 10:58 AM Philippe Verdy via Unicode <
>> unicode at unicode.org> wrote:
>>
>>>
>>>
>>> Le jeu. 6 sept. 2018 ? 19:11, Doug Ewell via Unicode <
>>> unicode at unicode.org> a ?crit :
>>>
>>>> Marcel Schneider wrote:
>>>>
>>>> > BTW what I conjectured about the role of line breaks is true for CSV
>>>> > too, and any file downloaded from UCD on a semicolon separator basis
>>>> > becomes unusable when displayed straight in the built-in text editor
>>>> > of Windows, given Unicode uses Unix EOL.
>>>>
>>>> It's been well known for decades that Windows Notepad doesn't display
>>>> LF-terminated text files correctly. The solution is to use almost any
>>>> other editor. Notepad++ is free and a great alternative, but there are
>>>> plenty of others (no editor wars, please).
>>>>
>>>
>>> This has changed recently in Windows 10, where the builtin Notepad app
>>> now parses text files using LF only correctly (you can edit and save using
>>> the same convention for newlines, which is now autodetected; Notepad still
>>> creates new files using CRLF and saves them after edit using CRLF).
>>>
>>> I would love to have a notepad that handled \n.
>> My system is up to date.
>> What update must I get to have notepad handle newline only files?
>> (and I dare say notepad is the ONLY program that doesn't handle either
>> convention, command line `edit` and `wordpad`(write) even handled them)
>>  I'm sure there exists other programs that do it wrong; but none I've
>> ever used or found, or written.
>>
>> Notepad now displays the newline convention in the status bar as "Windows
>>> (CRLF)" or "Unix (LF)" (like Notepad++), just before the line/column
>>> counters. There's still no preference interface to specify the default
>>> convention: CRLF is still the the default for new files.
>>>
>>> And no way to switch the convention before saving. In Notepad++ you do
>>> that with menu "Edit" > "Convert newlines" and select one of "Convert to
>>> Windows (CR+LF)", "Convert to Unix (LF)" or "Convert to Mac (CR)"
>>>
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/7fa8d6f1/attachment.html>

From unicode at unicode.org  Fri Sep  7 14:47:44 2018
From: unicode at unicode.org (Rebecca Bettencourt via Unicode)
Date: Fri, 7 Sep 2018 12:47:44 -0700
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
 <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
 <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
Message-ID: <CAH=y87Y=ed7MWjpJ=WZ4P0f309kYcH+XpwoX7asb=4gHN5QK_Q@mail.gmail.com>

On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode <
unicode at unicode.org> wrote:

> That version has been announced in the Windows 10 Hub several weeks ago.
>

And it only took them 33 years. :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180907/a76dd9dc/attachment.html>

From unicode at unicode.org  Fri Sep  7 15:00:40 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 07 Sep 2018 23:00:40 +0300
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAH=y87Y=ed7MWjpJ=WZ4P0f309kYcH+XpwoX7asb=4gHN5QK_Q@mail.gmail.com>
 (message from Rebecca Bettencourt via Unicode on Fri, 7 Sep 2018
 12:47:44 -0700)
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
 <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
 <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
 <CAH=y87Y=ed7MWjpJ=WZ4P0f309kYcH+XpwoX7asb=4gHN5QK_Q@mail.gmail.com>
Message-ID: <83k1nxt6vr.fsf@gnu.org>

> Date: Fri, 7 Sep 2018 12:47:44 -0700
> Cc: d3ck0r at gmail.com, Doug Ewell <doug at ewellic.org>,
>         unicode <unicode at unicode.org>
> From: Rebecca Bettencourt via Unicode <unicode at unicode.org>
> 
> On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode <unicode at unicode.org> wrote:
> 
>  That version has been announced in the Windows 10 Hub several weeks ago.
> 
> And it only took them 33 years. :) 

That's OK, because Unix tools cannot handle Windows end-of-line format
to this very day.  About the only one I know of is Emacs (which
handles all 3 known EOL formats independently of the platform on which
it runs, since 20 years ago).

From unicode at unicode.org  Fri Sep  7 19:29:12 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 8 Sep 2018 02:29:12 +0200 (CEST)
Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD in YAML))
Message-ID: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21>

On 07/09/18 22:07 Eli Zaretskii via Unicode wrote:
> 
> > Date: Fri, 7 Sep 2018 12:47:44 -0700
> > Cc: d3ck0r at gmail.com, Doug Ewell ,
> > unicode 
> > From: Rebecca Bettencourt via Unicode 
> > 
> > On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode  wrote:
> > 
> > That version has been announced in the Windows 10 Hub several weeks ago.
> > 
> > And it only took them 33 years. :) 
> 
> That's OK, because Unix tools cannot handle Windows end-of-line format
> to this very day. About the only one I know of is Emacs (which
> handles all 3 known EOL formats independently of the platform on which
> it runs, since 20 years ago).

What are you referring to when you say ?Unix tools??
Another text editor?the built-in one of many Linux distributions?Gedit allows 
to choose from ?Unix/Linux?, ?Mac OS Classic?, and ?Windows?, in the Save dialog.
But in the preferences I cannot retrieve how to default it to any of the latter two.
I?m referring to Ubuntu 16.04.

When on Windows in Notepad++ I prefer LF over CRLF because it makes for simpler 
regexes, and the middle thing between these and plain search is more handy too.
(I use \n in regexes rather than the $ convention.)

Thanks to Philippe for the Windows 10 news!

Best regards,

Marcel


From unicode at unicode.org  Fri Sep  7 20:03:38 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sat, 8 Sep 2018 03:03:38 +0200 (CEST)
Subject: Shortcuts question (is: Thread transfer info)
In-Reply-To: <534252510.112927.1536325426517@ox.hosteurope.de>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <534252510.112927.1536325426517@ox.hosteurope.de>
Message-ID: <1585982937.16442.1536368619015.JavaMail.www@wwinf1m21>

Hello,

I?ve followed up on CLDR-users:

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

As a sidenote ? It might be hard to get a selection of discussions 
actually happen on CLDR-users instead of Unicode Public mail list, 
as long as subscribers of this list don?t necessarily subscribe to 
the other list, too, that still has way less subscribers than Unicode Public.

Regards,

Marcel


From unicode at unicode.org  Fri Sep  7 20:50:38 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Sat, 8 Sep 2018 10:50:38 +0900
Subject: UCD in XML or in CSV? (is: UCD in YAML)
In-Reply-To: <CAH=y87Y=ed7MWjpJ=WZ4P0f309kYcH+XpwoX7asb=4gHN5QK_Q@mail.gmail.com>
References: <20180906095822.665a7a7059d7ee80bb4d670165c8327d.e4eead3728.wbe@email03.godaddy.com>
 <CAGa7JC3USogB5HGow2+y6NEEn=-APGEvc5Pp_Rxk-AJJEFcQhQ@mail.gmail.com>
 <CAA2GJqXCgHX5+wjJEhdSiNNngh4AoMa-106Sb9MwRfU7hXk4jg@mail.gmail.com>
 <CAGa7JC32jS366yLvesSTnPioGEPk=ZVSt_8x4KHO3PF28gOZwg@mail.gmail.com>
 <CAH=y87Y=ed7MWjpJ=WZ4P0f309kYcH+XpwoX7asb=4gHN5QK_Q@mail.gmail.com>
Message-ID: <67b2d03c-d565-8cae-908d-a3519eceb8eb@it.aoyama.ac.jp>

On 2018/09/08 04:47, Rebecca Bettencourt via Unicode wrote:
> On Fri, Sep 7, 2018 at 11:20 AM Philippe Verdy via Unicode <
> unicode at unicode.org> wrote:
> 
>> That version has been announced in the Windows 10 Hub several weeks ago.
>>
> 
> And it only took them 33 years. :)

I used to joke that Notepad would add one single feature for each new 
version of Windows. I think that was when the Save-As feature was added.

For a long time, I have set up Notepad++ to come up when Notepad is invoked.

Regards,    Martin.

From unicode at unicode.org  Sat Sep  8 01:47:23 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 08 Sep 2018 09:47:23 +0300
Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD in YAML))
In-Reply-To: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21> (message
 from Marcel Schneider on Sat, 8 Sep 2018 02:29:12 +0200 (CEST))
References: <949353676.16405.1536366552164.JavaMail.www@wwinf1m21>
Message-ID: <83d0totric.fsf@gnu.org>

> Date: Sat, 8 Sep 2018 02:29:12 +0200 (CEST)
> From: Marcel Schneider <charupdate at orange.fr>
> Cc: RebeccaBettencourt <beckiergb at gmail.com>, verdy_p at wanadoo.fr, 
> 	d3ck0r at gmail.com, doug at ewellic.org, unicode at unicode.org
> 
> > > And it only took them 33 years. :) 
> > 
> > That's OK, because Unix tools cannot handle Windows end-of-line format
> > to this very day. About the only one I know of is Emacs (which
> > handles all 3 known EOL formats independently of the platform on which
> > it runs, since 20 years ago).
> 
> What are you referring to when you say ?Unix tools??

Sed and Grep don't consider CRLF as end of line, so regexps with $
fail to work as intended; the shell and/or the kernel don't recognize
the shebang sequence if it ends in CRLF, system editors display those
pesky "^M" at the end of each line, etc.  And if you have bad luck of
using a Mac-style file, where a single CR ends a line, all bets are
off.

> Another text editor?the built-in one of many Linux distributions?Gedit allows 
> to choose from ?Unix/Linux?, ?Mac OS Classic?, and ?Windows?, in the Save dialog.

Gedit is not a valid example when you compare it with Notepad.  Please
compare with editors which come with the OS out of the box: ed, ex,
vi, etc.  Because Gedit and Emacs are also available on Windows, so
they make the point moot.

From unicode at unicode.org  Sat Sep  8 11:36:00 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Sat, 8 Sep 2018 18:36:00 +0200
Subject: Unicode String Models
Message-ID: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>

I recently did some extensive revisions of a paper on Unicode string models
(APIs). Comments are welcome.

https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180908/d824614b/attachment.html>

From unicode at unicode.org  Sat Sep  8 16:01:32 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Sat, 8 Sep 2018 15:01:32 -0600
Subject: EOL conventions (was: Re: UCD in XML or in CSV? (is: UCD
In-Reply-To: <mailman.5.1536426001.30387.unicode@unicode.org>
References: <mailman.5.1536426001.30387.unicode@unicode.org>
Message-ID: <1F9AC3158DC04AF2A626C60FAD00B877@DougEwell>

To finish (I hope) this thread:

1. Glad to know that Notepad is getting some modern updates, even if 
belatedly.

2. Sorry that there are still tools out there, on different platforms, 
that can't handle each other's EOL conventions. (Of course, this is the 
problem Unicode was trying to solve by introducing LS and PS, but we 
know how that went.)

3. Unicode data files can be read and processed on any platform, but 
some careful choice of reading and processing tools might be advisable.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From unicode at unicode.org  Sun Sep  9 02:59:29 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 9 Sep 2018 08:59:29 +0100
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
Message-ID: <20180909085929.2d4ff0d2@JRWUBU2>

On Sat, 8 Sep 2018 18:36:00 +0200
Mark Davis ?? via Unicode <unicode at unicode.org> wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> 
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#


Theoretically at least, the cost of indexing a big string by codepoint
is negligible.  For example, cost of accessing the middle character is
O(1)*, not O(n), where n is the length of the string.  The trick is to
use a proportionately small amount of memory to store and maintain a
partial conversion table from character index to byte index.  For
example, Emacs claims to offer O(1) access to a UTF-8 buffer by
character number, and I can't significantly fault the claim.

*There may be some creep, but it doesn't matter for strings that can be
stored within a galaxy.

Of course, the coefficients implied by big-oh notation also matter.
For example, it can be very easy to forget that a bubble sort is often
the quickest sorting algorithm.

You keep muttering that a a sequence of 8-bit code units can contain
invalid sequences, but often forget that that is also true of sequences
of 16-bit code units.  Do emoji now ensure that confusion between
codepoints and code units rapidly comes to light?

You seem to keep forgetting that grapheme clusters are not how some
people people work.  Does the English word 'caf?' contain the letter
'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
one might want to look for the letter '?' in Vietnamese and find it
whatever the associated tone mark is.

You didn't discuss substrings.  I'm interested in how subsequences of
strings are defined, as the concept of 'substring' isn't really Unicode
compliant.  Again, expressing '?' as a subsequence of the Vietnamese
word 'n?ng' ought to be possible, whether one is using NFD (easier) or
NFC.  (And there are alternative normalisations that are compatible
with canonical equivalence.)  I'm most interested in subsequences X of a
word W where W is the same as AXB for some strings A and B.

Richard.


From unicode at unicode.org  Sun Sep  9 03:00:27 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Sun, 9 Sep 2018 10:00:27 +0200
Subject: Unicode String Models
In-Reply-To: <CAD2gp_T-f4tO_xHRCTQFR0BSnj6ZwhQuG5jLOPFte_iHtwnVSA@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAD2gp_T-f4tO_xHRCTQFR0BSnj6ZwhQuG5jLOPFte_iHtwnVSA@mail.gmail.com>
Message-ID: <CAJ2xs_GujYkQW=AkkBZF0=D_KxPbvRWn7i167e4qqziO2_6E7Q@mail.gmail.com>

Thanks, excellent comments. While it is clear that some string models have
more complicated structures (with their own pros and cons), my focus was on
simple internal structures. The focus was also on immutable strings ? and
the tradeoffs for mutable ones can be quite different ? and that needs to
be clearer. I'll add some material about those two areas (with pointers to
sources where possible).

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan <cowan at ccil.org> wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ?? via Unicore <
> unicore at unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180909/3cb75f48/attachment.html>

From unicode at unicode.org  Sun Sep  9 03:56:15 2018
From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode)
Date: Sun, 09 Sep 2018 10:56:15 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 (Mark Davis's message of "Sat, 8 Sep 2018 18:36:00 +0200")
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
Message-ID: <868t4b3v80.fsf@mimuw.edu.pl>

On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ?? via Unicode wrote:
> I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

It's a good opportunity to propose a better term for "extended grapheme
cluster", which usually are neither extended nor clusters, it's also not
obvious that they are always graphemes.

Cf.the earlier threads

https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html

Best regards

Janusz

-- 
             ,   
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien


From unicode at unicode.org  Sun Sep  9 08:42:19 2018
From: unicode at unicode.org (=?utf-8?Q?Daniel_B=C3=BCnzli?= via Unicode)
Date: Sun, 9 Sep 2018 15:42:19 +0200
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
Message-ID: <etPan.5b95233b.6c02cdbb.efff@erratique.ch>

Hello,?

I find your notion of "model" and presentation a bit confusing since it conflates what I would call the internal representation and the API.?

The internal representation defines how the Unicode text is stored and should not really matter to the end user of the string data structure. The API defines how the Unicode text is accessed, expressed by what is the result of an indexing operation on the string. The latter is really what matters for the end-user and what I would call the "model".

I think the presentation would benefit from making a clear distinction between the internal representation and the API; you could then easily summarize them in a table which would make a nice summary of the design space.

I also think you are missing one API which is the one with ECG I would favour: indexing returns Unicode scalar values,?internally be it whatever you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended by the "Code Point Model: Internal 8/16/32" but that's not what it says, the distinction between code point and scalar value is an important one and I think it would be good to insist on it to clarify the minds in such documents.

Best,?

Daniel


From unicode at unicode.org  Sun Sep  9 09:10:26 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 9 Sep 2018 16:10:26 +0200
Subject: Unicode String Models
In-Reply-To: <20180909085929.2d4ff0d2@JRWUBU2>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
Message-ID: <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>

Le dim. 9 sept. 2018 ? 10:10, Richard Wordingham via Unicode <
unicode at unicode.org> a ?crit :

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ?? via Unicode <unicode at unicode.org> wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>

I fully agree, as long as the "middle" character is **approximated** by the
middle of the **encoded** length.

But if it has to be the exact middle (by code point number), you have to
count the codepoints exactly by parsing the whole string as O(n), then
compute the middle from it and parse again from the begining to locate the
encoded position of that code point index as O(n/2) so the final cost is
O(n*3/2).

The trick using a "small amount" of memory only is only to avoid the second
parsing to get a O(n) result. You get O(1)* only if you keep that "small
memory" to locate ofthe indexes. But the claim that it is "small" is wrong
if the string is large (big value n). and has no interest if the string is
indexed only once.

In practive, we use a memory by preparing the "small memory" while
instantiating a new iterator that will process the whole string (which may
not be fully loaded in memory, in which case that "small memory" will need
reallocation as we also read the whole string (but not necessarily keep it
in memory if it's a very long text file: the index buffer will still remain
in memory even if we no longer need to come back to the start of the
string). That "small memory" is just a local helper, its cost must be
evaluated. In practice however, long texts come from I/O: the text will
have its interface from files, in which case you'll benefit from the
filesystem cache of the OS to save I/O, or from network (in which case
you'll need to store the network data in a local temporary file if you
don't want to keep it fully in memory and allow some parts to be paged out
of memory by the OS. But in Emacs, it only works with files: network texts
are necessarily backed at least by a local temporary file.

So that "small memory" for the index is not even needed (but Emacs
maintains an index in memory only to locate line numbers. It has no need to
do that for column numbers, as it is just faster to rescan the line (and
extremely long lines of text are exceptional, these files are rarely edited
with Emacs, unless you use it to load a binary file, whose representation
on screen will be very different, notably for controls, which are expanded
into another cached form: the column index for display, which is different
from the code point index and specific to the Emacs representation for
display/editing, is built only line by line, separately from the line index
kept for the whole edited file; it is also independant of the effective
encoding: it would still be needed even if the encoding of the backing
buffer was UTF-32 with only 1 codepoint per code unit, becase the actual
display will still expand the code points to other forms using visible
escaping mechanisms, and it is even needed when the file is pure 7-bit
ASCII, and kept with one byte per code point: choosing the Unicode encoding
forms has no impact at all to what is really needed for display in text
editors).

Text editors use various indexing caches always, to manage memory, I/O, and
allow working on large texts even on systems with low memory available. As
much as possible they attempt to use the OS-level caches of the filesystem.
And in all cases, they don't work directly on their text buffer (whose
internal represenation in their backing store is not just a single string,
but a structured collection of buffers, built on top of an interface
masking the details: the effective text will then be reencoded and saved
from that object, using complex serialization schemes; the text buffer is
"virtualized").

Only very basic text editors (such as Notepad) use a native single text
buffer, but they are very slow when editing very large files as they
constantly need to copy/move large blocks of memory to perform
inserts/deletions, and they also use too much the memory reallocator. Even
vi(m) or (s)ed in Unix/Linux now use another internal encoded form with a
temporary backing store in temporary files, created automatically when
needed as you start modifying the content. The final consolidation and
serialization will occur only when saving the result.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180909/0d8140e4/attachment.html>

From unicode at unicode.org  Sun Sep  9 10:53:12 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sun, 09 Sep 2018 18:53:12 +0300
Subject: Unicode String Models
In-Reply-To: <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>
 (message from Philippe Verdy via Unicode on Sun, 9 Sep 2018 16:10:26
 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
 <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>
Message-ID: <838t4ar7kn.fsf@gnu.org>

> Date: Sun, 9 Sep 2018 16:10:26 +0200
> Cc: unicode Unicode Discussion <unicode at unicode.org>
> From: Philippe Verdy via Unicode <unicode at unicode.org>
> 
> In practive, we use a memory by preparing the "small memory" while instantiating a new iterator that will
> process the whole string (which may not be fully loaded in memory, in which case that "small memory" will
> need reallocation as we also read the whole string (but not necessarily keep it in memory if it's a very long
> text file: the index buffer will still remain in memory even if we no longer need to come back to the start of the
> string). That "small memory" is just a local helper, its cost must be evaluated. In practice however, long texts
> come from I/O: the text will have its interface from files, in which case you'll benefit from the filesystem cache
> of the OS to save I/O, or from network (in which case you'll need to store the network data in a local
> temporary file if you don't want to keep it fully in memory and allow some parts to be paged out of memory by
> the OS. But in Emacs, it only works with files: network texts are necessarily backed at least by a local
> temporary file.

Emacs maintains caches for byte to character conversions for both
strings and buffers.  The cache holds data only for the last string
and separately the last buffer where Emacs needed to convert character
counts to byte counts or vice versa.  For buffers, there are 4 places
that are maintained for every buffer at all times, for which both the
character and byte positions are known, and Emacs uses those whenever
it needs to do conversions for a buffer that is not the cached one.

> So that "small memory" for the index is not even needed (but Emacs maintains an index in memory only to
> locate line numbers.

That's a different cache, unrelated to what Richard was alluding to
(and I think unrelated to the current discussion).

> Text editors use various indexing caches always, to manage memory, I/O, and allow working on large texts
> even on systems with low memory available. As much as possible they attempt to use the OS-level caches
> of the filesystem. And in all cases, they don't work directly on their text buffer (whose internal represenation in
> their backing store is not just a single string, but a structured collection of buffers, built on top of an interface
> masking the details: the effective text will then be reencoded and saved from that object, using complex
> serialization schemes; the text buffer is "virtualized").

In Emacs, buffer text is a character string with a gap, actually.

From unicode at unicode.org  Sun Sep  9 12:35:47 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 9 Sep 2018 19:35:47 +0200
Subject: Unicode String Models
In-Reply-To: <838t4ar7kn.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
 <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>
 <838t4ar7kn.fsf@gnu.org>
Message-ID: <CAGa7JC3v25N6m5tqmVLRR_DATT0hJ_Yv37=D7iXTPvj8f4fEVQ@mail.gmail.com>

Le dim. 9 sept. 2018 ? 17:53, Eli Zaretskii <eliz at gnu.org> a ?crit :

> > Text editors use various indexing caches always, to manage memory, I/O,
> and allow working on large texts
> > even on systems with low memory available. As much as possible they
> attempt to use the OS-level caches
> > of the filesystem. And in all cases, they don't work directly on their
> text buffer (whose internal represenation in
> > their backing store is not just a single string, but a structured
> collection of buffers, built on top of an interface
> > masking the details: the effective text will then be reencoded and saved
> from that object, using complex
> > serialization schemes; the text buffer is "virtualized").
>
> In Emacs, buffer text is a character string with a gap, actually.
>

A text buffer with gaps is a complex structure, not just a plain string.
Gaps are one way to manage memory more efficiently and get reasonnable
performance when editing, without having to constantly move large blocks:
these "strings" with gaps may then actually be just a byte buffer using as
a backing store, but that buffer alone does not represent only the
currently represented text. A process will still serialize and perform
cleanup befire this buffer can be used to save the edited text. Emacs may
not necasserily unallocate the end of the buffer, but I doubt it constantly
uses a single gap at end (insertions and deletions in the middle would
constant move large blocks and use excessive CPU and memory bandwidth, with
very slow response: users do not want to see what they type appearing on
the screen at one keystroke every few seconds because each typed key causes
massive block moves and excessive memory paging from/to disk while this
move is being performed).

All editors I have seen treat the text as ordered collections of small
buffers (these small buffers may still have small gaps), which are
occasionnally merged or splitted when needed (merging does not cause any
reallocation but may free one of the buffers), some of them being paged out
to tempoary files when memory is stressed. There are some heuristics in the
editor's code to when mainatenance of the collection is really needed and
useful for the performance.

But beside this the performance cost of UTF indexing of the codepoints is
invisible: each buffer will only need to avoid breaking text between
codepoint boundaries, if the current encoding of the edited text is an UTF.
An editor may also avoid breaking buffers in the middle of clusters if they
render clusters (including ligatures if they are supported): clusters are
still small in size in every encoding and reasonnable buffer sizes can hold
at least hundreds of clusters (even the largest ones which occur rarely).
How editors will manage clusters to make them editable is dependant of the
implementation, buyt even the UTF or codepoints boundaries are not enough
to handle that. In all cases the logical text buffer is structured with a
complex backing store, where parts may be paged out (and will also include
more than just the current text, notably it will include parts of the
indexes, possibly in another temporary working file).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180909/e0e89307/attachment.html>

From unicode at unicode.org  Sun Sep  9 14:20:16 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sun, 09 Sep 2018 22:20:16 +0300
Subject: Unicode String Models
In-Reply-To: <CAGa7JC3v25N6m5tqmVLRR_DATT0hJ_Yv37=D7iXTPvj8f4fEVQ@mail.gmail.com>
 (message from Philippe Verdy on Sun, 9 Sep 2018 19:35:47 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
 <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>
 <838t4ar7kn.fsf@gnu.org>
 <CAGa7JC3v25N6m5tqmVLRR_DATT0hJ_Yv37=D7iXTPvj8f4fEVQ@mail.gmail.com>
Message-ID: <834leyqxzj.fsf@gnu.org>

> From: Philippe Verdy <verdy_p at wanadoo.fr>
> Date: Sun, 9 Sep 2018 19:35:47 +0200
> Cc: Richard Wordingham <richard.wordingham at ntlworld.com>, 
> 	unicode Unicode Discussion <unicode at unicode.org>
> 
>  In Emacs, buffer text is a character string with a gap, actually.
> 
> A text buffer with gaps is a complex structure, not just a plain string.

The difference is very small, and a couple of macros allow you to
almost forget about the gap.

> I doubt it constantly uses a single gap at end (insertions and deletions in the middle would
> constant move large blocks and use excessive CPU and memory bandwidth, with very slow response: users
> do not want to see what they type appearing on the screen at one keystroke every few seconds because each
> typed key causes massive block moves and excessive memory paging from/to disk while this move is being
> performed).

In Emacs, the gap is always where the text is inserted or deleted, be
it in the middle of text or at its end.

> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have
> small gaps), which are occasionnally merged or splitted when needed (merging does not cause any
> reallocation but may free one of the buffers), some of them being paged out to tempoary files when memory is
> stressed. There are some heuristics in the editor's code to when mainatenance of the collection is really
> needed and useful for the performance.

My point was to say that Emacs is not one of these editors you
describe.

> But beside this the performance cost of UTF indexing of the codepoints is invisible: each buffer will only need
> to avoid breaking text between codepoint boundaries, if the current encoding of the edited text is an UTF. An
> editor may also avoid breaking buffers in the middle of clusters if they render clusters (including ligatures if
> they are supported): clusters are still small in size in every encoding and reasonnable buffer sizes can hold at
> least hundreds of clusters (even the largest ones which occur rarely). How editors will manage clusters to
> make them editable is dependant of the implementation, buyt even the UTF or codepoints boundaries are not
> enough to handle that. In all cases the logical text buffer is structured with a complex backing store, where
> parts may be paged out (and will also include more than just the current text, notably it will include parts of the
> indexes, possibly in another temporary working file).

You ignore or disregard the need to represent raw bytes in editor
buffers.  That is when the encoding stops being "invisible".

From unicode at unicode.org  Mon Sep 10 11:05:48 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Mon, 10 Sep 2018 18:05:48 +0200
Subject: Unicode String Models
In-Reply-To: <834leyqxzj.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <20180909085929.2d4ff0d2@JRWUBU2>
 <CAGa7JC2n_J_V=CCMCArRBtqJO4B7EcjbNmgnGGp_m-E+hD+Tjg@mail.gmail.com>
 <838t4ar7kn.fsf@gnu.org>
 <CAGa7JC3v25N6m5tqmVLRR_DATT0hJ_Yv37=D7iXTPvj8f4fEVQ@mail.gmail.com>
 <834leyqxzj.fsf@gnu.org>
Message-ID: <E2D13154-1EA6-4821-8E10-590979341994@telia.com>


> On 9 Sep 2018, at 21:20, Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> 
> In Emacs, the gap is always where the text is inserted or deleted, be
> it in the middle of text or at its end.
> 
>> All editors I have seen treat the text as ordered collections of small buffers (these small buffers may still have
>> small gaps), which are occasionnally merged or splitted when needed (merging does not cause any
>> reallocation but may free one of the buffers), some of them being paged out to tempoary files when memory is
>> stressed. There are some heuristics in the editor's code to when mainatenance of the collection is really
>> needed and useful for the performance.
> 
> My point was to say that Emacs is not one of these editors you
> describe.

FYI, gap and rope buffers are described at [1-2]; also see the Emacs manual [3].

1. https://en.wikipedia.org/wiki/Gap_buffer
2. https://en.wikipedia.org/wiki/Rope_(data_structure)
3. https://www.gnu.org/software/emacs/manual/html_node/elisp/Buffer-Gap.html


From unicode at unicode.org  Tue Sep 11 05:12:40 2018
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Tue, 11 Sep 2018 13:12:40 +0300
Subject: Unicode String Models
In-Reply-To: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
Message-ID: <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>

On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ?? via Unicode
<unicode at unicode.org> wrote:
>
> I recently did some extensive revisions of a paper on Unicode string models (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

* The Grapheme Cluster Model seems to have a couple of disadvantages
that are not mentioned:
  1) The subunit of string is also a string (a short string conforming
to particular constraints). There's a need for *another* more atomic
mechanism for examining the internals of the grapheme cluster string.
  2) The way an arbitrary string is divided into units when iterating
over it changes when the program is executed on a newer version of the
language runtime that is aware of newly-assigned codepoints from a
newer version of Unicode.

 * The Python 3.3 model mentions the disadvantages of memory usage
cliffs but doesn't mention the associated perfomance cliffs. It would
be good to also mention that when a string manipulation causes the
storage to expand or contract, there's a performance impact that's not
apparent from the nature of the operation if the programmer's
intuition works on the assumption that the programmer is dealing with
UTF-32.

 * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
text node storage in Gecko, (I believe but am not 100% sure) V8 and,
optionally, HotSpot
(https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
That is, text has UTF-16 semantics, but if the high half of every code
unit in a string is zero, only the lower half is stored. This has
properties analogous to the Python 3.3 model, except non-BMP doesn't
expand to UTF-32 but uses UTF-16 surrogate pairs.

 * I think the fact that systems that chose UTF-16 or UTF-32 have
implemented models that try to save storage by omitting leading zeros
and gaining complexity and performance cliffs as a result is a strong
indication that UTF-8 should be recommended for newly-designed systems
that don't suffer from a forceful legacy need to expose UTF-16 or
UTF-32 semantics.

 * I suggest splitting the "UTF-8 model" into three substantially
different models:

 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
UTF-8-related operations are performed when ingesting byte-oriented
data. Byte buffers and text buffers are type-wise ambiguous. Only
iterating over byte data by code point gives the data the UTF-8
interpretation. Unless the data is cleaned up as a side effect of such
iteration, malformed sequences in input survive into output.

 2) UTF-8 without full trust in ability to retain validity (the model
of the UTF-8-using C++ parts of Gecko; I believe this to be the most
common UTF-8 model for C and C++, but I don't have evidence to back
this up): When data is ingested with text semantics, it is converted
to UTF-8. For data that's supposed to already be in UTF-8, this means
replacing malformed sequences with the REPLACEMENT CHARACTER, so the
data is valid UTF-8 right after input. However, iteration by code
point doesn't trust ability of other code to retain UTF-8 validity
perfectly and has "else" branches in order not to blow up if invalid
UTF-8 creeps into the system.

 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
have a different type in the type system than byte buffers. To go from
a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
has been tagged as valid UTF-8, the validity is trusted completely so
that iteration by code point does not have "else" branches for
malformed sequences. If data that the type system indicates to be
valid UTF-8 wasn't actually valid, it would be nasal demon time. The
language has a default "safe" side and an opt-in "unsafe" side. The
unsafe side is for performing low-level operations in a way where the
responsibility of upholding invariants is moved from the compiler to
the programmer. It's impossible to violate the UTF-8 validity
invariant using the safe part of the language.

 * After working with different string models, I'd recommend the Rust
model for newly-designed programming languages. (Not because I work
for Mozilla but because I believe Rust's way of dealing with Unicode
is the best I've seen.) Rust's standard library provides Unicode
version-independent iterations over strings: by code unit and by code
point. Iteration by extended grapheme cluster is provided by a library
that's easy to include due to the nature of Rust package management
(https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
buffer as a read-only byte buffer has zero run-time cost and allows
for maximally fast guaranteed-valid-UTF-8 output.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From unicode at unicode.org  Tue Sep 11 06:13:03 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 11 Sep 2018 14:13:03 +0300
Subject: Unicode String Models
In-Reply-To: <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 (message from Henri Sivonen via Unicode on Tue, 11 Sep 2018 13:12:40
 +0300)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
Message-ID: <83va7cmgn4.fsf@gnu.org>

> Date: Tue, 11 Sep 2018 13:12:40 +0300
> From: Henri Sivonen via Unicode <unicode at unicode.org>
> 
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
> 
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
> 
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
> 
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.

There's another model, the one used by Emacs.  AFAIU, it is different
from all the 3 you describe above.  In Emacs, each raw byte belonging
to a byte sequence which is invalid under UTF-8 is represented as a
special multibyte sequence.  IOW, Emacs's internal representation
extends UTF-8 with multibyte sequences it uses to represent raw bytes.
This allows mixing stray bytes and valid text in the same buffer,
without risking lossy conversions (such as those one gets under model
2 above).

From unicode at unicode.org  Tue Sep 11 09:19:58 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Tue, 11 Sep 2018 07:19:58 -0700
Subject: Unicode String Models
In-Reply-To: <83va7cmgn4.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org>
Message-ID: <CAJ2xs_E3RbhGpNGVPuPcJ=nGCJdxRxzNChY2sHZ5jBeqorxtEA@mail.gmail.com>

These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.

Mark


On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode at unicode.org> wrote:

> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode <unicode at unicode.org>
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180911/513a9411/attachment.html>

From unicode at unicode.org  Tue Sep 11 12:13:28 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 11 Sep 2018 19:13:28 +0200
Subject: Unicode String Models
In-Reply-To: <83va7cmgn4.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org>
Message-ID: <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>


> On 11 Sep 2018, at 13:13, Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> 
> In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

Can you give a reference detailing this format?


From unicode at unicode.org  Tue Sep 11 12:21:07 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 11 Sep 2018 20:21:07 +0300
Subject: Unicode String Models
In-Reply-To: <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com> (message from
 Hans =?utf-8?Q?=C3=85berg?= on Tue, 11 Sep 2018 19:13:28 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
Message-ID: <83h8iwlzlo.fsf@gnu.org>

> From: Hans ?berg <haberg-1 at telia.com>
> Date: Tue, 11 Sep 2018 19:13:28 +0200
> Cc: Henri Sivonen <hsivonen at hsivonen.fi>,
>  unicode at unicode.org
> 
> > In Emacs, each raw byte belonging
> > to a byte sequence which is invalid under UTF-8 is represented as a
> > special multibyte sequence.  IOW, Emacs's internal representation
> > extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> > This allows mixing stray bytes and valid text in the same buffer,
> > without risking lossy conversions (such as those one gets under model
> > 2 above).
> 
> Can you give a reference detailing this format?

There's no formal description as English text, if that's what you
meant.  The comments, macros and functions in the files
src/character.[ch] in the Emacs source tree tell most of that story,
albeit indirectly, and some additional info can be found in the
section "Text Representation" of the Emacs Lisp Reference manual.

From unicode at unicode.org  Tue Sep 11 13:14:30 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 11 Sep 2018 20:14:30 +0200
Subject: Unicode String Models
In-Reply-To: <83h8iwlzlo.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org>
Message-ID: <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>


> On 11 Sep 2018, at 19:21, Eli Zaretskii <eliz at gnu.org> wrote:
> 
>> From: Hans ?berg <haberg-1 at telia.com>
>> Date: Tue, 11 Sep 2018 19:13:28 +0200
>> Cc: Henri Sivonen <hsivonen at hsivonen.fi>,
>> unicode at unicode.org
>> 
>>> In Emacs, each raw byte belonging
>>> to a byte sequence which is invalid under UTF-8 is represented as a
>>> special multibyte sequence.  IOW, Emacs's internal representation
>>> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
>>> This allows mixing stray bytes and valid text in the same buffer,
>>> without risking lossy conversions (such as those one gets under model
>>> 2 above).
>> 
>> Can you give a reference detailing this format?
> 
> There's no formal description as English text, if that's what you
> meant.  The comments, macros and functions in the files
> src/character.[ch] in the Emacs source tree tell most of that story,
> albeit indirectly, and some additional info can be found in the
> section "Text Representation" of the Emacs Lisp Reference manual.

OK. If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs.


From unicode at unicode.org  Tue Sep 11 13:40:54 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 11 Sep 2018 21:40:54 +0300
Subject: Unicode String Models
In-Reply-To: <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com> (message from
 Hans =?utf-8?Q?=C3=85berg?= on Tue, 11 Sep 2018 20:14:30 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
Message-ID: <83efdznah5.fsf@gnu.org>

> From: Hans ?berg <haberg-1 at telia.com>
> Date: Tue, 11 Sep 2018 20:14:30 +0200
> Cc: hsivonen at hsivonen.fi,
>  unicode at unicode.org
> 
> If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs.

Yes.  And mixed encodings is not the only use case: it may well happen
that the initial attempt to decode the file uses incorrect assumption
about the encoding, for some reason.

In addition, it is important that changing some portion of the file,
then saving the modified text will never change any part that the user
didn't touch, as will happen if invalid sequences are rejected at
input time and replaced with something else.

From unicode at unicode.org  Tue Sep 11 14:10:03 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Tue, 11 Sep 2018 21:10:03 +0200
Subject: Unicode String Models
In-Reply-To: <83efdznah5.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org>
Message-ID: <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>


> On 11 Sep 2018, at 20:40, Eli Zaretskii <eliz at gnu.org> wrote:
> 
>> From: Hans ?berg <haberg-1 at telia.com>
>> Date: Tue, 11 Sep 2018 20:14:30 +0200
>> Cc: hsivonen at hsivonen.fi,
>> unicode at unicode.org
>> 
>> If one encounters a file with mixed encodings, it is good to be able to view its contents and then convert it, as I see one can do in Emacs.
> 
> Yes.  And mixed encodings is not the only use case: it may well happen
> that the initial attempt to decode the file uses incorrect assumption
> about the encoding, for some reason.
> 
> In addition, it is important that changing some portion of the file,
> then saving the modified text will never change any part that the user
> didn't touch, as will happen if invalid sequences are rejected at
> input time and replaced with something else.

Indeed, before UTF-8, in the 1990s, I recall some Russians using LaTeX files with sections in different Cyrillic and Latin encodings, changing the editor encoding while typing.


From unicode at unicode.org  Tue Sep 11 16:48:48 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 11 Sep 2018 22:48:48 +0100
Subject: Unicode String Models
In-Reply-To: <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org>
 <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org>
 <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org>
 <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
Message-ID: <20180911224848.3aa17406@JRWUBU2>

On Tue, 11 Sep 2018 21:10:03 +0200
Hans ?berg via Unicode <unicode at unicode.org> wrote:

> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> LaTeX files with sections in different Cyrillic and Latin encodings,
> changing the editor encoding while typing.

Rather like some of the old Unicode list archives, which are just
concatenations of a month's emails, with all sorts of 8-bit encodings
and stretches of base64.

Richard.


From unicode at unicode.org  Tue Sep 11 17:13:52 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Wed, 12 Sep 2018 00:13:52 +0200
Subject: Unicode String Models
In-Reply-To: <20180911224848.3aa17406@JRWUBU2>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
Message-ID: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>


> On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Tue, 11 Sep 2018 21:10:03 +0200
> Hans ?berg via Unicode <unicode at unicode.org> wrote:
> 
>> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> LaTeX files with sections in different Cyrillic and Latin encodings,
>> changing the editor encoding while typing.
> 
> Rather like some of the old Unicode list archives, which are just
> concatenations of a month's emails, with all sorts of 8-bit encodings
> and stretches of base64.

It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.


From unicode at unicode.org  Tue Sep 11 17:40:17 2018
From: unicode at unicode.org (J Decker via Unicode)
Date: Tue, 11 Sep 2018 15:40:17 -0700
Subject: Unicode String Models
In-Reply-To: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
Message-ID: <CAA2GJqVa5A0v4_q9Dr=Cyr7cu=Kf7mRM1zkk1S_iG=ktxHwJQA@mail.gmail.com>

On Tue, Sep 11, 2018 at 3:15 PM Hans ?berg via Unicode <unicode at unicode.org>
wrote:

>
> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
> >
> > On Tue, 11 Sep 2018 21:10:03 +0200
> > Hans ?berg via Unicode <unicode at unicode.org> wrote:
> >
> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
> >> LaTeX files with sections in different Cyrillic and Latin encodings,
> >> changing the editor encoding while typing.
> >
> > Rather like some of the old Unicode list archives, which are just
> > concatenations of a month's emails, with all sorts of 8-bit encodings
> > and stretches of base64.
>
> It might be useful to represent non-UTF-8 bytes as Unicode code points.
> One way might be to use a codepoint to indicate high bit set followed by
> the byte value with its high bit set to 0, that is, truncated into the
> ASCII range. For example, U+0080 looks like it is not in use, though I
> could not verify this.
>
>
it's used for character 0x400.   0xD0 0x80   or 0x8000   0xE8 0x80 0x80
(I'm probably off a bit in the leading byte)
UTF-8 can represent from 0 to 0x200000 every value; (which is all defined
codepoints) early varients can support up to U+7FFFFFFF...
and there's enough bits to carry the pattern forward to support 36 bits or
42 bits... (the last one breaking the standard a bit by allowing a byte
wihout one bit off... 0xFF would be the leadin)

0xF8-FF are unused byte values; but those can all be encoded into utf-8.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180911/368dfe75/attachment.html>

From unicode at unicode.org  Tue Sep 11 18:26:42 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 12 Sep 2018 00:26:42 +0100
Subject: Tamil Brahmi Short Mid Vowels
In-Reply-To: <CY4PR21MB06323400D6DDAC81A2613AD98E090@CY4PR21MB0632.namprd21.prod.outlook.com>
References: <20180721020131.4b22887b@JRWUBU2>
 <CAH-HCWVtjScTa-LZipxYFvEjT1cvjnfqhmc6ffZWiA6YQ6xLBA@mail.gmail.com>
 <20180721085026.6aa07876@JRWUBU2>
 <CY4PR21MB06323400D6DDAC81A2613AD98E090@CY4PR21MB0632.namprd21.prod.outlook.com>
Message-ID: <20180912002642.5e3c64a8@JRWUBU2>

On Wed, 29 Aug 2018 21:42:57 +0000
Andrew Glass via Unicode <unicode at unicode.org> wrote:

> Thank you Richard and Shriramana for bringing up this interesting
> problem.
> 
> I agree we need to fix this. I don?t want to fix this with a font
> hack or change to USE cluster rules or properties. I think the right
> place to fix this is in the encoding. This might be either a new
> character for Tamil Brahmi Pu??i ? as Shriramana has proposed
> (L2/12-226<http://www.unicode.org/L2/L2012/12226-brahmi-two-tamil-char.pdf>)
> ? or separate characters for Tamil Brahmi Short E and Tamil Brahmi
> Short O in independent and dependent forms (4 characters total). I?m
> inclined to think that a visible virama, Tamil Brahmi Pu??i, is the
> right approach.

While this would work, please remember that refusing to allow a virama
after a vowel also makes USE inappropriate for Khmer and Tai Tham,
which use H+consonant rather than consonant+H for subscript final
consonants.

Richard. 


From unicode at unicode.org  Tue Sep 11 18:41:03 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 12 Sep 2018 01:41:03 +0200
Subject: Unicode String Models
In-Reply-To: <CAA2GJqVa5A0v4_q9Dr=Cyr7cu=Kf7mRM1zkk1S_iG=ktxHwJQA@mail.gmail.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
 <CAA2GJqVa5A0v4_q9Dr=Cyr7cu=Kf7mRM1zkk1S_iG=ktxHwJQA@mail.gmail.com>
Message-ID: <CAGa7JC2YyNdq1GBqddgnSKUM4tC8PanQ6A16Oh+bC38nnRv0qw@mail.gmail.com>

No 0xF8..0xFF are not used at all in UTF-8; but U+00F8..U+00FF really
**do** have UTF-8 encodings (using two bytes).

The only safe way to represent arbitrary bytes within strings when they are
not valid UTF-8 is to use invalid UTF-8 sequences, i.e by using a
"UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!)

This is what Java does for representing U+0000 by (0xC0,0x80) in the
compiled Bytecode or via the C/C++ interface for JNI when converting the
java string buffer into a C/C++ string terminated by a NULL byte (not part
of the Java string content itself). That special sequence however is really
exposed in the Java API as a true unsigned 16-bit code unit (char) with
value 0x0000, and a valid single code point.

The same can be done for reencoding each invalid byte in non-UTF-8
conforming texts using sequences with a "UTF-8-like" scheme (still
compatible with plain UTF-8 for every valid UTF-8 texts): you may either:
  * (a) encode each invalid byte separately (using two bytes for each), or
by encoding them by groups of 3-bits (represented using bytes 0xF8..0FF)
and then needing 3 bytes in the encoding.
  * (b) encode a private starter (e.g. 0xFF), followed by a byte for the
length of the raw bytes sequence that follows, and then the raw bytes
sequence of that length without any reencoding: this will never be confused
with other valid codepoints (however this scheme may no longer be directly
indexable from arbitrary random positions, unlike scheme a which may be
marginally longer longer)
But both schemes (a) or (b) would be useful in editors allowing to edit
arbitrary binary files as if they were plain-text, even if they contain
null bytes, or invalid UTF-8 sequences (it's up to these editors to find a
way to distinctively represent these bytes, and a way to enter/change them
reliably.

There's also a possibility of extension if the backing store uses UTF-16,
as all code units 0x0000.0xFFFF are used, but one scheme is possible by
using unpaired surrogates (notably a low surrogate NOT prefixed by a high
surrogate: the low surrogate already has 10 useful bits that can store any
raw byte value in its lowest bits): this scheme allows indexing from random
position and reliable sequencial traversal in both directions (backward or
forward)...

... But the presence of such extension of UTF-16 means that all the
implementation code handling standard text has to detect unpaired
surrogates, and can no longer assume that a low surrogate necessarily has a
high surrogate encoded just before it: it must be tested and that previous
position may be before the buffer start, causing a possibly buffer overrun
in backward direction (so the code will need to also know the start
position of the buffer and check it, or know the index which cannot be
negative), possibly exposing unrelated data and causing some security
risks, unless the backing store always adds a leading "guard" code unit set
arbitrarily to 0x0000.


Le mer. 12 sept. 2018 ? 00:48, J Decker via Unicode <unicode at unicode.org> a
?crit :

>
>
> On Tue, Sep 11, 2018 at 3:15 PM Hans ?berg via Unicode <
> unicode at unicode.org> wrote:
>
>>
>> > On 11 Sep 2018, at 23:48, Richard Wordingham via Unicode <
>> unicode at unicode.org> wrote:
>> >
>> > On Tue, 11 Sep 2018 21:10:03 +0200
>> > Hans ?berg via Unicode <unicode at unicode.org> wrote:
>> >
>> >> Indeed, before UTF-8, in the 1990s, I recall some Russians using
>> >> LaTeX files with sections in different Cyrillic and Latin encodings,
>> >> changing the editor encoding while typing.
>> >
>> > Rather like some of the old Unicode list archives, which are just
>> > concatenations of a month's emails, with all sorts of 8-bit encodings
>> > and stretches of base64.
>>
>> It might be useful to represent non-UTF-8 bytes as Unicode code points.
>> One way might be to use a codepoint to indicate high bit set followed by
>> the byte value with its high bit set to 0, that is, truncated into the
>> ASCII range. For example, U+0080 looks like it is not in use, though I
>> could not verify this.
>>
>>
> it's used for character 0x400.   0xD0 0x80   or 0x8000   0xE8 0x80 0x80
> (I'm probably off a bit in the leading byte)
> UTF-8 can represent from 0 to 0x200000 every value; (which is all defined
> codepoints) early varients can support up to U+7FFFFFFF...
> and there's enough bits to carry the pattern forward to support 36 bits or
> 42 bits... (the last one breaking the standard a bit by allowing a byte
> wihout one bit off... 0xFF would be the leadin)
>
> 0xF8-FF are unused byte values; but those can all be encoded into utf-8.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180912/8faab3c0/attachment.html>

From unicode at unicode.org  Tue Sep 11 19:02:44 2018
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Wed, 12 Sep 2018 00:02:44 +0000
Subject: Tamil Brahmi Short Mid Vowels
In-Reply-To: <20180912002642.5e3c64a8@JRWUBU2>
References: <20180721020131.4b22887b@JRWUBU2>
 <CAH-HCWVtjScTa-LZipxYFvEjT1cvjnfqhmc6ffZWiA6YQ6xLBA@mail.gmail.com>
 <20180721085026.6aa07876@JRWUBU2>
 <CY4PR21MB06323400D6DDAC81A2613AD98E090@CY4PR21MB0632.namprd21.prod.outlook.com>
 <20180912002642.5e3c64a8@JRWUBU2>
Message-ID: <CY4PR21MB0632C18E4E22E8BFC7E160128E1B0@CY4PR21MB0632.namprd21.prod.outlook.com>


On Windows, Khmer is rendered with a dedicated shaping engine. I don't see a need to alter that engine or integrate Khmer with USE. How we fix Tai Tham, which does go to USE is a different matter. We need to work through the solution for Tai Tham. I'm opposed to a generic and broad relaxation of virama constraints in USE as that would have impact on many scripts that currently have no requirement for virama after vowels. I'm not opposed to a new Indic Syllabic Category that has virama-like features and is allowed to follow a vowel. If we establish such a property for Tai Tham, we can consider on a case-by-case basis if any virama characters would be better served by the new property?including Brahmi.

Cheers,

Andrew


-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, September 11, 2018 4:27 PM
To: unicode at unicode.org
Subject: Re: Tamil Brahmi Short Mid Vowels

On Wed, 29 Aug 2018 21:42:57 +0000
Andrew Glass via Unicode <unicode at unicode.org> wrote:

> Thank you Richard and Shriramana for bringing up this interesting 
> problem.
> 
> I agree we need to fix this. I don?t want to fix this with a font hack 
> or change to USE cluster rules or properties. I think the right place 
> to fix this is in the encoding. This might be either a new character 
> for Tamil Brahmi Pu??i ? as Shriramana has proposed
> (L2/12-226<https://na01.safelinks.protection.outlook.com/?url=http%3A%
> 2F%2Fwww.unicode.org%2FL2%2FL2012%2F12226-brahmi-two-tamil-char.pdf&am
> p;data=02%7C01%7CAndrew.Glass%40microsoft.com%7Cc8b7042add6043b2d79608
> d6183f443b%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63672305734730
> 4813&amp;sdata=raIc6m1AqKNg8WMpAployLZpkk9BthumjMx%2BPUlFVNE%3D&amp;re
> served=0>) ? or separate characters for Tamil Brahmi Short E and Tamil 
> Brahmi Short O in independent and dependent forms (4 characters 
> total). I?m inclined to think that a visible virama, Tamil Brahmi 
> Pu??i, is the right approach.

While this would work, please remember that refusing to allow a virama after a vowel also makes USE inappropriate for Khmer and Tai Tham, which use H+consonant rather than consonant+H for subscript final consonants.

Richard. 


From unicode at unicode.org  Tue Sep 11 21:34:21 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Wed, 12 Sep 2018 05:34:21 +0300
Subject: Unicode String Models
In-Reply-To: <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> (message from
 Hans =?utf-8?Q?=C3=85berg?= via Unicode on Wed, 12 Sep 2018 00:13:52 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
Message-ID: <83bm93mok2.fsf@gnu.org>

> Date: Wed, 12 Sep 2018 00:13:52 +0200
> Cc: unicode at unicode.org
> From: Hans ?berg via Unicode <unicode at unicode.org>
> 
> It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.

You must use a codepoint that is not defined by Unicode, and never
will.  That is what Emacs does: it extends the Unicode codepoint space
beyond 0x10FFFF.

From unicode at unicode.org  Tue Sep 11 21:47:06 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 11 Sep 2018 19:47:06 -0700
Subject: Tamil Brahmi Short Mid Vowels
In-Reply-To: <CY4PR21MB0632C18E4E22E8BFC7E160128E1B0@CY4PR21MB0632.namprd21.prod.outlook.com>
References: <20180721020131.4b22887b@JRWUBU2>
 <CAH-HCWVtjScTa-LZipxYFvEjT1cvjnfqhmc6ffZWiA6YQ6xLBA@mail.gmail.com>
 <20180721085026.6aa07876@JRWUBU2>
 <CY4PR21MB06323400D6DDAC81A2613AD98E090@CY4PR21MB0632.namprd21.prod.outlook.com>
 <20180912002642.5e3c64a8@JRWUBU2>
 <CY4PR21MB0632C18E4E22E8BFC7E160128E1B0@CY4PR21MB0632.namprd21.prod.outlook.com>
Message-ID: <991013af-87cf-1dee-c7ee-10b6a58b4422@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180911/41201090/attachment.html>

From unicode at unicode.org  Wed Sep 12 00:38:21 2018
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Wed, 12 Sep 2018 08:38:21 +0300
Subject: Unicode String Models
In-Reply-To: <83va7cmgn4.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org>
Message-ID: <CAJQvAucmi6X68gmXivUhToEv1sxWT+SkEbHAgHZQRfCDXTt0vA@mail.gmail.com>

On Tue, Sep 11, 2018 at 2:13 PM Eli Zaretskii <eliz at gnu.org> wrote:
>
> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode <unicode at unicode.org>
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).

I think extensions of UTF-8 that expand the value space beyond Unicode
scalar values and the problems these extensions are designed to solve
is a worthwhile topic to cover, but I think it's not the same topic as
in the document but a slightly adjacent topic.

On that topic, these two are relevant:
https://simonsapin.github.io/wtf-8/
https://github.com/kennytm/omgwtf8

The former is used in the Rust standard library in order to provide a
Unix-like view to Windows file paths in a way that can represent all
Windows file paths. File paths on Unix-like systems are sequences of
bytes whose presentable-to-humans interpretation (these days) is
UTF-8, but there's no guarantee of UTF-8 validity. File paths on
Windows are are sequences of unsigned 16-bit numbers whose
presentable-to-humans interpretation is UTF-16, but there's no
guarantee of UTF-16 validity. WTF-8 can represent all Windows file
paths as sequences of bytes such that the paths that are valid UTF-16
as sequences of 16-bit units are valid UTF-8 in the 8-bit-unit
representation. This allows application-visible file paths in the Rust
standard library to be sequences of bytes both on Windows and
non-Windows platforms and to be presentable to humans by decoding as
UTF-8 in both cases.

To my knowledge, the latter isn't in use yet. The implementation is
tracked in https://github.com/rust-lang/rust/issues/49802

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/

From unicode at unicode.org  Wed Sep 12 03:37:00 2018
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Wed, 12 Sep 2018 10:37:00 +0200
Subject: Unicode String Models
In-Reply-To: <83bm93mok2.fsf@gnu.org>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com> <83bm93mok2.fsf@gnu.org>
Message-ID: <CF70C6B6-D3AC-4D39-AE98-E6D9AFC3280F@telia.com>


> On 12 Sep 2018, at 04:34, Eli Zaretskii via Unicode <unicode at unicode.org> wrote:
> 
>> Date: Wed, 12 Sep 2018 00:13:52 +0200
>> Cc: unicode at unicode.org
>> From: Hans ?berg via Unicode <unicode at unicode.org>
>> 
>> It might be useful to represent non-UTF-8 bytes as Unicode code points. One way might be to use a codepoint to indicate high bit set followed by the byte value with its high bit set to 0, that is, truncated into the ASCII range. For example, U+0080 looks like it is not in use, though I could not verify this.
> 
> You must use a codepoint that is not defined by Unicode, and never
> will.  That is what Emacs does: it extends the Unicode codepoint space
> beyond 0x10FFFF.

The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints. Then U+0080 has had some use in other encodings, but it looks like not in Unicode itself. But one could use some other value or values, and mark it for this special purpose.

There are a number of other byte sequences that are in use, too, like overlong UTF-8. Also original UTF-8 can be extended to handle all 32-bit words, also those with the high bit set, then.


From unicode at unicode.org  Wed Sep 12 09:03:44 2018
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Wed, 12 Sep 2018 17:03:44 +0300
Subject: Unicode String Models
In-Reply-To: <CAGa7JC2YyNdq1GBqddgnSKUM4tC8PanQ6A16Oh+bC38nnRv0qw@mail.gmail.com>
 (message from Philippe Verdy via Unicode on Wed, 12 Sep 2018 01:41:03
 +0200)
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
 <CAA2GJqVa5A0v4_q9Dr=Cyr7cu=Kf7mRM1zkk1S_iG=ktxHwJQA@mail.gmail.com>
 <CAGa7JC2YyNdq1GBqddgnSKUM4tC8PanQ6A16Oh+bC38nnRv0qw@mail.gmail.com>
Message-ID: <838t46n77j.fsf@gnu.org>

> Date: Wed, 12 Sep 2018 01:41:03 +0200
> Cc: unicode Unicode Discussion <unicode at unicode.org>,
>         Richard Wordingham <richard.wordingham at ntlworld.com>,
>         Hans Aberg <haberg-1 at telia.com>
> From: Philippe Verdy via Unicode <unicode at unicode.org>
> 
> The only safe way to represent arbitrary bytes within strings when they are not valid UTF-8 is to use invalid
> UTF-8 sequences, i.e by using a "UTF-8-like" private extension of UTF-8 (that extension is still not UTF-8!)
> 
> This is what Java does for representing U+0000 by (0xC0,0x80) in the compiled Bytecode or via the C/C++
> interface for JNI when converting the java string buffer into a C/C++ string terminated by a NULL byte (not part
> of the Java string content itself). That special sequence however is really exposed in the Java API as a true
> unsigned 16-bit code unit (char) with value 0x0000, and a valid single code point.

That's more or less what Emacs does.

> But both schemes (a) or (b) would be useful in editors allowing to edit arbitrary binary files as if they were
> plain-text, even if they contain null bytes, or invalid UTF-8 sequences (it's up to these editors to find a way to
> distinctively represent these bytes, and a way to enter/change them reliably.

The experience in Emacs is that no serious text editor can decide that
it doesn't support these use cases.  Even if editing binary files is
out of scope, there will always be text files whose encoding is
unknowable and/or guessed/decided wrong, files with mixed encodings,
etc.

From unicode at unicode.org  Thu Sep 13 00:08:19 2018
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Thu, 13 Sep 2018 08:08:19 +0300
Subject: Unicode String Models
In-Reply-To: <CF70C6B6-D3AC-4D39-AE98-E6D9AFC3280F@telia.com>
References: <CAJ2xs_FaOLOJ+ixPNZbhghC4108E1xD1wtjSYRFiWNkDj0joRQ@mail.gmail.com>
 <CAJQvAuf0XkhEFNLgYzEaNNzTO_h21H=GrE3aqNccnvgPc6ySog@mail.gmail.com>
 <83va7cmgn4.fsf@gnu.org> <D0A3C04E-0BA4-4FE3-8CD1-B3819354BF2D@telia.com>
 <83h8iwlzlo.fsf@gnu.org> <275BC03D-47DB-474B-8377-0F6A5749E151@telia.com>
 <83efdznah5.fsf@gnu.org> <3F3B4617-115F-4F94-A9B9-B52B75DF7812@telia.com>
 <20180911224848.3aa17406@JRWUBU2>
 <0B2423BC-25BF-4742-BE97-333E4E87F4FF@telia.com>
 <83bm93mok2.fsf@gnu.org> <CF70C6B6-D3AC-4D39-AE98-E6D9AFC3280F@telia.com>
Message-ID: <CAJQvAueKZ1gs-NNnsvPN+mfPzhpjTFnmVHjuF-4YD44GBAkSFw@mail.gmail.com>

On Wed, Sep 12, 2018 at 11:37 AM Hans ?berg via Unicode
<unicode at unicode.org> wrote:
> The idea is to extend Unicode itself, so that those bytes can be represented by legal codepoints.

Extending Unicode itself would likely create more problems that it
would solve. Extending the value space of Unicode scalar values would
be extremely disruptive for systems whose design is deeply committed
to the current definitions of UTF-16 and UTF-8 staying unchanged.
Assigning a scalar value within the current Unicode scalar value space
to currently malformed bytes would have the problem of those scalar
values losing information whether they came from malformed bytes or
the well-formed encoding of those scalar values.

It seems better to let applications that have use cases that involve
representing non-Unicode values to use a special-purpose extension on
their own.

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From unicode at unicode.org  Sat Sep 15 08:36:37 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 15 Sep 2018 15:36:37 +0200
Subject: Shortcuts question
In-Reply-To: <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
Message-ID: <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>

Le ven. 7 sept. 2018 ? 05:43, Marcel Schneider via Unicode <
unicode at unicode.org> a ?crit :

> On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> >
> > Hello. This may be slightly OT for this list but I'm asking it here as
> it concerns computer usage with multiple scripts and i18n:
>
> It actually belongs on CLDR-users list. But coming from you, it shall
> remain here while I?m posting a quick answer below.
>
> > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?
>
> No, Ctrl+A remains Ctrl+A on a French keyboard.
>

Yes but the location on the keyboard maps to the same as CTRL+Q on a Qwerty
layout: CTRL+ASCII letter are mapped according to the layout of the letter
(without pressing CTRL) on the localized keyboard. Some keyboard layouts
don't have all the basic Latin letters becaues their language don't need it
(e.g. it may only have one of Q or K, but no C, or it may have no W, or
some letters may be holding combined diacritics or could be ligatures, but
usuall the basic Latin letter is still accessible by pressing another
control key or by switching the layout mode.

On non Latin keyboard layouts there's much more freedom, and CTRL+A may be
localized according to the main base letter assigned to the key (the
position of Latin letter is not always visible).

On tactile layouts you cannot guess where CTRL+Latin letter is located,
actually it may be accessible very differently on a separate layout for
controls, where they will be translated: the CTRL key is not necessarily
present, replaced usually by a single key for input mode selection (which
may be switching languages, or to emojis, or to
symbols/punctuations/digits)...

The problematic control keys are those like "CTRL+[" (assuming ASCII as the
base layout) where "[" is not present or mapped very differently. As well
"CTRL+1"..."CTRL+0" may conflict with the assignment of ASCII controls like
"CTRL+[".

So yes all control keys are potentially localisable to work best with the
base layout anre remaining mnemonic; but the physical key position may be
very different.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180915/c439354b/attachment.html>

From unicode at unicode.org  Sun Sep 16 07:08:55 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Sun, 16 Sep 2018 14:08:55 +0200 (CEST)
Subject: Shortcuts question
In-Reply-To: <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
 <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
Message-ID: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>

On 15/09/18 15:36, Philippe Verdy wrote:
[?]
> So yes all control keys are potentially localisable to work best with the base layout anre remaining mnemonic;
> but the physical key position may be very different.

An additional level of complexity is induced by ergonomics. so that most non-Latin layouts may wish to stick 
with QWERTY, and even ergonomic layouts in the footprints of August Dvorak rather than Shai Coleman are 
likely to offer variants with legacy Virtual Key mapping instead of staying in congruency with graphics optimized 
for text input. But again that is easier on Windows, where VKs are remapped separately, than on Linux that 
appears to use graphics throughout to process application shortcuts, and only modifiers can be "preserved" for
further processing, no underlying letter map that AFAIU appears not to exist on Linux.

However, about keyboarding, that may be technically too detailed for this List, so that I?ll step out of this thread 
here. Please follow up in parallel thread on CLDR-users instead.

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

Thanks,

Marcel


From unicode at unicode.org  Sun Sep 16 08:28:31 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 16 Sep 2018 15:28:31 +0200
Subject: Shortcuts question
In-Reply-To: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
 <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
 <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
Message-ID: <CAGa7JC008d1v_7hyWo7HmjYCOwvMM6QdWRLgUQUzwaYysKfmkg@mail.gmail.com>

For games, the mnemonic meaning of keys are unlikely to be used because
gamers prefer an ergonomic placement of their fingers according to the
physical position for essential commands.
But this won't apply to control keys, as these commands should be single
keystrokes and pressing two keys instead of one would be unpractical and
would be a disavantage when playing.

That's why the four most common 4 direction keys A/D/S/W on a QWERTY layout
will become Q/D/S/Z on a French AZERTY layout. Games that use logical key
layouts based on QWERTY are almost unplayable if there's no interface to
customize these 4 keys. So games preferably use the virtual keys instead
for these commands, or will include builtin layouts adapted for AZERTY and
QWERTZ-based layouts and still display the correct keycaps in the UI: games
normally don't force the switch to another US layout, so they still need to
use the logical layout, simply because they also need to allow users to
input real text and not jsut gaming commands (for messaging, or for
inputing custom players/objects created in the game itself, or to fill-in
user profiles, or input a registration email or to perform online logon
with the correct password), in which case they will also need to support
characters entered with control keys (AltGr, Shift, Control...), or with a
standard tactile panel on screen which will still display the common
localized layouts.

There are difficulties in games when some of their commands are mapped to
something else than just basic Latin letters (including decimal digits : on
a French AZERTY keyboard, the digits are composed by pressing Shift, or in
ShiftLock mode (there's no CapsLock mode as this ShiftLock is also released
when pressing Shift: just like on old French mechanical typewriters,
pressing ShiftLock again did not release it, and this ShiftLock applied to
all keys on the keyboard, including punctuation keys. On PC keyboards,
ShiftLock does not apply to the numeric pad which has its separate NumLock,
now largely redundant and that most users would like to disable completely
each time there's a numeric pad separated from the directional pad, on
these extended keyboards, NumLock is just a nuisance, notably on OS logon
screen when Windows turns it off by default unless the BIOS locks it at
boot time, and lot of BIOS don't do that or don't have the option to set it
permanently).


Le dim. 16 sept. 2018 ? 14:18, Marcel Schneider via Unicode <
unicode at unicode.org> a ?crit :

> On 15/09/18 15:36, Philippe Verdy wrote:
> [?]
> > So yes all control keys are potentially localisable to work best with
> the base layout anre remaining mnemonic;
> > but the physical key position may be very different.
>
> An additional level of complexity is induced by ergonomics. so that most
> non-Latin layouts may wish to stick
> with QWERTY, and even ergonomic layouts in the footprints of August Dvorak
> rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying in congruency with graphics optimized
> for text input. But again that is easier on Windows, where VKs are
> remapped separately, than on Linux that
> appears to use graphics throughout to process application shortcuts, and
> only modifiers can be "preserved" for
> further processing, no underlying letter map that AFAIU appears not to
> exist on Linux.
>
> However, about keyboarding, that may be technically too detailed for this
> List, so that I?ll step out of this thread
> here. Please follow up in parallel thread on CLDR-users instead.
>
> https://unicode.org/pipermail/cldr-users/2018-September/000837.html
>
> Thanks,
>
> Marcel
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180916/48aa2dfe/attachment.html>

From unicode at unicode.org  Sun Sep 16 22:38:28 2018
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Mon, 17 Sep 2018 12:38:28 +0900
Subject: Shortcuts question
In-Reply-To: <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
 <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
 <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
Message-ID: <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp>

On 2018/09/16 21:08, Marcel Schneider via Unicode wrote:

> An additional level of complexity is induced by ergonomics. so that most non-Latin layouts may wish to stick
> with QWERTY, and even ergonomic layouts in the footprints of August Dvorak rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of staying in congruency with graphics optimized
> for text input.

 From my personal experience: A few years ago, installing a Dvorak 
keyboard (which is what I use every day for typing) didn't remap the 
control keys, so that Ctrl-C was still on the bottom row of the left 
hand, and so on. For me, it was really terrible.

It may not be the same for everybody, but my experience suggests that it 
may be similar for some others, and that therefore such a mapping should 
only be voluntary, not default.

Regards,   Martin.


From unicode at unicode.org  Mon Sep 17 09:34:52 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 17 Sep 2018 16:34:52 +0200 (CEST)
Subject: Shortcuts question
In-Reply-To: <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
 <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
 <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
 <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp>
Message-ID: <813933389.12023.1537194892598.JavaMail.www@wwinf1m12>

On 17/09/18 05:38 Martin J. D?rst wrote:
[quote]
> 
> From my personal experience: A few years ago, installing a Dvorak 
> keyboard (which is what I use every day for typing) didn't remap the 
> control keys, so that Ctrl-C was still on the bottom row of the left 
> hand, and so on. For me, it was really terrible.
> 
> It may not be the same for everybody, but my experience suggests that it 
> may be similar for some others, and that therefore such a mapping should 
> only be voluntary, not default.

Got it, thanks!

Regards,

Marcel


From unicode at unicode.org  Mon Sep 17 09:47:57 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 17 Sep 2018 16:47:57 +0200
Subject: Shortcuts question
In-Reply-To: <813933389.12023.1537194892598.JavaMail.www@wwinf1m12>
References: <CAH-HCWWUahduMCCvhqg3+Rr8CoRJUTzckgxeQhQtW+bcoGg3-Q@mail.gmail.com>
 <407936671.61.1536290828336.JavaMail.www@wwinf1m09>
 <CAGa7JC3pMiNHRZfPHkF2j7N0deVqA1FgNNSD8BiYyVZbgtW+UQ@mail.gmail.com>
 <1195258399.3378.1537099735503.JavaMail.www@wwinf1m12>
 <6c9218b8-55de-788c-a1e0-ea5247048122@it.aoyama.ac.jp>
 <813933389.12023.1537194892598.JavaMail.www@wwinf1m12>
Message-ID: <CAGa7JC0fJDvsxuMJPoY6F4r2uEBAeKMLb5MeK1x_aH+DMqQDog@mail.gmail.com>

Note: CLDR concentrates on keyboard layout for text input. Layouts for
other functions (such as copy-pasting, gaming controls) are completely
different (and not necessarily bound directly to layouts for text, as they
may also have their own dedicated physical keys or users can reprogram
their keyboard for this; for gaming, softwares should all have a way to
customize the layout according to users need, and should provide
reasonnable defaults for at least the 3 base layouts: QWERTY, AZERTY and
QWERTZ, but I've never seen any game whose UI was tuned for Dvorak)

Le lun. 17 sept. 2018 ? 16:42, Marcel Schneider <charupdate at orange.fr> a
?crit :

> On 17/09/18 05:38 Martin J. D?rst wrote:
> [quote]
> >
> > From my personal experience: A few years ago, installing a Dvorak
> > keyboard (which is what I use every day for typing) didn't remap the
> > control keys, so that Ctrl-C was still on the bottom row of the left
> > hand, and so on. For me, it was really terrible.
> >
> > It may not be the same for everybody, but my experience suggests that it
> > may be similar for some others, and that therefore such a mapping should
> > only be voluntary, not default.
>
> Got it, thanks!
>
> Regards,
>
> Marcel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180917/d42bc89c/attachment.html>

From unicode at unicode.org  Mon Sep 17 14:50:05 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Mon, 17 Sep 2018 21:50:05 +0200 (CEST)
Subject: Group separator migration from U+00A0 to U+202F
Message-ID: <1855270252.16702.1537213805521.JavaMail.www@wwinf1m04>

For people monitoring this list but not CLDR-users:

To be cost-effective, the migration from the wrong U+00A0 to the correct U+202F as group separator 
should be synched across all locales using space instead of comma or period. SI is international and 
specifies narrow fixed-width no-break space as mandatory in the role of a numbers group separator.

That is the place to remember that Unicode would have had such a narrow fixed-width no-break space
from its very beginning on, if U+2008 PUNCTUATION SPACE had beed treated equally like its relative,
U+2007 FIGURE SPACE, both being designed for legacy-style hard-typeset tabular numbers representation.
We can only ask why it was not, without any hope of ever getting an authorized response on this list (see 
a recent thread about non-responsiveness; subscribers knowing the facts are here but don?t post anymore).

So this is definitely not the place to vent about that misdesign, but it is about the way of fixing it now.
After having painstakingly catched up support of some narrow fixed-width no-break space (U+202F). 
the industry is now ready to migrate from U+00A0 to U+202F. Doing it in a single rush is way more 
cost-effective than migrating one locale this time, another locale next time, a handful locales the time 
after, possibly splitting them up in sublocales with different migration schedules. I really believed that 
now Unicode proves ready to adopt the real group separator in French, all relevant locales would be 
consistently pushed for correcting that value in release 34. The v34 alpha overview makes clear they
are not.

http://cldr.unicode.org/index/downloads/cldr-34#TOC-Migration

I aimed at correcting an error in CLDR, not at making French stand out. Having many locales and 
sublocales stick with the wrong value makes no sense any more.

https://www.unicode.org/cldr/charts/34/by_type/numbers.symbols.html#a1ef41eaeb6982d

The only effect is implementers skipping migration for fr-FR while waiting for the others to catch up, 
then doing it for all at once.

There seems to be a misunderstanding: The *locale setting* is whether to use period, comma, space, 
apostrophe, U+066C ARABIC THOUSANDS SEPARATOR, or another graphic.
Whether "space" is NO-BREAK SPACE or NARROW NO-BREAK SPACE is **not a locale setting**,
but it?s all about Unicode *design* and Unicode *implementation.*
I really thought that that was clear and that there?s no need to heavily insist on the ST "French" forum.
When referring to the "French thousands separator" I only meant that unlike comma- or period-using 
locales, the French locale uses space and that the group separator space should be the correct one.
That did **not** mean that French should use *another* space than the other locales using space.

https://unicode.org/cldr/trac/ticket/11423

Regards,

Marcel


From unicode at unicode.org  Tue Sep 18 00:23:49 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 18 Sep 2018 07:23:49 +0200 (CEST)
Subject: Group separator migration from U+00A0 to U+202F
Message-ID: <1530688898.269.1537248229668.JavaMail.www@wwinf2219>

> I aimed at correcting an error in CLDR, not at making French stand out.

So I've to confess that I did focus on French and only applied for fr-FR, but 
there was a lot of work, see http://cldr.unicode.org/index/downloads/cldr-34#TOC-Growth
waiting for very few vetters. Nevertheless I also cared for English (see various tickets), 
and also posted on CLDR-users in a belated P.S. that fr-CA hadn?t caught up the group 
separator correction yet:
https://unicode.org/pipermail/cldr-users/2018-August/000825.html

Also I?m sorry for failing to provide appropriate feedback after beta release and to post 
upstream messages urging to make sure all locales using space for group separator be 
kept in synchrony.

I?m posting here with respect to people not monitoring CLDR-users Mail List. where this 
post is expanded. For further details and CLDR ticket link, please look up:

https://unicode.org/pipermail/cldr-users/2018-September/000843.html

Regards,

Marcel


From unicode at unicode.org  Sat Sep 29 10:07:31 2018
From: unicode at unicode.org (Andrew Swaine via Unicode)
Date: Sat, 29 Sep 2018 16:07:31 +0100
Subject: Shameless plug: Keyferret keyboard input system
Message-ID: <CAHH2FNwxnzAfpKCeQ-XHunoM5av59XqtDyu--nhBAQ_ibGZe0Q@mail.gmail.com>

It seems from reading back through the archives that efficient and
intuitive entry of Unicode characters is a topic that comes up from time to
time. I have built a new, free, Windows-based keyboard entry system for
Unicode characters that at least some of the people on this list might find
interesting.

This is effectively a super-Latin keyboard layout with support for the
majority of:

Basic Latin (ASCII), Latin-1 Supplement, Latin Extended-A, Latin
Extended-B, Latin Extended-C, Latin Extended-D, Latin Extended-E, Latin
Extended Additional, IPA Extensions, Phonetic Extensions, Phonetic
Extensions Supplement, Combining Diacritical Marks, Combining Diacritical
Marks Supplement, Letterlike Symbols, Mathematical Alphanumeric Symbols,
Enclosed Alphanumerics, Arrows, Mathematical Operators

Plus additional layouts selectable using CapsLock give support for:

Greek, Greek Extended
Cyrillic, Cyrillic Supplement, Cyrillic Extended-A, Cyrillic Extended-B

Characters are selected through a context-sensitive compose tree accessed
using the Right Alt (AltGr) key, with context-sensitive help in a box that
pops up when RAlt is held. Rather than using dead keys, keys are
context-sensitive on the previously entered characters. So for example,
typing "o" followed by RAlt+/ gives ?. Longer sequences give more complex
characters, e.g. RAlt+sh+ for ?. Characters are converted into
Normalization form C where possible, so "a" followed by RAlt+' gives \u00e1
(?), not a\u0301.

More information on www.keyferret.com if you're interested. If anyone is
interested in helping make it a better system, please get in touch.

Kind regards,

Andrew.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20180929/575a77b9/attachment.html>