From cldr-users at unicode.org Tue Aug 8 20:06:37 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Tue, 8 Aug 2017 18:06:37 -0700
Subject: Additional Word Break Questions
Message-ID:

Dear CLDR users,

As you may recall, I emailed this list a few months ago with a question about the word break rules, and today I've run into several more of what I think are disagreements between the word break rules and the published word break test cases.

*First Issue*

This is the word break test case in question:

÷ 200D ÷ 261D ÷

It would appear that rule 3.3 matches at index 1, i.e. the index between the two characters. Rule 3.3 is:

$ZWJ × ($Extended_Pict | $EmojiNRK)

Character 200D has word break property values of Extend and ZWJ, while character 261D has a word break property value of E_Base. Therefore, the left-hand side of rule 3.3 matches 200D and the right-hand side matches 261D. Since the rule indicates no break, I'm confused by the presence of this test case. What am I doing wrong here?

*Second Issue*

The other test cases my implementation is failing to pass are these:

÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷

In all cases, the issue lies with the expected non-break between the second and third characters, e.g. 1F1E6 and 1F1E7. The word break property value of both of these characters is Regional_Indicator. The only rule that looks like it might match is 15:

^$Regional_Indicator × $Regional_Indicator

However, rule 15 does not match.

Thanks for your help in advance!

-Cameron
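A note on the Regional_Indicator rules discussed here: the usual reading of the RI rules (WB15/WB16 in UAX #29) is that a break is forbidden between two RI code points only when an odd number of consecutive RIs immediately precede the boundary, so flags pair up from the left. Below is a minimal, illustrative Java sketch of that reading, not CLDR's reference implementation; the property lookup is stubbed out as a boolean array.

    public class RiWordBreakSketch {
        // True when the RI rules forbid a break before position pos, given
        // which positions hold Regional_Indicator code points.
        static boolean forbidsBreakBefore(boolean[] isRI, int pos) {
            if (pos <= 0 || pos >= isRI.length) return false; // sot/eot
            if (!isRI[pos - 1] || !isRI[pos]) return false;   // rule covers RI RI only
            int run = 0; // consecutive RIs immediately before the boundary
            for (int i = pos - 1; i >= 0 && isRI[i]; i--) run++;
            return run % 2 == 1; // odd: the current flag pair is still open
        }

        public static void main(String[] args) {
            // 0061, 1F1E6, 1F1E7, 1F1E8, 0062: expect "no break" only
            // between the first two regional indicators.
            boolean[] s = {false, true, true, true, false};
            for (int pos = 1; pos < s.length; pos++)
                System.out.println(pos + ": "
                        + (forbidsBreakBefore(s, pos) ? "no break" : "break"));
        }
    }

Run against a, RI, RI, RI, b this reports a non-break only between the first two RIs, matching the first test case quoted above.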
From cldr-users at unicode.org Wed Aug 9 04:23:44 2017
From: cldr-users at unicode.org (Martin Hosken via CLDR-Users)
Date: Wed, 9 Aug 2017 16:23:44 +0700
Subject: collation tailoring using before
Message-ID: <20170809162344.340d3a69@sil-mh7>

Dear All,

I am trying to tailor (for the sake of argument) \u0300 to be primary ignorable and have a secondary collation key less than that of a primary character (a).

I tried:

&[before 2][first primary ignorable] << \u0300

But then I get CEs of this form:

a      [2900.0500.0500]
\u0300 [0000.8000.0500]

I'm wondering how I can get \u0300 [0000.0400.0500].

TIA,
Yours,
Martin

From cldr-users at unicode.org Wed Aug 9 06:52:56 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Wed, 9 Aug 2017 13:52:56 +0200
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID:

What's the problem? [0000.8000.0500] is primary ignorable (0000) and then gets a synthetic secondary key (8000) whose value will not matter relative to "a", given that "a" has a non-zero primary key and will sort properly.

Note that we know that:

&[before 2][first primary ignorable] << "a"

and your tailoring says nothing about the relative secondary ordering between "a" and \u0300 (as in mathematics: from the conditions x < u and x < a you cannot say whether u < a or a < u; it is not specified).

I suppose you want to add this constraint:

&[before 2][first primary ignorable] >> \u0300

(similar to saying x > u and x < a, which would be equivalent to u < x < a).

2017-08-09 11:23 GMT+02:00 Martin Hosken via CLDR-Users <cldr-users at unicode.org>:

> Dear All,
>
> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a primary
> character (a).
>
> I tried:
>
> &[before 2][first primary ignorable] << \u0300
>
> But then I get CEs of this form:
>
> a      [2900.0500.0500]
> \u0300 [0000.8000.0500]
>
> I'm wondering how I can get \u0300 [0000.0400.0500].
>
> TIA,
> Yours,
> Martin

From cldr-users at unicode.org Wed Aug 9 08:42:41 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Wed, 9 Aug 2017 06:42:41 -0700
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID:

On Wed, Aug 9, 2017 at 2:23 AM, Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:

> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a primary
> character (a).
> [...]
> I'm wondering how I can get \u0300 [0000.0400.0500].

You can't, if you want to build a well-formed Collation Element Table: http://www.unicode.org/reports/tr10/#WF2

markus

From cldr-users at unicode.org Thu Aug 10 10:00:14 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Thu, 10 Aug 2017 16:00:14 +0100
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID: <20170810160014.51c4f050@JRWUBU2>

On Wed, 9 Aug 2017 16:23:44 +0700 Martin Hosken via CLDR-Users wrote:

> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a
> primary character (a).
>
> I tried:
>
> &[before 2][first primary ignorable] << \u0300
>
> But then I get CEs of this form:
>
> a      [2900.0500.0500]
> \u0300 [0000.8000.0500]
>
> I'm wondering how I can get \u0300 [0000.0400.0500].

What your declared goal would result in is

a << à < àb << ab

The assumption is that no one would want this, which is why such a collation is denigrated as ill-formed. (Now, DUCET is ill-formed, though that's not why ICU doesn't support it.)

If what you want is

à << a < àb << ab

then the Pinyin collation provides an example; it gives us

ā << a < āp << ap

Richard.
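Martin's rule can be tried directly against ICU. A minimal ICU4J sketch, assuming ICU4J is on the classpath (illustrative only; as Markus notes above, the secondary weight the builder assigns is synthetic, and forcing it below "a"'s secondary weight would violate well-formedness condition WF2):

    import com.ibm.icu.text.RuleBasedCollator;

    public class TailoringSketch {
        public static void main(String[] args) throws Exception {
            // Martin's rule, applied as a tailoring on top of the root collator.
            RuleBasedCollator coll = new RuleBasedCollator(
                    "&[before 2][first primary ignorable] << \u0300");
            // U+0300 stays primary ignorable, so it sorts before any primary
            // character no matter which secondary weight the builder gave it:
            System.out.println(coll.compare("\u0300", "a"));  // negative
            // "a" and "a\u0300" differ only at the secondary level:
            System.out.println(coll.compare("a", "a\u0300")); // negative
        }
    }

For comparisons against "a" itself, the exact synthetic secondary weight (Martin's 8000) never comes into play, because "a" already wins on its non-zero primary weight; that is the point Philippe makes above.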
From cldr-users at unicode.org Sun Aug 13 17:58:16 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Mon, 14 Aug 2017 06:58:16 +0800
Subject: common/bcp47/* not included in the json extract?
Message-ID: <89FF2776-73F4-486F-A716-94379C067266@gmail.com>

I've been working my way through building a lib to support some of CLDR for the Elixir language (https://github.com/kipcole9/cldr), based upon the json sources on GitHub (which I recognise is not the canonical form).

I notice that common/bcp47/timezone.xml doesn't appear to be converted to json, nor, seemingly, are the other files in the common/bcp47 directory. Curious if this is a design decision, or some other factor at work?

-Kip

From cldr-users at unicode.org Thu Aug 17 11:21:13 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Thu, 17 Aug 2017 16:21:13 +0000
Subject: Additional Word Break Questions
In-Reply-To:
References:
Message-ID:

Hey everyone,

Just wanted to bump this thread, since I haven't received any responses yet. For the time being I've deleted the test cases in question from my test suite, but I'd like to understand more, since I'm not convinced my implementation is correct.

-Cameron

On Tue, Aug 8, 2017 at 6:06 PM Cameron Dutro wrote:

> Dear CLDR users,
>
> As you may recall, I emailed this list a few months ago with a question
> about the word break rules, and today I've run into several more of what I
> think are disagreements between the word break rules and the published
> word break test cases.
>
> *First Issue*
>
> This is the word break test case in question:
>
> ÷ 200D ÷ 261D ÷
>
> It would appear that rule 3.3 matches at index 1, i.e. the index between
> the two characters. Rule 3.3 is:
>
> $ZWJ × ($Extended_Pict | $EmojiNRK)
>
> Character 200D has word break property values of Extend and ZWJ, while
> character 261D has a word break property value of E_Base. Therefore, the
> left-hand side of rule 3.3 matches 200D and the right-hand side matches
> 261D. Since the rule indicates no break, I'm confused by the presence of
> this test case. What am I doing wrong here?
>
> *Second Issue*
>
> The other test cases my implementation is failing to pass are these:
>
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷
>
> In all cases, the issue lies with the expected non-break between the
> second and third characters, e.g. 1F1E6 and 1F1E7. The word break property
> value of both of these characters is Regional_Indicator. The only rule
> that looks like it might match is 15:
>
> ^$Regional_Indicator × $Regional_Indicator
>
> However, rule 15 does not match.
>
> Thanks for your help in advance!
>
> -Cameron

From cldr-users at unicode.org Sat Aug 19 02:08:39 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Sat, 19 Aug 2017 17:08:39 +1000
Subject: Time format characters 'h' and 'k'
Message-ID:

Looking for some understanding of the hour format characters 'h' and 'k'. I understand how to map 'H' and 'K' from times from 00:00:00 to 23:59:59, but I'm not sure how to interpret what TR35 says about:

'h': Hour 1-12
'k': Hour 1-24

Any help appreciated!

From cldr-users at unicode.org Sat Aug 19 02:32:40 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sat, 19 Aug 2017 09:32:40 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References:
Message-ID:

http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this captures it:

     midn.              noon               midn.
h    12    1 ... 11     12    1 ... 11     12
H    0     1 ... 11     12    13 ... 23    0
K    0     1 ... 11     0     1 ... 11     0
k    24    1 ... 11     12    13 ... 23    24

Mark
(https://twitter.com/mark_e_davis)

On Sat, Aug 19, 2017 at 9:08 AM, Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:

> Looking for some understanding of the hour format characters 'h' and 'k'.
> I understand how to map 'H' and 'K' from times from 00:00:00 to 23:59:59,
> but I'm not sure how to interpret what TR35 says about:
>
> 'h': Hour 1-12
> 'k': Hour 1-24
>
> Any help appreciated!
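Mark's table can be read directly as arithmetic on a 0-23 clock value. A small illustrative Java sketch (hypothetical helper names, not from CLDR or ICU) mapping a 24-hour value to the number each pattern character displays:

    public class HourCycleSketch {
        static int h(int t) { int r = t % 12; return r == 0 ? 12 : r; } // 1..12
        static int H(int t) { return t; }                               // 0..23
        static int K(int t) { return t % 12; }                          // 0..11
        static int k(int t) { return t == 0 ? 24 : t; }                 // 1..24

        public static void main(String[] args) {
            for (int t : new int[] {0, 1, 11, 12, 13, 23})
                System.out.printf("%02d:00 -> h=%d H=%d K=%d k=%d%n",
                        t, h(t), H(t), K(t), k(t));
        }
    }

At midnight this prints h=12 H=0 K=0 k=24, matching the first column of the table above.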
From cldr-users at unicode.org Sat Aug 19 11:07:13 2017
From: cldr-users at unicode.org (Martin J. Dürst via CLDR-Users)
Date: Sun, 20 Aug 2017 01:07:13 +0900
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References:
Message-ID: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>

Hello Mark, others,

On 2017/08/19 16:32, Mark Davis ☕️ via CLDR-Users wrote:

> http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this
> captures it:
>
>      midn.              noon               midn.
> h    12    1 ... 11     12    1 ... 11     12
> H    0     1 ... 11     12    13 ... 23    0
> K    0     1 ... 11     0     1 ... 11     0
> k    24    1 ... 11     12    13 ... 23    24

I don't know the details. It looks to me as if these are the four selections that make sense (12 vs. 24 hours, and 0 vs. 1 index origin).

However, what doesn't make sense to me here is that while the distinction of origin is made by upper vs. lower case (0 origin: H, K; 1 origin: h, k), the distinction between 12h and 24h is mixed up (12 hours: h, K; 24 hours: H, k).

I wonder who came up with this, or if it is a mistake.

Regards, Martin.

From cldr-users at unicode.org Sun Aug 20 04:42:32 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Aug 2017 11:42:32 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

As I recall, one of those historical anomalies (like the surrogate range not being at the top of the BMP). The 'h' and then 'H' came first. When later we found evidence of 1..24 systems, we added 'k' (we tend to do lowercases first), and still later we found the 0..11 case, and that got 'K'. It's a bit like the narrow form being MMMMM; that was created long after the abbreviated and long forms.

Mark
(https://twitter.com/mark_e_davis)

On Sat, Aug 19, 2017 at 6:07 PM, Martin J. Dürst wrote:

> Hello Mark, others,
>
> On 2017/08/19 16:32, Mark Davis ☕️ via CLDR-Users wrote:
>
>> http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this
>> captures it:
>>
>>      midn.              noon               midn.
>> h    12    1 ... 11     12    1 ... 11     12
>> H    0     1 ... 11     12    13 ... 23    0
>> K    0     1 ... 11     0     1 ... 11     0
>> k    24    1 ... 11     12    13 ... 23    24
>
> I don't know the details. It looks to me as if these are the four
> selections that make sense (12 vs. 24 hours, and 0 vs. 1 index origin).
>
> However, what doesn't make sense to me here is that while the distinction
> of origin is made by upper vs. lower case (0 origin: H, K; 1 origin: h, k),
> the distinction between 12h and 24h is mixed up (12 hours: h, K; 24 hours:
> H, k).
>
> I wonder who came up with this, or if it is a mistake.
>
> Regards, Martin.

From cldr-users at unicode.org Sun Aug 20 06:16:27 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sun, 20 Aug 2017 13:16:27 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

2017-08-20 11:42 GMT+02:00 Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org>:

> As I recall, one of those historical anomalies (like the surrogate range
> not being at the top of the BMP).

I don't think this is an anomaly: placing the surrogates at the top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII.
Placing them at the top would have broken many more things, and UTF-8 would not have become the most useful encoding and the default now to be supported by all web standards.

Ideally the surrogates should have been at the end of the BMP (possibly leaving only a few non-characters after them, or tweaking surrogates and allocations in places so that they would not have used U+FFFE and U+FFFF, kept as reserved surrogates not used in any pair for valid codepoints). They would have sorted in binary mode and preserved the binary order between UTF-8, UTF-16 and UTF-32...

From cldr-users at unicode.org Sun Aug 20 06:41:19 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sun, 20 Aug 2017 13:41:19 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

Anyway, I think that UTF-16 (and its surrogates) will later be deprecated. Even for apps that want 16-bit code units, it is probable that another 16-bit encoding will appear, preserving the same level of compactness but simplifying the binary order. It is easy to create such an alternative while keeping the 8-bit and 32-bit encodings unchanged (UTF-8 and UTF-32).

Notably, you can create a 16-bit encoding (call it "UTF16S", with the "S" meaning "shifted") placing the surrogates at the end (in 0xF800-0xFFFF), by shifting U+E000..U+FFFF down to 0xD800..0xF7FF. As U+FFFE and U+FFFF are non-characters, they fall down to 0xF7FE..0xF7FF, still usable as special markers such as end of text, so that all 16-bit code units in 0x0000-0xF7FD would be valid (0xF7FD representing U+FFFD, i.e. the replacement character used as a possible substitute when transcoding from texts with invalid/non-matching encodings).

With this, instead of using U+0000 in strings as an end-of-string marker, we could use U+FFFF, encoded as 0xF7FF in this 16-bit encoding. The NULL control would remain encoded as 0x0000 but would no longer mark an end of string.

Another alternative would be to use 0x0000 in this encoding to represent U+FFFF, by also shifting all codepoints that are non-characters; then U+0000 would be represented by 0xF7FD (but binary order would not be preserved) or by 0x0001 (preserving the binary order of assigned characters, but all ASCII characters would be shifted up by one position in this 16-bit encoding). There would still remain the two columns of non-characters (in the Arabic compatibility block), but they could be shifted as well, just before the surrogates, in another variant that would place *all* non-characters (including surrogates) at the end of the 16-bit encoding.

2017-08-20 13:16 GMT+02:00 Philippe Verdy :

> 2017-08-20 11:42 GMT+02:00 Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org>:
>
>> As I recall, one of those historical anomalies (like the surrogate range
>> not being at the top of the BMP).
>
> I don't think this is an anomaly: placing the surrogates at the top would
> have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII.
> Placing them at the top would have broken many more things, and UTF-8
> would not have become the most useful encoding and the default now to be
> supported by all web standards.
>
> Ideally the surrogates should have been at the end of the BMP (possibly
> leaving only a few non-characters after them, or tweaking surrogates and
> allocations in places so that they would not have used U+FFFE and U+FFFF,
> kept as reserved surrogates not used in any pair for valid codepoints).
> They would have sorted in binary mode and preserved the binary order
> between UTF-8, UTF-16 and UTF-32...
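Philippe's "UTF16S" idea above is concrete enough to sketch. The following illustrative Java encoder/decoder implements the shift he describes (this is his hypothetical proposal, not any standard encoding form; all names here are invented):

    public class Utf16sSketch {
        // Encode one scalar value into "UTF16S" units: U+E000..U+FFFF shift
        // down by 0x0800, and surrogate pairs move up to 0xF800..0xFFFF, so
        // that 16-bit code unit order matches code point order.
        static char[] encode(int cp) {
            if (cp < 0xD800) return new char[] { (char) cp };
            if (cp <= 0xDFFF) throw new IllegalArgumentException("not a scalar value");
            if (cp <= 0xFFFF) return new char[] { (char) (cp - 0x0800) };
            int v = cp - 0x10000; // same pair arithmetic as UTF-16, relocated
            return new char[] { (char) (0xF800 | (v >> 10)), (char) (0xFC00 | (v & 0x3FF)) };
        }

        static int decode(char[] u) {
            int c = u[0];
            if (c < 0xD800) return c;
            if (c < 0xF800) return c + 0x0800; // the shifted U+E000..U+FFFF range
            return 0x10000 + ((c - 0xF800) << 10) + (u[1] - 0xFC00);
        }

        public static void main(String[] args) {
            // Note that U+FFFE/U+FFFF would land on 0xF7FE/0xF7FF, as Philippe says.
            for (int cp : new int[] {0x0041, 0xE000, 0xFFFD, 0x10000, 0x10FFFF}) {
                char[] u = encode(cp);
                StringBuilder units = new StringBuilder();
                for (char c : u) units.append(String.format("%04X ", (int) c));
                System.out.printf("U+%04X -> %s-> U+%04X%n", cp, units, decode(u));
            }
        }
    }

Because everything at or above U+E000 is shifted below the relocated surrogates, comparing UTF16S code units gives the same order as comparing code points, which is the property Philippe is after.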
> > Ideally the surrogates should have been at end of the BMP (possibly > leaving only a few non-characters after them or tweaking surrogates and > allcoations in places so that they would have not used U+FFFE and U+FFFF > kept as reserved surrogates not used in any pair for valid codepoints). > They would have sorted in binary mode and preserved the binary order > between UTF-8, UTF-16 and UTF-32... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Sun Aug 20 09:13:27 2017 From: cldr-users at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via CLDR-Users) Date: Sun, 20 Aug 2017 16:13:27 +0200 Subject: Time format characters 'h' and 'k' In-Reply-To: References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> Message-ID: > placing the surrogates at top would have avoided the emergence of UTF-8 with compatiblity with 7-bit US-ASCII I don't see what you're talking about at all ? and I was one of the people present at the times when all the decisions were being made. Mark (https://twitter.com/mark_e_davis) On Sun, Aug 20, 2017 at 1:16 PM, Philippe Verdy wrote: > 2017-08-20 11:42 GMT+02:00 Mark Davis ?? via CLDR-Users < > cldr-users at unicode.org>: > >> As I recall, one of those historical anomalies (like the surrogate range >> not being at the top of the BMP). >> > > I don't think this is an anomaly: placing the surrogates at top would have > avoided the emergence of UTF-8 with compatiblity with 7-bit US-ASCII. > Placing them at top would have broken many more things and UTF-8 would have > not become the most useful encoding and the default now to be supported by > all web standards. > > Ideally the surrogates should have been at end of the BMP (possibly > leaving only a few non-characters after them or tweaking surrogates and > allcoations in places so that they would have not used U+FFFE and U+FFFF > kept as reserved surrogates not used in any pair for valid codepoints). > They would have sorted in binary mode and preserved the binary order > between UTF-8, UTF-16 and UTF-32... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Mon Aug 21 05:00:44 2017 From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users) Date: Mon, 21 Aug 2017 12:00:44 +0200 Subject: Time format characters 'h' and 'k' In-Reply-To: References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> Message-ID: 2017-08-20 16:13 GMT+02:00 Mark Davis ?? : > > placing the surrogates at top would have avoided the emergence of UTF-8 > with compatiblity with 7-bit US-ASCII > > I don't see what you're talking about at all ? and I was one of the people > present at the times when all the decisions were being made. > I was replying explicitly to your own remark about the placement of surrogates in the BMP. You suggested explicitly that it is an "historical anomaly" and retrospectively think it should have been at top of it. I included your own citation: >> "As I recall, one of those historical anomalies (like the surrogate range not being at the top of the BMP)." -------------- next part -------------- An HTML attachment was scrubbed... 
From cldr-users at unicode.org Mon Aug 21 07:03:26 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Mon, 21 Aug 2017 14:03:26 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

And your reply "placing the surrogates at top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make sense to me.

Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on the development of UTF-8.

Mark
(https://twitter.com/mark_e_davis)

On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:

> 2017-08-20 16:13 GMT+02:00 Mark Davis ☕️ :
>
>>> placing the surrogates at top would have avoided the emergence of UTF-8
>>> with compatibility with 7-bit US-ASCII
>>
>> I don't see what you're talking about at all, and I was one of the
>> people present at the times when all the decisions were being made.
>
> I was replying explicitly to your own remark about the placement of
> surrogates in the BMP. You suggested explicitly that it is a "historical
> anomaly" and retrospectively think it should have been at the top of it.
> I included your own citation:
>
>>> "As I recall, one of those historical anomalies (like the surrogate
>>> range not being at the top of the BMP)."

From cldr-users at unicode.org Mon Aug 21 07:19:18 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Mon, 21 Aug 2017 14:19:18 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

But then why do you think that surrogates should have been allocated elsewhere? If surrogates had been allocated at the top, the low codepoints used by ASCII would not have been available for surrogates, and all code points would be larger; UTF-8 might still have placed ASCII at the top, in the low positions, but the conversion between codepoint numeric values and UTF-8 would have been radically different.

So what I understand now is that by your word "top" you meant F800..FFFF (the "end" of the BMP: I do agree with that), but I had understood you to mean that they should have been at the "start" (at 0000..07FF), for an unexplained, strange reason. In common sense, "top" is opposed to "bottom" and means the start, not the end, and the published charmaps also place the start of the planes at the top of the charts, not at the bottom (this also matches the normal reading order in all modern scripts)...

2017-08-21 14:03 GMT+02:00 Mark Davis ☕️ :

> And your reply "placing the surrogates at top would have avoided the
> emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make
> sense to me.
>
> Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on
> the development of UTF-8.
>
> On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:
>
>> I was replying explicitly to your own remark about the placement of
>> surrogates in the BMP. You suggested explicitly that it is a "historical
>> anomaly" and retrospectively think it should have been at the top of it.
>> I included your own citation:
>>
>>> "As I recall, one of those historical anomalies (like the surrogate
>>> range not being at the top of the BMP)."
From cldr-users at unicode.org Mon Aug 21 08:12:01 2017
From: cldr-users at unicode.org (Peter Constable via CLDR-Users)
Date: Mon, 21 Aug 2017 13:12:01 +0000
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

This is waaaayyyyy off topic.

Sent from my Windows 10 phone

From: Philippe Verdy via CLDR-Users
Sent: Monday, August 21, 2017 5:22 AM
To: mark
Cc: Kip Cole; Martin J. Dürst; cldr-users at unicode.org
Subject: Re: Time format characters 'h' and 'k'

But then why do you think that surrogates should have been allocated elsewhere? If surrogates had been allocated at the top, the low codepoints used by ASCII would not have been available for surrogates, and all code points would be larger; UTF-8 might still have placed ASCII at the top, in the low positions, but the conversion between codepoint numeric values and UTF-8 would have been radically different.

So what I understand now is that by your word "top" you meant F800..FFFF (the "end" of the BMP: I do agree with that), but I had understood you to mean that they should have been at the "start" (at 0000..07FF), for an unexplained, strange reason. In common sense, "top" is opposed to "bottom" and means the start, not the end, and the published charmaps also place the start of the planes at the top of the charts, not at the bottom (this also matches the normal reading order in all modern scripts)...

2017-08-21 14:03 GMT+02:00 Mark Davis ☕️ :

And your reply "placing the surrogates at top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make sense to me.

Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on the development of UTF-8.

Mark
(https://twitter.com/mark_e_davis)

On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:

2017-08-20 16:13 GMT+02:00 Mark Davis ☕️ :

> placing the surrogates at top would have avoided the emergence of UTF-8
> with compatibility with 7-bit US-ASCII

I don't see what you're talking about at all, and I was one of the people present at the times when all the decisions were being made.

I was replying explicitly to your own remark about the placement of surrogates in the BMP. You suggested explicitly that it is a "historical anomaly" and retrospectively think it should have been at the top of it. I included your own citation:

>> "As I recall, one of those historical anomalies (like the surrogate
>> range not being at the top of the BMP)."

From cldr-users at unicode.org Mon Aug 21 16:32:28 2017
From: cldr-users at unicode.org (Asmus Freytag via CLDR-Users)
Date: Mon, 21 Aug 2017 14:32:28 -0700
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID: <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>

An HTML attachment was scrubbed...
From cldr-users at unicode.org Mon Aug 21 17:33:57 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Tue, 22 Aug 2017 00:33:57 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>
Message-ID:

2017-08-21 23:32 GMT+02:00 Asmus Freytag via CLDR-Users <cldr-users at unicode.org>:

> On 8/21/2017 6:12 AM, Peter Constable via CLDR-Users wrote:
>
>> This is waaaayyyyy off topic.
>
> Well, it gives a nice window into Philippe's thinking. He must be viewing
> encodings as if they were printed on giant wall charts, instead of
> thinking of them numerically, where the "upper" end of a range (top)
> would be the higher value.

I'm not alone in thinking like this, given that it is the default presentation in the standard itself and in all the code charts included and officially released. Unicode is about encoding text, and it naturally adopts a text-related point of view where reading order is considered; even with Bidi, all the encoded scripts use top-to-bottom layouts by default (except historic scripts using boustrophedon, and special uses such as rotating Latin text to render it along a vertical border of a rectangular area or a very narrow strip, where it could be bottom to top).

The terms "top" and "bottom" are used in many places in the standard for presentation-related descriptions; they make no sense if you speak about numerical values, where the correct terms are "first"/"last", "lower"/"upper" or "start"/"end", which explicitly reference a precise order (unlike "top" and "bottom", which are just 2D directions). Of course, there remains the English expression "top ten" (or with any other number), but the number is required to give it the meaning "first" (in a sequence, frequently open-ended).

But the discussion was not off topic: Mark Davis did not really justify why he would have preferred the surrogates at the end of the BMP. Various decisions had to be made rapidly to accelerate the merging of efforts between Unicode and ISO. But not all of it was done coherently, even though new compatible versions of both standards were created which were both incompatible with their former initial versions. With the price of an interoperability break already paid, some errors could have been avoided.

If there are reasons why surrogates should have been at the end of the BMP, they would have been:

* to have a 16-bit binary ordering compatible with the binary ordering of UTF-8 and UTF-32
* to facilitate some algorithms or optimize some storage

These reasons are valid, but nothing prohibits creating a 16-bit encoding that can do that (at exactly the same price as UTF-16): it can be done simply by remapping the 16-bit encoding space with a bijection that swaps some value ranges. Here we speak about implementations of algorithms, not about interchange where UTF-16 is defined to be used, but that has fallen out of use (except for the Windows-specific "Unicode" APIs, which are however not really UTF-16 compliant, and in the NTFS filesystem or FAT32/exFAT with their "Unicode" extension, which are also not completely Unicode compliant).
What I mean is that the remaining allocation of surrogates in the BMP is just a waste, only kept because of UTF-16 and not even needed for data interchange, while implementations can still work internally with their own 16-bit encoding if needed, without even needing to comply strictly with what is defined in UTF-16. So what is done in NTFS or FAT32/exFAT does not matter; it just follows what is defined in these filesystems by Microsoft. But other 16-bit encodings are largely possible using other mappings (notably for Chinese, but for Latin as well).

So you continue to think this is off topic for the Unicode mailing list, when the subject was introduced (but not justified at all) by Mark Davis, who did not explain his opinion about the placement of surrogates in the BMP and used a confusing term to state it. I am convinced this was fully ON topic for Unicode (including on its mailing list, where Mark Davis started his remark), its history, and its usage and implementations.

I'd say that surrogates were a bad solution for what was not really a problem, and they should have remained in private implementations. Even UTF-16 should not have been standardized; it was not needed at all (and we would also have avoided the nightmare of ZWNBSP used as a BOM in text files, and its later disunification using another codepoint, another historical error)!
But it's still an interesting question: are UTF-16, BOM's and surrogates really useful as a normative part of the standard instead of a technical annex kept only for historic references because of its usage in the Windows API, or indirect reference from CESU-8 (which is also not part of the standard but kept as another possibilty for handling 16-bit code units without the problem of byte order. As a parenthetical side, I can use the same argument: the disunification of ZWNJ was also an historical anomaly (only because of its predominant usage as BOM's in Windows "Notepad.exe" to support UTF-16 encoded texts instead of UTF-16BE or UTF-16LE, two other related and unneeded 8-bit encodings, that are probably even worse than CESU-8 not needing these damn'ed BOMs that have polluted the usage of UTF-8). Windows clearly did not even needed these BOMs, when it could store the text encoding in NTFS or ReFS metadata (e.g. in a tiny alternate datastream, not using more storage cluster space as it would fit directly in directory entries, just like what it does for marking files downloaded from Internet from a third party domain name by also using an ADS); for FAT32 and exFAT, there's also solutions using conventional metadata folders (solution unused on MacOS for storing its structured "resource forks" with similar goals and capabilities, or used on webservers for HTTP metadata using an additional database or index file); on OS/2 and VMS there were "extended attributes" with the same goal. In my opinion there are still more (and better) ways to bijectively remap codepoints to 16-bit codeunits and UTF-16 is just one of them (though it is not strictly bijective but only surjective, causing additional problems for the reverse conversion back to codepoints) but not the best for all uses which won't like having to check exceptions everywhere (notably unpaired surrogates except at end of streams with premature truncation). -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Wed Aug 23 14:19:44 2017 From: cldr-users at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via CLDR-Users) Date: Wed, 23 Aug 2017 21:19:44 +0200 Subject: Change the subject line AND audience (was Re: Time format characters 'h' and 'k') Message-ID: Philippe, This is the wrong topic, *and wrong audience*, as has been pointed out to you. Please change both of them. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: