From cldr-users at unicode.org Tue Aug 8 20:06:37 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Tue, 8 Aug 2017 18:06:37 -0700
Subject: Additional Word Break Questions
Message-ID:

Dear CLDR users,

As you may recall, I emailed this list a few months ago with a question about the word break rules, and today I've run into several more of what I think are disagreements between the word break rules and the published word break test cases.

*First Issue*

This is the word break test case in question:

÷ 200D ÷ 261D ÷

It would appear that rule 3.3 matches at index 1, i.e. the index between the two characters. Rule 3.3 is:

$ZWJ × ($Extended_Pict | $EmojiNRK)

Character 200D has word break property values of Extend and ZWJ, while character 261D has a word break property value of E_Base. Therefore, the left-hand side of rule 3.3 matches 200D and the right-hand side matches 261D. Since the rule indicates no break, I'm confused by the presence of this test case. What am I doing wrong here?

*Second Issue*

The other test cases my implementation is failing to pass are these:

÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷

In all cases, the issue lies with the expected non-break between the second and third characters, e.g. 1F1E6 and 1F1E7. The word break property value of both of these characters is Regional_Indicator. The only rule that looks like it might match is 15:

^$Regional_Indicator × $Regional_Indicator

However, rule 15 does not match.

Thanks for your help in advance!

-Cameron
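A note on the Regional_Indicator rules discussed here: the usual reading of the RI rules (WB15/WB16 in UAX #29) is that a break is forbidden between two RI code points only when an odd number of consecutive RIs immediately precede the boundary, so flags pair up from the left. Below is a minimal, illustrative Java sketch of that reading, not CLDR's reference implementation; the property lookup is stubbed out as a boolean array.

    public class RiWordBreakSketch {
        // True when the RI rules forbid a break before position pos, given
        // which positions hold Regional_Indicator code points.
        static boolean forbidsBreakBefore(boolean[] isRI, int pos) {
            if (pos <= 0 || pos >= isRI.length) return false; // sot/eot
            if (!isRI[pos - 1] || !isRI[pos]) return false;   // rule covers RI RI only
            int run = 0; // consecutive RIs immediately before the boundary
            for (int i = pos - 1; i >= 0 && isRI[i]; i--) run++;
            return run % 2 == 1; // odd: the current flag pair is still open
        }

        public static void main(String[] args) {
            // 0061, 1F1E6, 1F1E7, 1F1E8, 0062: expect "no break" only
            // between the first two regional indicators.
            boolean[] s = {false, true, true, true, false};
            for (int pos = 1; pos < s.length; pos++)
                System.out.println(pos + ": "
                        + (forbidsBreakBefore(s, pos) ? "no break" : "break"));
        }
    }

Run against a, RI, RI, RI, b this reports a non-break only between the first two RIs, matching the first test case quoted above.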
From cldr-users at unicode.org Wed Aug 9 04:23:44 2017
From: cldr-users at unicode.org (Martin Hosken via CLDR-Users)
Date: Wed, 9 Aug 2017 16:23:44 +0700
Subject: collation tailoring using before
Message-ID: <20170809162344.340d3a69@sil-mh7>

Dear All,

I am trying to tailor (for the sake of argument) \u0300 to be primary ignorable and have a secondary collation key less than that of a primary character (a).

I tried:

&[before 2][first primary ignorable] << \u0300

But then I get CEs of this form:

a      [2900.0500.0500]
\u0300 [0000.8000.0500]

I'm wondering how I can get \u0300 [0000.0400.0500].

TIA,
Yours,
Martin

From cldr-users at unicode.org Wed Aug 9 06:52:56 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Wed, 9 Aug 2017 13:52:56 +0200
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID:

What's the problem? [0000.8000.0500] is primary ignorable (0000) and then gets a synthetic secondary key (8000) whose value will not matter relative to "a", given that "a" has a non-zero primary key and will sort properly.

Note that we know that:

&[before 2][first primary ignorable] << "a"

and your tailoring says nothing about the relative secondary ordering between "a" and \u0300 (as in mathematics: from the conditions x < u and x < a you cannot say whether u < a or a < u; it is not specified).

I suppose you want to add this constraint:

&[before 2][first primary ignorable] >> \u0300

(similar to saying x > u and x < a, which would be equivalent to u < x < a).

2017-08-09 11:23 GMT+02:00 Martin Hosken via CLDR-Users <cldr-users at unicode.org>:

> Dear All,
>
> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a primary
> character (a).
>
> I tried:
>
> &[before 2][first primary ignorable] << \u0300
>
> But then I get CEs of this form:
>
> a      [2900.0500.0500]
> \u0300 [0000.8000.0500]
>
> I'm wondering how I can get \u0300 [0000.0400.0500].
>
> TIA,
> Yours,
> Martin

From cldr-users at unicode.org Wed Aug 9 08:42:41 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Wed, 9 Aug 2017 06:42:41 -0700
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID:

On Wed, Aug 9, 2017 at 2:23 AM, Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:

> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a primary
> character (a).
> [...]
> I'm wondering how I can get \u0300 [0000.0400.0500].

You can't, if you want to build a well-formed Collation Element Table: http://www.unicode.org/reports/tr10/#WF2

markus

From cldr-users at unicode.org Thu Aug 10 10:00:14 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Thu, 10 Aug 2017 16:00:14 +0100
Subject: collation tailoring using before
In-Reply-To: <20170809162344.340d3a69@sil-mh7>
References: <20170809162344.340d3a69@sil-mh7>
Message-ID: <20170810160014.51c4f050@JRWUBU2>

On Wed, 9 Aug 2017 16:23:44 +0700 Martin Hosken via CLDR-Users wrote:

> I am trying to tailor (for the sake of argument) \u0300 to be primary
> ignorable and have a secondary collation key less than that of a
> primary character (a).
>
> I tried:
>
> &[before 2][first primary ignorable] << \u0300
>
> But then I get CEs of this form:
>
> a      [2900.0500.0500]
> \u0300 [0000.8000.0500]
>
> I'm wondering how I can get \u0300 [0000.0400.0500].

What your declared goal would result in is

a << à < àb << ab

The assumption is that no one would want this, which is why such a collation is denigrated as ill-formed. (Now, DUCET is ill-formed, though that's not why ICU doesn't support it.)

If what you want is

à << a < àb << ab

then the Pinyin collation provides an example; it gives us

ā << a < āp << ap

Richard.
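Martin's rule can be tried directly against ICU. A minimal ICU4J sketch, assuming ICU4J is on the classpath (illustrative only; as Markus notes above, the secondary weight the builder assigns is synthetic, and forcing it below "a"'s secondary weight would violate well-formedness condition WF2):

    import com.ibm.icu.text.RuleBasedCollator;

    public class TailoringSketch {
        public static void main(String[] args) throws Exception {
            // Martin's rule, applied as a tailoring on top of the root collator.
            RuleBasedCollator coll = new RuleBasedCollator(
                    "&[before 2][first primary ignorable] << \u0300");
            // U+0300 stays primary ignorable, so it sorts before any primary
            // character no matter which secondary weight the builder gave it:
            System.out.println(coll.compare("\u0300", "a"));  // negative
            // "a" and "a\u0300" differ only at the secondary level:
            System.out.println(coll.compare("a", "a\u0300")); // negative
        }
    }

For comparisons against "a" itself, the exact synthetic secondary weight (Martin's 8000) never comes into play, because "a" already wins on its non-zero primary weight; that is the point Philippe makes above.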
From cldr-users at unicode.org Sun Aug 13 17:58:16 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Mon, 14 Aug 2017 06:58:16 +0800
Subject: common/bcp47/* not included in the json extract?
Message-ID: <89FF2776-73F4-486F-A716-94379C067266@gmail.com>

I've been working my way through building a lib to support some of CLDR for the Elixir language (https://github.com/kipcole9/cldr), based upon the json sources on GitHub (which I recognise is not the canonical form).

I notice that common/bcp47/timezone.xml doesn't appear to be converted to json, nor, seemingly, are the other files in the common/bcp47 directory. Curious if this is a design decision, or some other factor at work?

-Kip

From cldr-users at unicode.org Thu Aug 17 11:21:13 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Thu, 17 Aug 2017 16:21:13 +0000
Subject: Additional Word Break Questions
In-Reply-To:
References:
Message-ID:

Hey everyone,

Just wanted to bump this thread, since I haven't received any responses yet. For the time being I've deleted the test cases in question from my test suite, but I'd like to understand more, since I'm not convinced my implementation is correct.

-Cameron

On Tue, Aug 8, 2017 at 6:06 PM Cameron Dutro wrote:

> Dear CLDR users,
>
> As you may recall, I emailed this list a few months ago with a question
> about the word break rules, and today I've run into several more of what I
> think are disagreements between the word break rules and the published
> word break test cases.
>
> *First Issue*
>
> This is the word break test case in question:
>
> ÷ 200D ÷ 261D ÷
>
> It would appear that rule 3.3 matches at index 1, i.e. the index between
> the two characters. Rule 3.3 is:
>
> $ZWJ × ($Extended_Pict | $EmojiNRK)
>
> Character 200D has word break property values of Extend and ZWJ, while
> character 261D has a word break property value of E_Base. Therefore, the
> left-hand side of rule 3.3 matches 200D and the right-hand side matches
> 261D. Since the rule indicates no break, I'm confused by the presence of
> this test case. What am I doing wrong here?
>
> *Second Issue*
>
> The other test cases my implementation is failing to pass are these:
>
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
> ÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷
>
> In all cases, the issue lies with the expected non-break between the
> second and third characters, e.g. 1F1E6 and 1F1E7. The word break property
> value of both of these characters is Regional_Indicator. The only rule
> that looks like it might match is 15:
>
> ^$Regional_Indicator × $Regional_Indicator
>
> However, rule 15 does not match.
>
> Thanks for your help in advance!
>
> -Cameron

From cldr-users at unicode.org Sat Aug 19 02:08:39 2017
From: cldr-users at unicode.org (Kip Cole via CLDR-Users)
Date: Sat, 19 Aug 2017 17:08:39 +1000
Subject: Time format characters 'h' and 'k'
Message-ID:

Looking for some understanding of the hour format characters 'h' and 'k'. I understand how to map 'H' and 'K' from times from 00:00:00 to 23:59:59, but I'm not sure how to interpret what TR35 says about:

'h': Hour 1-12
'k': Hour 1-24

Any help appreciated!

From cldr-users at unicode.org Sat Aug 19 02:32:40 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sat, 19 Aug 2017 09:32:40 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References:
Message-ID:

http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this captures it:

     midn.              noon               midn.
h    12    1 ... 11     12    1 ... 11     12
H    0     1 ... 11     12    13 ... 23    0
K    0     1 ... 11     0     1 ... 11     0
k    24    1 ... 11     12    13 ... 23    24

Mark
(https://twitter.com/mark_e_davis)

On Sat, Aug 19, 2017 at 9:08 AM, Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:

> Looking for some understanding of the hour format characters 'h' and 'k'.
> I understand how to map 'H' and 'K' from times from 00:00:00 to 23:59:59,
> but I'm not sure how to interpret what TR35 says about:
>
> 'h': Hour 1-12
> 'k': Hour 1-24
>
> Any help appreciated!
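Mark's table can be read directly as arithmetic on a 0-23 clock value. A small illustrative Java sketch (hypothetical helper names, not from CLDR or ICU) mapping a 24-hour value to the number each pattern character displays:

    public class HourCycleSketch {
        static int h(int t) { int r = t % 12; return r == 0 ? 12 : r; } // 1..12
        static int H(int t) { return t; }                               // 0..23
        static int K(int t) { return t % 12; }                          // 0..11
        static int k(int t) { return t == 0 ? 24 : t; }                 // 1..24

        public static void main(String[] args) {
            for (int t : new int[] {0, 1, 11, 12, 13, 23})
                System.out.printf("%02d:00 -> h=%d H=%d K=%d k=%d%n",
                        t, h(t), H(t), K(t), k(t));
        }
    }

At midnight this prints h=12 H=0 K=0 k=24, matching the first column of the table above.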
From cldr-users at unicode.org Sat Aug 19 11:07:13 2017
From: cldr-users at unicode.org (Martin J. Dürst via CLDR-Users)
Date: Sun, 20 Aug 2017 01:07:13 +0900
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References:
Message-ID: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>

Hello Mark, others,

On 2017/08/19 16:32, Mark Davis ☕️ via CLDR-Users wrote:

> http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this
> captures it:
>
>      midn.              noon               midn.
> h    12    1 ... 11     12    1 ... 11     12
> H    0     1 ... 11     12    13 ... 23    0
> K    0     1 ... 11     0     1 ... 11     0
> k    24    1 ... 11     12    13 ... 23    24

I don't know the details. It looks to me as if these are the four selections that make sense (12 vs. 24 hours, and 0 vs. 1 index origin).

However, what doesn't make sense to me here is that while the distinction of origin is made by upper vs. lower case (0 origin: H, K; 1 origin: h, k), the distinction between 12h and 24h is mixed up (12 hours: h, K; 24 hours: H, k).

I wonder who came up with this, or if it is a mistake.

Regards, Martin.

From cldr-users at unicode.org Sun Aug 20 04:42:32 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Sun, 20 Aug 2017 11:42:32 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

As I recall, one of those historical anomalies (like the surrogate range not being at the top of the BMP). The 'h' and then 'H' came first. When later we found evidence of 1..24 systems, we added 'k' (we tend to do lowercases first), and still later we found the 0..11 case, and that got 'K'. It's a bit like the narrow form being MMMMM; that was created long after the abbreviated and long forms.

Mark
(https://twitter.com/mark_e_davis)

On Sat, Aug 19, 2017 at 6:07 PM, Martin J. Dürst wrote:

> Hello Mark, others,
>
> On 2017/08/19 16:32, Mark Davis ☕️ via CLDR-Users wrote:
>
>> http://unicode.org/reports/tr35/tr35-dates.html#dfst-hour. I think this
>> captures it:
>>
>>      midn.              noon               midn.
>> h    12    1 ... 11     12    1 ... 11     12
>> H    0     1 ... 11     12    13 ... 23    0
>> K    0     1 ... 11     0     1 ... 11     0
>> k    24    1 ... 11     12    13 ... 23    24
>
> I don't know the details. It looks to me as if these are the four
> selections that make sense (12 vs. 24 hours, and 0 vs. 1 index origin).
>
> However, what doesn't make sense to me here is that while the distinction
> of origin is made by upper vs. lower case (0 origin: H, K; 1 origin: h, k),
> the distinction between 12h and 24h is mixed up (12 hours: h, K; 24 hours:
> H, k).
>
> I wonder who came up with this, or if it is a mistake.
>
> Regards, Martin.

From cldr-users at unicode.org Sun Aug 20 06:16:27 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sun, 20 Aug 2017 13:16:27 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

2017-08-20 11:42 GMT+02:00 Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org>:

> As I recall, one of those historical anomalies (like the surrogate range
> not being at the top of the BMP).

I don't think this is an anomaly: placing the surrogates at the top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII.
Placing them at the top would have broken many more things, and UTF-8 would not have become the most useful encoding and the default now to be supported by all web standards.

Ideally the surrogates should have been at the end of the BMP (possibly leaving only a few non-characters after them, or tweaking surrogates and allocations in places so that they would not have used U+FFFE and U+FFFF, kept as reserved surrogates not used in any pair for valid codepoints). They would have sorted in binary mode and preserved the binary order between UTF-8, UTF-16 and UTF-32...

From cldr-users at unicode.org Sun Aug 20 06:41:19 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Sun, 20 Aug 2017 13:41:19 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

Anyway, I think that UTF-16 (and its surrogates) will later be deprecated. Even for apps that want 16-bit code units, it is probable that another 16-bit encoding will appear, preserving the same level of compactness but simplifying the binary order. It is easy to create such an alternative while keeping the 8-bit and 32-bit encodings unchanged (UTF-8 and UTF-32).

Notably, you can create a 16-bit encoding (call it "UTF16S", with the "S" meaning "shifted") placing the surrogates at the end (in 0xF800-0xFFFF), by shifting U+E000..U+FFFF down to 0xD800..0xF7FF. As U+FFFE and U+FFFF are non-characters, they fall down to 0xF7FE..0xF7FF, still usable as special markers such as end of text, so that all 16-bit code units in 0x0000-0xF7FD would be valid (0xF7FD representing U+FFFD, i.e. the replacement character used as a possible substitute when transcoding from texts with invalid/non-matching encodings).

With this, instead of using U+0000 in strings as an end-of-string marker, we could use U+FFFF, encoded as 0xF7FF in this 16-bit encoding. The NULL control would remain encoded as 0x0000 but would no longer mark an end of string.

Another alternative would be to use 0x0000 in this encoding to represent U+FFFF, by also shifting all codepoints that are non-characters; then U+0000 would be represented by 0xF7FD (but binary order would not be preserved) or by 0x0001 (preserving the binary order of assigned characters, but all ASCII characters would be shifted up by one position in this 16-bit encoding). There would still remain the two columns of non-characters (in the Arabic compatibility block), but they could be shifted as well, just before the surrogates, in another variant that would place *all* non-characters (including surrogates) at the end of the 16-bit encoding.

2017-08-20 13:16 GMT+02:00 Philippe Verdy :

> 2017-08-20 11:42 GMT+02:00 Mark Davis ☕️ via CLDR-Users <cldr-users at unicode.org>:
>
>> As I recall, one of those historical anomalies (like the surrogate range
>> not being at the top of the BMP).
>
> I don't think this is an anomaly: placing the surrogates at the top would
> have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII.
> Placing them at the top would have broken many more things, and UTF-8
> would not have become the most useful encoding and the default now to be
> supported by all web standards.
>
> Ideally the surrogates should have been at the end of the BMP (possibly
> leaving only a few non-characters after them, or tweaking surrogates and
> allocations in places so that they would not have used U+FFFE and U+FFFF,
> kept as reserved surrogates not used in any pair for valid codepoints).
> They would have sorted in binary mode and preserved the binary order
> between UTF-8, UTF-16 and UTF-32...
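Philippe's "UTF16S" idea above is concrete enough to sketch. The following illustrative Java encoder/decoder implements the shift he describes (this is his hypothetical proposal, not any standard encoding form; all names here are invented):

    public class Utf16sSketch {
        // Encode one scalar value into "UTF16S" units: U+E000..U+FFFF shift
        // down by 0x0800, and surrogate pairs move up to 0xF800..0xFFFF, so
        // that 16-bit code unit order matches code point order.
        static char[] encode(int cp) {
            if (cp < 0xD800) return new char[] { (char) cp };
            if (cp <= 0xDFFF) throw new IllegalArgumentException("not a scalar value");
            if (cp <= 0xFFFF) return new char[] { (char) (cp - 0x0800) };
            int v = cp - 0x10000; // same pair arithmetic as UTF-16, relocated
            return new char[] { (char) (0xF800 | (v >> 10)), (char) (0xFC00 | (v & 0x3FF)) };
        }

        static int decode(char[] u) {
            int c = u[0];
            if (c < 0xD800) return c;
            if (c < 0xF800) return c + 0x0800; // the shifted U+E000..U+FFFF range
            return 0x10000 + ((c - 0xF800) << 10) + (u[1] - 0xFC00);
        }

        public static void main(String[] args) {
            // Note that U+FFFE/U+FFFF would land on 0xF7FE/0xF7FF, as Philippe says.
            for (int cp : new int[] {0x0041, 0xE000, 0xFFFD, 0x10000, 0x10FFFF}) {
                char[] u = encode(cp);
                StringBuilder units = new StringBuilder();
                for (char c : u) units.append(String.format("%04X ", (int) c));
                System.out.printf("U+%04X -> %s-> U+%04X%n", cp, units, decode(u));
            }
        }
    }

Because everything at or above U+E000 is shifted below the relocated surrogates, comparing UTF16S code units gives the same order as comparing code points, which is the property Philippe is after.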
> > Ideally the surrogates should have been at end of the BMP (possibly > leaving only a few non-characters after them or tweaking surrogates and > allcoations in places so that they would have not used U+FFFE and U+FFFF > kept as reserved surrogates not used in any pair for valid codepoints). > They would have sorted in binary mode and preserved the binary order > between UTF-8, UTF-16 and UTF-32... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Sun Aug 20 09:13:27 2017 From: cldr-users at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via CLDR-Users) Date: Sun, 20 Aug 2017 16:13:27 +0200 Subject: Time format characters 'h' and 'k' In-Reply-To: References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> Message-ID: > placing the surrogates at top would have avoided the emergence of UTF-8 with compatiblity with 7-bit US-ASCII I don't see what you're talking about at all ? and I was one of the people present at the times when all the decisions were being made. Mark (https://twitter.com/mark_e_davis) On Sun, Aug 20, 2017 at 1:16 PM, Philippe Verdy wrote: > 2017-08-20 11:42 GMT+02:00 Mark Davis ?? via CLDR-Users < > cldr-users at unicode.org>: > >> As I recall, one of those historical anomalies (like the surrogate range >> not being at the top of the BMP). >> > > I don't think this is an anomaly: placing the surrogates at top would have > avoided the emergence of UTF-8 with compatiblity with 7-bit US-ASCII. > Placing them at top would have broken many more things and UTF-8 would have > not become the most useful encoding and the default now to be supported by > all web standards. > > Ideally the surrogates should have been at end of the BMP (possibly > leaving only a few non-characters after them or tweaking surrogates and > allcoations in places so that they would have not used U+FFFE and U+FFFF > kept as reserved surrogates not used in any pair for valid codepoints). > They would have sorted in binary mode and preserved the binary order > between UTF-8, UTF-16 and UTF-32... > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Mon Aug 21 05:00:44 2017 From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users) Date: Mon, 21 Aug 2017 12:00:44 +0200 Subject: Time format characters 'h' and 'k' In-Reply-To: References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> Message-ID: 2017-08-20 16:13 GMT+02:00 Mark Davis ?? : > > placing the surrogates at top would have avoided the emergence of UTF-8 > with compatiblity with 7-bit US-ASCII > > I don't see what you're talking about at all ? and I was one of the people > present at the times when all the decisions were being made. > I was replying explicitly to your own remark about the placement of surrogates in the BMP. You suggested explicitly that it is an "historical anomaly" and retrospectively think it should have been at top of it. I included your own citation: >> "As I recall, one of those historical anomalies (like the surrogate range not being at the top of the BMP)." -------------- next part -------------- An HTML attachment was scrubbed... 
From cldr-users at unicode.org Mon Aug 21 07:03:26 2017
From: cldr-users at unicode.org (Mark Davis ☕️ via CLDR-Users)
Date: Mon, 21 Aug 2017 14:03:26 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

And your reply "placing the surrogates at top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make sense to me.

Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on the development of UTF-8.

Mark
(https://twitter.com/mark_e_davis)

On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:

> 2017-08-20 16:13 GMT+02:00 Mark Davis ☕️ :
>
>>> placing the surrogates at top would have avoided the emergence of UTF-8
>>> with compatibility with 7-bit US-ASCII
>>
>> I don't see what you're talking about at all, and I was one of the
>> people present at the times when all the decisions were being made.
>
> I was replying explicitly to your own remark about the placement of
> surrogates in the BMP. You suggested explicitly that it is a "historical
> anomaly" and retrospectively think it should have been at the top of it.
> I included your own citation:
>
>>> "As I recall, one of those historical anomalies (like the surrogate
>>> range not being at the top of the BMP)."

From cldr-users at unicode.org Mon Aug 21 07:19:18 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Mon, 21 Aug 2017 14:19:18 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

But then why do you think that surrogates should have been allocated elsewhere? If surrogates had been allocated at the top, the low codepoints used by ASCII would not have been available for surrogates, and all code points would be larger; UTF-8 might still have placed ASCII at the top, in the low positions, but the conversion between codepoint numeric values and UTF-8 would have been radically different.

So what I understand now is that by your word "top" you meant F800..FFFF (the "end" of the BMP: I do agree with that), but I had understood you to mean that they should have been at the "start" (at 0000..07FF), for an unexplained, strange reason. In common sense, "top" is opposed to "bottom" and means the start, not the end, and the published charmaps also place the start of the planes at the top of the charts, not at the bottom (this also matches the normal reading order in all modern scripts)...

2017-08-21 14:03 GMT+02:00 Mark Davis ☕️ :

> And your reply "placing the surrogates at top would have avoided the
> emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make
> sense to me.
>
> Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on
> the development of UTF-8.
>
> On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:
>
>> I was replying explicitly to your own remark about the placement of
>> surrogates in the BMP. You suggested explicitly that it is a "historical
>> anomaly" and retrospectively think it should have been at the top of it.
>> I included your own citation:
>>
>>> "As I recall, one of those historical anomalies (like the surrogate
>>> range not being at the top of the BMP)."
From cldr-users at unicode.org Mon Aug 21 08:12:01 2017
From: cldr-users at unicode.org (Peter Constable via CLDR-Users)
Date: Mon, 21 Aug 2017 13:12:01 +0000
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID:

This is waaaayyyyy off topic.

Sent from my Windows 10 phone

From: Philippe Verdy via CLDR-Users
Sent: Monday, August 21, 2017 5:22 AM
To: mark
Cc: Kip Cole; Martin J. Dürst; cldr-users at unicode.org
Subject: Re: Time format characters 'h' and 'k'

But then why do you think that surrogates should have been allocated elsewhere? If surrogates had been allocated at the top, the low codepoints used by ASCII would not have been available for surrogates, and all code points would be larger; UTF-8 might still have placed ASCII at the top, in the low positions, but the conversion between codepoint numeric values and UTF-8 would have been radically different.

So what I understand now is that by your word "top" you meant F800..FFFF (the "end" of the BMP: I do agree with that), but I had understood you to mean that they should have been at the "start" (at 0000..07FF), for an unexplained, strange reason. In common sense, "top" is opposed to "bottom" and means the start, not the end, and the published charmaps also place the start of the planes at the top of the charts, not at the bottom (this also matches the normal reading order in all modern scripts)...

2017-08-21 14:03 GMT+02:00 Mark Davis ☕️ :

And your reply "placing the surrogates at top would have avoided the emergence of UTF-8 with compatibility with 7-bit US-ASCII" doesn't make sense to me.

Having the surrogate zone at D800..DFFF vs. F800..FFFF had no effect on the development of UTF-8.

Mark
(https://twitter.com/mark_e_davis)

On Mon, Aug 21, 2017 at 12:00 PM, Philippe Verdy wrote:

2017-08-20 16:13 GMT+02:00 Mark Davis ☕️ :

> placing the surrogates at top would have avoided the emergence of UTF-8
> with compatibility with 7-bit US-ASCII

I don't see what you're talking about at all, and I was one of the people present at the times when all the decisions were being made.

I was replying explicitly to your own remark about the placement of surrogates in the BMP. You suggested explicitly that it is a "historical anomaly" and retrospectively think it should have been at the top of it. I included your own citation:

>> "As I recall, one of those historical anomalies (like the surrogate
>> range not being at the top of the BMP)."

From cldr-users at unicode.org Mon Aug 21 16:32:28 2017
From: cldr-users at unicode.org (Asmus Freytag via CLDR-Users)
Date: Mon, 21 Aug 2017 14:32:28 -0700
Subject: Time format characters 'h' and 'k'
In-Reply-To:
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp>
Message-ID: <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>

An HTML attachment was scrubbed...
From cldr-users at unicode.org Mon Aug 21 17:33:57 2017
From: cldr-users at unicode.org (Philippe Verdy via CLDR-Users)
Date: Tue, 22 Aug 2017 00:33:57 +0200
Subject: Time format characters 'h' and 'k'
In-Reply-To: <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>
References: <32319fad-3d73-5211-8579-9b113b4a0d3d@it.aoyama.ac.jp> <9e87da45-83e8-914a-13bf-794d28bd4669@ix.netcom.com>
Message-ID:

2017-08-21 23:32 GMT+02:00 Asmus Freytag via CLDR-Users <cldr-users at unicode.org>:

> On 8/21/2017 6:12 AM, Peter Constable via CLDR-Users wrote:
>
>> This is waaaayyyyy off topic.
>
> Well, it gives a nice window into Philippe's thinking. He must be viewing
> encodings as if they were printed on giant wall charts, instead of
> thinking of them numerically, where the "upper" end of a range (top)
> would be the higher value.

I'm not alone in thinking like this, given that it is the default presentation in the standard itself and in all the code charts included and officially released. Unicode is about encoding text, and it naturally adopts a text-related point of view where reading order is considered; even with Bidi, all the encoded scripts use top-to-bottom layouts by default (except historic scripts using boustrophedon, and special uses such as rotating Latin text to render it along a vertical border of a rectangular area or a very narrow strip, where it could be bottom to top).

The terms "top" and "bottom" are used in many places in the standard for presentation-related descriptions; they make no sense if you speak about numerical values, where the correct terms are "first"/"last", "lower"/"upper" or "start"/"end", which explicitly reference a precise order (unlike "top" and "bottom", which are just 2D directions). Of course, there remains the English expression "top ten" (or with any other number), but the number is required to give it the meaning "first" (in a sequence, frequently open-ended).

But the discussion was not off topic: Mark Davis did not really justify why he would have preferred the surrogates at the end of the BMP. Various decisions had to be made rapidly to accelerate the merging of efforts between Unicode and ISO. But not all of it was done coherently, even though new compatible versions of both standards were created which were both incompatible with their former initial versions. With the price of an interoperability break already paid, some errors could have been avoided.

If there are reasons why surrogates should have been at the end of the BMP, they would have been:

* to have a 16-bit binary ordering compatible with the binary ordering of UTF-8 and UTF-32
* to facilitate some algorithms or optimize some storage

These reasons are valid, but nothing prohibits creating a 16-bit encoding that can do that (at exactly the same price as UTF-16): it can be done simply by remapping the 16-bit encoding space with a bijection that swaps some value ranges. Here we speak about implementations of algorithms, not about interchange where UTF-16 is defined to be used, but that has fallen out of use (except for the Windows-specific "Unicode" APIs, which are however not really UTF-16 compliant, and in the NTFS filesystem or FAT32/exFAT with their "Unicode" extension, which are also not completely Unicode compliant).
What I mean is that the remaining allocation of surrogates in the BMP is just a waste, only kept because of UTF-16 and not even needed for data interchange, while implementations can still work internally with their own 16-bit encoding if needed, without even needing to comply strictly with what is defined in UTF-16. So what is done in NTFS or FAT32/exFAT does not matter; it just follows what is defined in these filesystems by Microsoft. But other 16-bit encodings are largely possible using other mappings (notably for Chinese, but for Latin as well).

So you continue to think this is off topic for the Unicode mailing list, when the subject was introduced (but not justified at all) by Mark Davis, who did not explain his opinion about the placement of surrogates in the BMP and used a confusing term to state it. I am convinced this was fully ON topic for Unicode (including on its mailing list, where Mark Davis started his remark), its history, and its usage and implementations.

I'd say that surrogates were a bad solution for what was not really a problem, and they should have remained in private implementations. Even UTF-16 should not have been standardized; it was not needed at all (and we would also have avoided the nightmare of ZWNBSP used as a BOM in text files, and its later disunification using another codepoint, another historical error)!
But it's still an interesting question: are UTF-16, BOM's and surrogates really useful as a normative part of the standard instead of a technical annex kept only for historic references because of its usage in the Windows API, or indirect reference from CESU-8 (which is also not part of the standard but kept as another possibilty for handling 16-bit code units without the problem of byte order. As a parenthetical side, I can use the same argument: the disunification of ZWNJ was also an historical anomaly (only because of its predominant usage as BOM's in Windows "Notepad.exe" to support UTF-16 encoded texts instead of UTF-16BE or UTF-16LE, two other related and unneeded 8-bit encodings, that are probably even worse than CESU-8 not needing these damn'ed BOMs that have polluted the usage of UTF-8). Windows clearly did not even needed these BOMs, when it could store the text encoding in NTFS or ReFS metadata (e.g. in a tiny alternate datastream, not using more storage cluster space as it would fit directly in directory entries, just like what it does for marking files downloaded from Internet from a third party domain name by also using an ADS); for FAT32 and exFAT, there's also solutions using conventional metadata folders (solution unused on MacOS for storing its structured "resource forks" with similar goals and capabilities, or used on webservers for HTTP metadata using an additional database or index file); on OS/2 and VMS there were "extended attributes" with the same goal. In my opinion there are still more (and better) ways to bijectively remap codepoints to 16-bit codeunits and UTF-16 is just one of them (though it is not strictly bijective but only surjective, causing additional problems for the reverse conversion back to codepoints) but not the best for all uses which won't like having to check exceptions everywhere (notably unpaired surrogates except at end of streams with premature truncation). -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Wed Aug 23 14:19:44 2017 From: cldr-users at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via CLDR-Users) Date: Wed, 23 Aug 2017 21:19:44 +0200 Subject: Change the subject line AND audience (was Re: Time format characters 'h' and 'k') Message-ID: Philippe, This is the wrong topic, *and wrong audience*, as has been pointed out to you. Please change both of them. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: