From unicode at unicode.org Mon Jan 1 01:54:29 2018 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Mon, 1 Jan 2018 13:24:29 +0530 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) Message-ID: In UAX 29, the GB10 rule[1] (and the WB14 rule[2]) states that we should not break before E_modifier characters in case it is after an emoji base (with optional Extend characters in between) Given that the spec is allowed to ignore degenerates, is there any value lost by merging E_Modifier and Extend into a single category? This means we can completely get rid of the Emoji_Base category, and the EBG category gets merged with GAZ. sounds very much like a degenerate case to me. also feels rather degenerate. There are only three GAZes (heart (U+2764), kiss (U+1F48B), speech bubble (U+1F5E8)) and I can't see why you'd end up with a skin tone modifier on them except by accident. (Unless we plan to support lip colors or something but in that case the kiss emoji would switch to EBG anyway) Thanks, -Manish [1]: http://www.unicode.org/reports/tr29/#GB10 [2]: http://www.unicode.org/reports/tr29/#WB14 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 1 02:06:27 2018 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Mon, 1 Jan 2018 08:06:27 +0000 Subject: Popular wordprocessors treating U+00A0 as fixed-width In-Reply-To: References: Message-ID: May we all please keep this discussion civil. People, being human, may sometimes make mistakes, but that does not necessarily justify calling them names. Best Regards, Jonathan Rosenne From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy via Unicode Sent: Monday, January 01, 2018 5:43 AM To: Shriramana Sharma Cc: UnicoDe List Subject: Re: Popular wordprocessors treating U+00A0 as fixed-width Well it's unfortunate that Microsoft's own response (by its MSVP) is completely wrong, suggesting to use Narrow non-breaking space to get justification, which is exactly the reverse where these NNBSP should NOT be justified and keep their width. Microsoft's developers have absolutely misunderstood the standard where both SPACE and NBSP should really behave the same for justification (being different only for the existence of the break opportunity). This Microsoft response is completrrely supid, and it even breaks the classic typography for French use of NNBSP ("fine" in French) around some punctuations (before :;!?? or after ?) and as group separators in numbers (note that NNBSP was introduced in Unicode very late in the standard (and before that NBSP was used only because this was the only non-breaking space available but it was much too large!) Still many documents use NBSP instead of NNBSP around punctuations or as group separators (but in Word these contextual occurences of NBSP which are easy to detect, could have been autoreplaced when typesetting, or proposed as a correction in the integrated speller, at least for French). But the old behavior of old versions of Office (before NNBSP existed in Unicode) should have been cleaned up since long. It's clear that MS Office developers don't know the standards and do what they want (they also don't know the correct standards for maths in Excel and use a lot of very stupid assumptions, as if they were smarter than their users that suffer since long from these bugs !) and don't want to fix their past errors. 
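To make the width question concrete, here is a minimal sketch, in Python, of the UTS #14 guidance quoted just below. The classification sets are transcribed from that guidance only; the function name and example characters are illustrative and do not reflect any particular word processor's implementation.

# Which space characters participate in justification, per the UTS #14
# guidance quoted below: U+0020 and U+00A0 may be compressed; U+0020,
# U+00A0 and occasionally U+2009 may be expanded; everything else,
# including U+202F NARROW NO-BREAK SPACE, normally keeps a fixed width.
COMPRESSIBLE = {"\u0020", "\u00A0"}
EXPANDABLE = {"\u0020", "\u00A0", "\u2009"}

def justification_behavior(ch: str) -> str:
    if ch in COMPRESSIBLE:
        return "stretches and shrinks with interword spacing"
    if ch in EXPANDABLE:
        return "may occasionally be expanded, never compressed"
    return "fixed width"

if __name__ == "__main__":
    for cp in ("\u0020", "\u00A0", "\u202F", "\u2009"):
        print(f"U+{ord(cp):04X}: {justification_behavior(cp)}")
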
2018-01-01 3:14 GMT+01:00 Shriramana Sharma via Unicode >: While http://unicode.org/reports/tr14/ clearly states that: When expanding or compressing interword space according to common typographical practice, only the spaces marked by U+0020 SPACE and U+00A0 NO-BREAK SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. ? really sad to see the misunderstanding around U+00A0: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 https://bugs.documentfoundation.org/show_bug.cgi?id=41652 -- Shriramana Sharma ???????????? ???????????? ???????????????????????? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 1 04:32:47 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 1 Jan 2018 11:32:47 +0100 Subject: Popular wordprocessors treating U+00A0 as fixed-width In-Reply-To: References: Message-ID: I do not call them by names, what I call is their reply, even when people explain them, and when they even suggest something else which is obviously wrong (and in fact absolutely not needed in Office which offers another way using styles for controling linebreaks without having to change the encoded character (a Word document has never been plain text, so I wonder why they even speak about compatibility by breaking another compatibility rule as a pseudo-workaround). 2018-01-01 9:06 GMT+01:00 Jonathan Rosenne via Unicode : > May we all please keep this discussion civil. People, being human, may > sometimes make mistakes, but that does not necessarily justify calling them > names. > > > > Best Regards, > > > > Jonathan Rosenne > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Philippe > Verdy via Unicode > *Sent:* Monday, January 01, 2018 5:43 AM > *To:* Shriramana Sharma > *Cc:* UnicoDe List > *Subject:* Re: Popular wordprocessors treating U+00A0 as fixed-width > > > > Well it's unfortunate that Microsoft's own response (by its MSVP) is > completely wrong, suggesting to use Narrow non-breaking space to get > justification, which is exactly the reverse where these NNBSP should NOT be > justified and keep their width. > > > > Microsoft's developers have absolutely misunderstood the standard where > both SPACE and NBSP should really behave the same for justification (being > different only for the existence of the break opportunity). > > > > This Microsoft response is completrrely supid, and it even breaks the > classic typography for French use of NNBSP ("fine" in French) around some > punctuations (before :;!?? or after ?) and as group separators in numbers > (note that NNBSP was introduced in Unicode very late in the standard (and > before that NBSP was used only because this was the only non-breaking space > available but it was much too large!) > > > > Still many documents use NBSP instead of NNBSP around punctuations or as > group separators (but in Word these contextual occurences of NBSP which are > easy to detect, could have been autoreplaced when typesetting, or proposed > as a correction in the integrated speller, at least for French). But the > old behavior of old versions of Office (before NNBSP existed in Unicode) > should have been cleaned up since long. 
> > > > It's clear that MS Office developers don't know the standards and do what > they want (they also don't know the correct standards for maths in Excel > and use a lot of very stupid assumptions, as if they were smarter than > their users that suffer since long from these bugs !) and don't want to fix > their past errors. > > > > 2018-01-01 3:14 GMT+01:00 Shriramana Sharma via Unicode < > unicode at unicode.org>: > > While http://unicode.org/reports/tr14/ clearly states that: > > > When expanding or compressing interword space according to common > typographical practice, only the spaces marked by U+0020 SPACE and > U+00A0 NO-BREAK SPACE are subject to compression, and only spaces > marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces > marked by U+2009 THIN SPACE are subject to expansion. All other space > characters normally have fixed width. > > > ? really sad to see the misunderstanding around U+00A0: > > https://answers.microsoft.com/en-us/msoffice/forum/msoffice_ > word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/ > 4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 > > https://bugs.documentfoundation.org/show_bug.cgi?id=41652 > > -- > Shriramana Sharma ???????????? ???????????? ???????????????????????? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 1 08:52:20 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 1 Jan 2018 14:52:20 +0000 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: Message-ID: <20180101145220.7334ba83@JRWUBU2> On Mon, 1 Jan 2018 13:24:29 +0530 Manish Goregaokar via Unicode wrote: > sounds very much like a > degenerate case to me. Generally yes, but I'm not sure that they'd be inappropriate for Egyptian hieroglyphs showing human beings. The choice of determinative can convey unpronounceable semantic information, though I'm not sure that that can be as sensitive as skin colour. However, in such a case it would also be appropriate to give a skin tone modifier the property Extend. Richard. From unicode at unicode.org Mon Jan 1 09:47:59 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 1 Jan 2018 16:47:59 +0100 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: <20180101145220.7334ba83@JRWUBU2> References: <20180101145220.7334ba83@JRWUBU2> Message-ID: This is an interesting suggestion, Manish. is a degenerate case, so if we following your suggestion we also could drop E_Base and E_Modifier, and rule GB10. Instead, we'd add one line to *Extend :* OLD Grapheme_Extend = Yes *and not* GCB = Virama NEW Grapheme_Extend = Yes, or Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [UTS51 ]. *and not* GCB = Virama Note: we are already planning to get rid of the GAZ/EBG distinction ( http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. Mark On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Mon, 1 Jan 2018 13:24:29 +0530 > Manish Goregaokar via Unicode wrote: > > > sounds very much like a > > degenerate case to me. > > Generally yes, but I'm not sure that they'd be inappropriate for > Egyptian hieroglyphs showing human beings. The choice of determinative > can convey unpronounceable semantic information, though I'm not sure > that that can be as sensitive as skin colour. 
However, in such a case > it would also be appropriate to give a skin tone modifier the property > Extend. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 2 03:21:37 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 2 Jan 2018 01:21:37 -0800 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: <20180101145220.7334ba83@JRWUBU2> References: <20180101145220.7334ba83@JRWUBU2> Message-ID: <9b4df777-189c-287a-fdfd-d9bf4e750d0e@ix.netcom.com> On 1/1/2018 6:52 AM, Richard Wordingham via Unicode wrote: > On Mon, 1 Jan 2018 13:24:29 +0530 > Manish Goregaokar via Unicode wrote: > >> sounds very much like a >> degenerate case to me. > Generally yes, but I'm not sure that they'd be inappropriate for > Egyptian hieroglyphs showing human beings. The choice of determinative > can convey unpronounceable semantic information, though I'm not sure > that that can be as sensitive as skin colour. However, in such a case > it would also be appropriate to give a skin tone modifier the property > Extend. They would be inappropriate because it's not part of the hieroglyphic writing system to make those distinctions. "Over expressiveness" is sometimes a problem rather than a feature when it comes to Unicode. A./ > > Richard. > From unicode at unicode.org Tue Jan 2 03:32:27 2018 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Tue, 2 Jan 2018 15:02:27 +0530 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: <20180101145220.7334ba83@JRWUBU2> Message-ID: > Note: we are already planning to get rid of the GAZ/EBG distinction ( http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. This is great! I hadn't noticed this when I last saw that draft (I was focusing on the Virama stuff). Good to know! > Instead, we'd add one line to *Extend :* Yeah, this is essentially what I was hoping we could do. Is there any way to formally propose this? Or is bringing it up here good enough? Thanks, -Manish On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ?? via Unicode < unicode at unicode.org> wrote: > This is an interesting suggestion, Manish. > > is a degenerate case, so if we > following your suggestion we also could drop E_Base and E_Modifier, and > rule GB10. > > Instead, we'd add one line to *Extend > :* > > OLD > Grapheme_Extend = Yes > *and not* GCB = Virama > > NEW > Grapheme_Extend = Yes, or > Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [ > UTS51 ]. > *and not* GCB = Virama > > Note: we are already planning to get rid of the GAZ/EBG distinction ( > http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. > > Mark > > On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > >> On Mon, 1 Jan 2018 13:24:29 +0530 >> Manish Goregaokar via Unicode wrote: >> >> > sounds very much like a >> > degenerate case to me. >> >> Generally yes, but I'm not sure that they'd be inappropriate for >> Egyptian hieroglyphs showing human beings. The choice of determinative >> can convey unpronounceable semantic information, though I'm not sure >> that that can be as sensitive as skin colour. However, in such a case >> it would also be appropriate to give a skin tone modifier the property >> Extend. >> >> Richard. >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 2 03:41:48 2018 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Tue, 2 Jan 2018 15:11:48 +0530 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: <20180101145220.7334ba83@JRWUBU2> Message-ID: In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x Extended_Pictographic. Can this similarly be distilled to just ZWJ x Extended_Pictographic? This does affect cases like or and I'm not certain if that counts as a degenerate case. If we do this then all of the rules except the flag emoji one become things which can be easily calculated with local information, which is nice for implementors. (Also in the current draft I think GB11 needs a `E_Modifier?` somewhere but if we merge that with Extend that's not going to be necessary anyway) -Manish On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar wrote: > > Note: we are already planning to get rid of the GAZ/EBG distinction ( > http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. > > > This is great! I hadn't noticed this when I last saw that draft (I was > focusing on the Virama stuff). Good to know! > > > > Instead, we'd add one line to > *Extend :* > > Yeah, this is essentially what I was hoping we could do. > > Is there any way to formally propose this? Or is bringing it up here good > enough? > > Thanks, > > -Manish > > On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ?? via Unicode < > unicode at unicode.org> wrote: > >> This is an interesting suggestion, Manish. >> >> is a degenerate case, so if we >> following your suggestion we also could drop E_Base and E_Modifier, and >> rule GB10. >> >> Instead, we'd add one line to *Extend >> :* >> >> OLD >> Grapheme_Extend = Yes >> *and not* GCB = Virama >> >> NEW >> Grapheme_Extend = Yes, or >> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [ >> UTS51 ]. >> *and not* GCB = Virama >> >> Note: we are already planning to get rid of the GAZ/EBG distinction ( >> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >> >> Mark >> >> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < >> unicode at unicode.org> wrote: >> >>> On Mon, 1 Jan 2018 13:24:29 +0530 >>> Manish Goregaokar via Unicode wrote: >>> >>> > sounds very much like a >>> > degenerate case to me. >>> >>> Generally yes, but I'm not sure that they'd be inappropriate for >>> Egyptian hieroglyphs showing human beings. The choice of determinative >>> can convey unpronounceable semantic information, though I'm not sure >>> that that can be as sensitive as skin colour. However, in such a case >>> it would also be appropriate to give a skin tone modifier the property >>> Extend. >>> >>> Richard. >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 2 04:37:30 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Jan 2018 11:37:30 +0100 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: <20180101145220.7334ba83@JRWUBU2> Message-ID: > Or is bringing it up here good enough? You should submit a proposal, which you can do at https://www.unicode.org/reporting.html. It doesn't have to be much more than what you put in email. (A reminder for everyone here: This is simply a discussion list, and has no effect whatsoever unless someone submits a proposal for the UTC.) 
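As a practical aside for implementers reading this thread: the cluster behavior under discussion (emoji modifier sequences and emoji ZWJ sequences) can already be observed with an off-the-shelf segmenter. A minimal sketch follows, assuming the third-party Python "regex" module, whose \X pattern matches one extended grapheme cluster per UAX #29; the sample strings are illustrative only, and the exact results depend on the Unicode version the module was built against.

# Minimal sketch: count extended grapheme clusters with the third-party
# "regex" module (pip install regex). \X matches one extended grapheme
# cluster per UAX #29.
import regex

samples = {
    "woman + skin tone modifier (GB10)": "\U0001F469\U0001F3FB",
    "man technologist ZWJ sequence (GB11)": "\U0001F468\u200D\U0001F4BB",
    "letter + combining mark (GB9)": "a\u0301",
}

for label, text in samples.items():
    clusters = regex.findall(r"\X", text)
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {codepoints} -> {len(clusters)} cluster(s)")
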
Mark On Tue, Jan 2, 2018 at 10:32 AM, Manish Goregaokar wrote: > > Note: we are already planning to get rid of the GAZ/EBG distinction ( > http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. > > > This is great! I hadn't noticed this when I last saw that draft (I was > focusing on the Virama stuff). Good to know! > > > > Instead, we'd add one line to > *Extend :* > > Yeah, this is essentially what I was hoping we could do. > > Is there any way to formally propose this? Or is bringing it up here good > enough? > > Thanks, > > -Manish > > On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ?? via Unicode < > unicode at unicode.org> wrote: > >> This is an interesting suggestion, Manish. >> >> is a degenerate case, so if we >> following your suggestion we also could drop E_Base and E_Modifier, and >> rule GB10. >> >> Instead, we'd add one line to *Extend >> :* >> >> OLD >> Grapheme_Extend = Yes >> *and not* GCB = Virama >> >> NEW >> Grapheme_Extend = Yes, or >> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [ >> UTS51 ]. >> *and not* GCB = Virama >> >> Note: we are already planning to get rid of the GAZ/EBG distinction ( >> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >> >> Mark >> >> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < >> unicode at unicode.org> wrote: >> >>> On Mon, 1 Jan 2018 13:24:29 +0530 >>> Manish Goregaokar via Unicode wrote: >>> >>> > sounds very much like a >>> > degenerate case to me. >>> >>> Generally yes, but I'm not sure that they'd be inappropriate for >>> Egyptian hieroglyphs showing human beings. The choice of determinative >>> can convey unpronounceable semantic information, though I'm not sure >>> that that can be as sensitive as skin colour. However, in such a case >>> it would also be appropriate to give a skin tone modifier the property >>> Extend. >>> >>> Richard. >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 2 04:41:16 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Jan 2018 11:41:16 +0100 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: <20180101145220.7334ba83@JRWUBU2> Message-ID: We had that originally, but some people objected that some languages (Arabic, as I recall) can end a string of letters with a ZWJ, and immediately follow it by an emoji (without an intervening space) without wanting it to be joined into a grapheme cluster with a following symbol. While I personally consider that a degenerate case, we tightened the definition to prevent that. Mark Mark On Tue, Jan 2, 2018 at 10:41 AM, Manish Goregaokar wrote: > In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x > Extended_Pictographic. > > Can this similarly be distilled to just ZWJ x Extended_Pictographic? This > does affect cases like or letter, zwj, emoji> and I'm not certain if that counts as a degenerate > case. If we do this then all of the rules except the flag emoji one become > things which can be easily calculated with local information, which is nice > for implementors. 
> > (Also in the current draft I think GB11 needs a `E_Modifier?` somewhere > but if we merge that with Extend that's not going to be necessary anyway) > > -Manish > > On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar > wrote: > >> > Note: we are already planning to get rid of the GAZ/EBG distinction ( >> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >> >> >> This is great! I hadn't noticed this when I last saw that draft (I was >> focusing on the Virama stuff). Good to know! >> >> >> > Instead, we'd add one line to >> *Extend :* >> >> Yeah, this is essentially what I was hoping we could do. >> >> Is there any way to formally propose this? Or is bringing it up here good >> enough? >> >> Thanks, >> >> -Manish >> >> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ?? via Unicode < >> unicode at unicode.org> wrote: >> >>> This is an interesting suggestion, Manish. >>> >>> is a degenerate case, so if we >>> following your suggestion we also could drop E_Base and E_Modifier, and >>> rule GB10. >>> >>> Instead, we'd add one line to *Extend >>> :* >>> >>> OLD >>> Grapheme_Extend = Yes >>> *and not* GCB = Virama >>> >>> NEW >>> Grapheme_Extend = Yes, or >>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [ >>> UTS51 ]. >>> *and not* GCB = Virama >>> >>> Note: we are already planning to get rid of the GAZ/EBG distinction ( >>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >>> >>> Mark >>> >>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> On Mon, 1 Jan 2018 13:24:29 +0530 >>>> Manish Goregaokar via Unicode wrote: >>>> >>>> > sounds very much like a >>>> > degenerate case to me. >>>> >>>> Generally yes, but I'm not sure that they'd be inappropriate for >>>> Egyptian hieroglyphs showing human beings. The choice of determinative >>>> can convey unpronounceable semantic information, though I'm not sure >>>> that that can be as sensitive as skin colour. However, in such a case >>>> it would also be appropriate to give a skin tone modifier the property >>>> Extend. >>>> >>>> Richard. >>>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 2 04:52:27 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Tue, 2 Jan 2018 11:52:27 +0100 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: References: <20180101145220.7334ba83@JRWUBU2> Message-ID: BTW, relevant to this discussion is a proposal filed http://www.unicode.org/ L2/L2017/17434-emoji-rejex-uts51-def.pdf (The date is wrong, should be 2017-12-22) Mark On Tue, Jan 2, 2018 at 11:41 AM, Mark Davis ?? wrote: > We had that originally, but some people objected that some languages > (Arabic, as I recall) can end a string of letters with a ZWJ, and > immediately follow it by an emoji (without an intervening space) without > wanting it to be joined into a grapheme cluster with a following symbol. > While I personally consider that a degenerate case, we tightened the > definition to prevent that. > > Mark > > Mark > > On Tue, Jan 2, 2018 at 10:41 AM, Manish Goregaokar > wrote: > >> In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x >> Extended_Pictographic. >> >> Can this similarly be distilled to just ZWJ x Extended_Pictographic? This >> does affect cases like or > letter, zwj, emoji> and I'm not certain if that counts as a degenerate >> case. 
If we do this then all of the rules except the flag emoji one become >> things which can be easily calculated with local information, which is nice >> for implementors. >> >> (Also in the current draft I think GB11 needs a `E_Modifier?` somewhere >> but if we merge that with Extend that's not going to be necessary anyway) >> >> -Manish >> >> On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar >> wrote: >> >>> > Note: we are already planning to get rid of the GAZ/EBG distinction ( >>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >>> >>> >>> This is great! I hadn't noticed this when I last saw that draft (I was >>> focusing on the Virama stuff). Good to know! >>> >>> >>> > Instead, we'd add one line to >>> *Extend :* >>> >>> Yeah, this is essentially what I was hoping we could do. >>> >>> Is there any way to formally propose this? Or is bringing it up here >>> good enough? >>> >>> Thanks, >>> >>> -Manish >>> >>> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ?? via Unicode < >>> unicode at unicode.org> wrote: >>> >>>> This is an interesting suggestion, Manish. >>>> >>>> is a degenerate case, so if we >>>> following your suggestion we also could drop E_Base and E_Modifier, and >>>> rule GB10. >>>> >>>> Instead, we'd add one line to *Extend >>>> :* >>>> >>>> OLD >>>> Grapheme_Extend = Yes >>>> *and not* GCB = Virama >>>> >>>> NEW >>>> Grapheme_Extend = Yes, or >>>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [ >>>> UTS51 ]. >>>> *and not* GCB = Virama >>>> >>>> Note: we are already planning to get rid of the GAZ/EBG distinction ( >>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event. >>>> >>>> Mark >>>> >>>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode < >>>> unicode at unicode.org> wrote: >>>> >>>>> On Mon, 1 Jan 2018 13:24:29 +0530 >>>>> Manish Goregaokar via Unicode wrote: >>>>> >>>>> > sounds very much like a >>>>> > degenerate case to me. >>>>> >>>>> Generally yes, but I'm not sure that they'd be inappropriate for >>>>> Egyptian hieroglyphs showing human beings. The choice of determinative >>>>> can convey unpronounceable semantic information, though I'm not sure >>>>> that that can be as sensitive as skin colour. However, in such a case >>>>> it would also be appropriate to give a skin tone modifier the property >>>>> Extend. >>>>> >>>>> Richard. >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 2 14:30:34 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 2 Jan 2018 20:30:34 +0000 Subject: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10) In-Reply-To: <9b4df777-189c-287a-fdfd-d9bf4e750d0e@ix.netcom.com> References: <20180101145220.7334ba83@JRWUBU2> <9b4df777-189c-287a-fdfd-d9bf4e750d0e@ix.netcom.com> Message-ID: <20180102203034.1f25bbe2@JRWUBU2> On Tue, 2 Jan 2018 01:21:37 -0800 Asmus Freytag via Unicode wrote: > On 1/1/2018 6:52 AM, Richard Wordingham via Unicode wrote: > > Generally yes, but I'm not sure that they'd be inappropriate for > > Egyptian hieroglyphs showing human beings. The choice of > > determinative can convey unpronounceable semantic information, > > though I'm not sure that that can be as sensitive as skin colour. > > However, in such a case it would also be appropriate to give a skin > > tone modifier the property Extend. > They would be inappropriate because it's not part of the hieroglyphic > writing system to make those distinctions. 
If the distinction is kept to indisputable pictures, then that does keep it out of scope. It just occurred to me that the painter might choose the ethnically appropriate skin colour rather than just using the Egyptian skin colour. Richard. From unicode at unicode.org Tue Jan 2 14:55:47 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 2 Jan 2018 13:55:47 -0700 Subject: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)) In-Reply-To: References: Message-ID: Mark Davis wrote: > BTW, relevant to this discussion is a proposal filed > http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The > date is wrong, should be 2017-12-22) The phrase "emoji regex" had caused me to ignore this document, but I took a look based on this thread. It says "we still depend on the RGI test to filter the set of emoji sequences" and proposes that the EBNF in UTS #51 be simplified on the basis that only RGI sequences will pass the "possible emoji" test anyway. Thus it is true, as some people have said (i.e. in L2/17?382), that non-RGI sequences do not actually count as emoji, and therefore there is no way ? not merely no "recommended" way ? to represent the flags of entities such as Catalonia and Brittany. In 2016 I had asked for the emoji tag sequence mechanism for flags to be available for all CLDR subdivisions, not just three, with the understanding that the vast majority would not be supported by vendor glyphs. II t is unfortunate that, while the conciliatory name "recommended" was adopted for the three, the intent of "exclusively permitted" was retained. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 3 02:29:14 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 3 Jan 2018 09:29:14 +0100 Subject: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)) In-Reply-To: References: Message-ID: Thanks for your comments; you raise an excellent issue. There are valid sequences that are not RGI; a vendor can support additional emoji sequences (in particular, flags). So the wording in the doc isn't correct. It should do something like replace the use of "testing for RGI" by "testing for validity". The key areas involved in that are checking for the valid base+modifier combinations, valid RI pairs, and TAG sequences. The latter two involve testing based on the information applied in the appendix, while the valid base+modifiers are more regular and can be tested based on properties. Mark On Tue, Jan 2, 2018 at 9:55 PM, Doug Ewell via Unicode wrote: > Mark Davis wrote: > > BTW, relevant to this discussion is a proposal filed >> http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The >> date is wrong, should be 2017-12-22) >> > > The phrase "emoji regex" had caused me to ignore this document, but I took > a look based on this thread. It says "we still depend on the RGI test to > filter the set of emoji sequences" and proposes that the EBNF in UTS #51 be > simplified on the basis that only RGI sequences will pass the "possible > emoji" test anyway. > > Thus it is true, as some people have said (i.e. in L2/17?382), that > non-RGI sequences do not actually count as emoji, and therefore there is no > way ? not merely no "recommended" way ? to represent the flags of entities > such as Catalonia and Brittany. 
> > In 2016 I had asked for the emoji tag sequence mechanism for flags to be > available for all CLDR subdivisions, not just three, with the understanding > that the vast majority would not be supported by vendor glyphs. II t is > unfortunate that, while the conciliatory name "recommended" was adopted for > the three, the intent of "exclusively permitted" was retained. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 3 03:16:36 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 3 Jan 2018 10:16:36 +0100 Subject: Regex for Grapheme Cluster Breaks Message-ID: I had a UTC action to adjust http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters to update the regex, and other necessary changes surrounding text. Here is what I've come up with for an EBNF formulation. The $x are the GCB properties. cluster = crlf | $Control | precore* core postcore* ; crlf = $CR $LF ; precore = $Prepend ; postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] ); core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence | [^$Control $CR $LF] ); hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ; ri-sequence = $RI $RI ; skin-sequence = $E_Base $E_Modifier ; xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?: $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ; virama-sequence = [$Virama $ZWJ] $LinkingConsonant ; ?I have tools to turn that into a (lovely) regex: \p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}])(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])* ? ?(It is a bit shorter if some more property names/values are abbreviated.) I then tested against the current test file: GraphemeBreakTest.txt. There is one outlying failure with that test file: 813) ???? hex: 261D 0308 1F3FB test: [0, 4] ebnf: [0, 2, 4] I believe that is a problem with the test rather than the BNF, but I need to track it down in any event. ?A regex is much easier for many applications to use than the current rule syntax, so I'm going to see if the other segmentations could be reformulated ?as ebnfs (ideally corresponding to regular grammars, or in the worst case, for PEGs). Feedback is welcome. ? Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 3 04:38:17 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Wed, 3 Jan 2018 11:38:17 +0100 Subject: Regex for Grapheme Cluster Breaks In-Reply-To: References: Message-ID: Quick update: Manish pointed out that I'd misstated one of the rules, should be: skin-sequence = $E_Base $Extend* $E_Modifier ; ?With that change, the test passes. (Thanks Manish!)? Mark On Wed, Jan 3, 2018 at 10:16 AM, Mark Davis ?? 
wrote: > I had a UTC action to adjust http://www.unicode.org/ > reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_ > Clusters to update the regex, and other necessary changes surrounding > text. > > Here is what I've come up with for an EBNF formulation. The $x are the GCB > properties. > > cluster = crlf | $Control | precore* core postcore* ; > > > crlf = $CR $LF ; > > > precore = $Prepend ; > > > postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] ); > > > core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence > | [^$Control $CR $LF] ); > > > hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ; > > > ri-sequence = $RI $RI ; > > > > skin-sequence = $E_Base $E_Modifier ; > > > xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?: > $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ; > > > virama-sequence = [$Virama $ZWJ] $LinkingConsonant ; > > > ?I have tools to turn that into a (lovely) regex: > > \p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:\ > p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{ > gcb=t}*|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}|(?:\p{ > gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})(?:\ > p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_ > Pictographic}))*|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb= > LinkingConsonant}|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}]) > (?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}|[\p{ > gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])* > ? > ?(It is a bit shorter if some more property names/values are abbreviated.) > > I then tested against the current test file: GraphemeBreakTest.txt. There > is one outlying failure with that test file: > > 813) ???? > > hex: 261D 0308 1F3FB > > test: [0, 4] > > ebnf: [0, 2, 4] > > I believe that is a problem with the test rather than the BNF, but I need > to track it down in any event. > > ?A regex is much easier for many applications to use than the current rule > syntax, so I'm going to see if the other segmentations could be > reformulated ?as ebnfs (ideally corresponding to regular grammars, or in > the worst case, for PEGs). > > Feedback is welcome. > > ? > Mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 3 11:15:39 2018 From: unicode at unicode.org (=?utf-8?Q? J.=C2=A0S._Choi ?= via Unicode) Date: Wed, 03 Jan 2018 10:15:39 -0700 Subject: W3C discussion: nullifying BCP47 tags for emoji presentation in HTML/XML Message-ID: <669D0643-49C3-44E2-86AF-6B59C43350DB@icloud.com> A discussion relevant to UTS 51: Unicode Emoji is occurring in the W3C?s CSS Working Group on GitHub at https://github.com/w3c/csswg-drafts/issues/2138. To review, the Consortium recently registered several BCP47 language-tag extension keys for specifying transliteration and text-vs.-emoji presentation such as ?en-u-em-emoji? (see http://blog.unicode.org/2016/03/cldr-version-29-released.html). Basically, the W3C and the major web-browser vendors are considering normatively forbidding any influence of Unicode?s BCP47 extensions on the presentation of emoji characters in HTML and XML, viewing them as currently little used and fully redundant to variation-selector characters and the CSS font-presentation property. 
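For reference, the per-character mechanism that the browser vendors consider sufficient is the variation selectors: U+FE0E (VS15) requests text presentation and U+FE0F (VS16) requests emoji presentation for the base character they follow, whereas the BCP47 extension (e.g. "en-u-em-emoji") applies to a whole element or document. A minimal sketch in Python, standard library only; the base character is just an example.

# VS15 (U+FE0E) asks for text presentation, VS16 (U+FE0F) for emoji
# presentation of the preceding base character. Illustrative only.
BASE = "\u2764"  # HEAVY BLACK HEART; default presentation varies by platform
variants = {
    "default": BASE,
    "text (VS15)": BASE + "\uFE0E",
    "emoji (VS16)": BASE + "\uFE0F",
}

for label, s in variants.items():
    print(label.ljust(12), " ".join(f"U+{ord(ch):04X}" for ch in s))
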
The Consortium was the the originator of the BCP47 extensions and may have insight into their use cases; thus, those involved in registering the extensions may be interested in participating in this discussion, which is occurring on GitHub at https://github.com/w3c/csswg-drafts/issues/2138. So far, representatives from Google Chrome / Blink (Sascha Brawer), Microsoft Edge / Chakra (Sergey Malkin), Apple Safari / WebKit (Myles C. Maxfield), and W3C (Chris Lilley) have been participating. J. S. Choi -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 5 05:30:55 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Fri, 5 Jan 2018 12:30:55 +0100 Subject: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)) In-Reply-To: References: Message-ID: Doug, I modified my working draft, at https://docs.google.com/document/d/1EuNjbs0XrBwqlvCJxra44o3EVrwdBJUWsPf8Ec1fWKY If that looks ok, I'll submit. Thanks again for your comments. Mark Mark On Wed, Jan 3, 2018 at 9:29 AM, Mark Davis ?? wrote: > Thanks for your comments; you raise an excellent issue. There are valid > sequences that are not RGI; a vendor can support additional emoji sequences > (in particular, flags). So the wording in the doc isn't correct. > > It should do something like replace the use of "testing for RGI" by > "testing for validity". The key areas involved in that are checking for the > valid base+modifier combinations, valid RI pairs, and TAG sequences. The > latter two involve testing based on the information applied in the > appendix, while the valid base+modifiers are more regular and can be tested > based on properties. > > > Mark > > On Tue, Jan 2, 2018 at 9:55 PM, Doug Ewell via Unicode < > unicode at unicode.org> wrote: > >> Mark Davis wrote: >> >> BTW, relevant to this discussion is a proposal filed >>> http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The >>> date is wrong, should be 2017-12-22) >>> >> >> The phrase "emoji regex" had caused me to ignore this document, but I >> took a look based on this thread. It says "we still depend on the RGI test >> to filter the set of emoji sequences" and proposes that the EBNF in UTS #51 >> be simplified on the basis that only RGI sequences will pass the "possible >> emoji" test anyway. >> >> Thus it is true, as some people have said (i.e. in L2/17?382), that >> non-RGI sequences do not actually count as emoji, and therefore there is no >> way ? not merely no "recommended" way ? to represent the flags of entities >> such as Catalonia and Brittany. >> >> In 2016 I had asked for the emoji tag sequence mechanism for flags to be >> available for all CLDR subdivisions, not just three, with the understanding >> that the vast majority would not be supported by vendor glyphs. II t is >> unfortunate that, while the conciliatory name "recommended" was adopted for >> the three, the intent of "exclusively permitted" was retained. >> >> -- >> Doug Ewell | Thornton, CO, US | ewellic.org >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 6 18:08:45 2018 From: unicode at unicode.org (Paul Hoffman via Unicode) Date: Sat, 6 Jan 2018 16:08:45 -0800 Subject: Printed versions of Unicode v1 through v4 available Message-ID: Greetings. I am cleaning out my closet, and have printed versions of TUS v1 through v4 that I'm no longer interested in. 
If you want them and are willing to pay postage (US media mail rates are lowest), send me a note off-list. Otherwise, they will go the way of so many things in this world... --Paul Hoffman -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 7 18:32:47 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 8 Jan 2018 01:32:47 +0100 Subject: Printed versions of Unicode v1 through v4 available In-Reply-To: References: Message-ID: If you don't know what to do with your books (any kind), donate them to your local public library, or give them to a school; they may interest students. Such books are rarely found in primary schools, but they could interest pupils as supporting material; the earlier versions are simpler to study than the recent ones, and not all children have Internet access that works better than a poor smartphone. You should only throw away daily newspapers or old magazines. Students could even use them for creating art, and would be amazed to discover that there are more scripts than they think or are taught; they might also take an interest in learning foreign languages because of these books. 2018-01-07 1:08 GMT+01:00 Paul Hoffman via Unicode : > Greetings. I am cleaning out my closet, and have printed versions of TUS > v1 through v4 that I'm no longer interested in. If you want them and are > willing to pay postage (US media mail rates are lowest), send me a note > off-list. Otherwise, they will go the way of so many things in this world... > > --Paul Hoffman > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 7 19:19:21 2018 From: unicode at unicode.org (Paul Hoffman via Unicode) Date: Sun, 7 Jan 2018 17:19:21 -0800 Subject: Printed versions of Unicode v1 through v4 available In-Reply-To: References: Message-ID: Thanks, but folks have already spoken for them. Also, my local library is shedding this type of historical book, which is why I was looking for active Unicoders. --Paul Hoffman On Sun, Jan 7, 2018 at 4:32 PM, Philippe Verdy wrote: > If you don't know what to do with your books (any kind), donate them to your local > public library, or give them to a school; they may interest students. Such books are > rarely found in primary schools, but they could interest pupils as supporting material; > the earlier versions are simpler to study than the recent ones, and not all children > have Internet access that works better than a poor smartphone. > You should only throw away daily newspapers or old magazines. > Students could even use them for creating art, and would be amazed to discover that > there are more scripts than they think or are taught; they might also take an interest > in learning foreign languages because of these books. > > 2018-01-07 1:08 GMT+01:00 Paul Hoffman via Unicode : > >> Greetings. I am cleaning out my closet, and have printed versions of TUS >> v1 through v4 that I'm no longer interested in. If you want them and are >> willing to pay postage (US media mail rates are lowest), send me a note >> off-list. Otherwise, they will go the way of so many things in this world... >> >> --Paul Hoffman >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 9 21:44:28 2018 From: unicode at unicode.org (Karl Sanders via Unicode) Date: Wed, 10 Jan 2018 04:44:28 +0100 Subject: Whitespace-related characters Message-ID: Hi all, I was looking at this page: https://en.wikipedia.org/wiki/Whitespace_character specifically at the 'Related whitespace characters without Unicode character property "WSpace=Y"' table. I was wondering: 1) Is there an official source for this table in the standard? I think not and hence the following two questions. 2) Are there any characters that you think are missing from the table or maybe there are some that don't belong there? 3) I wouldn't put the U+2800 and U+2063 code points into such a table. Would you? Regards, Karl From unicode at unicode.org Wed Jan 10 21:44:25 2018 From: unicode at unicode.org (jillian mestel via Unicode) Date: Wed, 10 Jan 2018 22:44:25 -0500 Subject: =?utf-8?Q?Emoji=E2=80=99s?= Message-ID: To whom it may concern, I was very disappointed to learn that there are no emojis of portraying a dominant left hand. I feel this is rude, and is setting this group of people apart, and disregarding them. There are emojis of all different races of right dominant hands, yet not left dominant hands are portrayed. I hope this can be fixed, and that leftys and rightys can be equals. ??????????????? From unicode at unicode.org Wed Jan 10 23:35:01 2018 From: unicode at unicode.org (Pierpaolo Bernardi via Unicode) Date: Thu, 11 Jan 2018 06:35:01 +0100 Subject: =?UTF-8?B?UmU6IEVtb2pp4oCZcw==?= In-Reply-To: References: Message-ID: On Thu, Jan 11, 2018 at 4:44 AM, jillian mestel via Unicode wrote: > To whom it may concern, > I was very disappointed to learn that there are no emojis of portraying a dominant left hand. I feel this is rude, and is setting this group of people apart, and disregarding them. There are emojis of all different races of right dominant hands, yet not left dominant hands are portrayed. I hope this can be fixed, and that leftys and rightys can be equals. > ??????????????? Then people with no hands will be discriminated. From unicode at unicode.org Thu Jan 11 03:56:06 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 11 Jan 2018 10:56:06 +0100 Subject: =?UTF-8?B?UmU6IEVtb2pp4oCZcw==?= In-Reply-To: References: Message-ID: 2018-01-11 6:35 GMT+01:00 Pierpaolo Bernardi via Unicode < unicode at unicode.org>: > On Thu, Jan 11, 2018 at 4:44 AM, jillian mestel via Unicode > wrote: > > To whom it may concern, > > I was very disappointed to learn that there are no emojis of portraying > a dominant left hand. I feel this is rude, and is setting this group of > people apart, and disregarding them. There are emojis of all different > races of right dominant hands, yet not left dominant hands are portrayed. I > hope this can be fixed, and that leftys and rightys can be equals. > > ??????????????? > > Then people with no hands will be discriminated. Do you suggest those unable to use their hands should have their emojis with their right or left foot holding the pen ? Or with the pen in their mouth ? Or with their eyes followed by a camera and blinking to select letters/words to compose on a display ? or using seech-to-text processors ? 
There are a lot of different handicaps with different solutions, and the first one is severe visual deficiency (or blindness), along with severe intellectual deficiencies (from birth, or after health accidents), where people can't read or distinguish the emojis or understand their differences, and will need assistance from equipment or a third party. Think about the symbol for wheelchair: do you want to distinguish a "left-hand" and "right-hand" version (by mirroring), or a motorized version for those that can't push it with their hands, or a wheeled bed for those that can't sit up? These omissions in existing emojis are not "rude" or "discriminatory"; they are just not requested for actual use. What is really "rude" is the experienced handicaps, and what is "discriminatory" is how we accept (or refuse) to adapt our social life, common equipment and laws, to improve the coexistence of people with and without these deficiencies. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 11 04:07:23 2018 From: unicode at unicode.org (Pierpaolo Bernardi via Unicode) Date: Thu, 11 Jan 2018 11:07:23 +0100 Subject: =?UTF-8?Q?Re:_Emoji=E2=80=99s?= Message-ID: <2icg7ha1v1k4uanc63s3p3l9.1515665243457@email.android.com> On 11 January 2018, at 10:56, Philippe Verdy wrote: > > >2018-01-11 6:35 GMT+01:00 Pierpaolo Bernardi via Unicode : > >On Thu, Jan 11, 2018 at 4:44 AM, jillian mestel via Unicode > wrote: >> To whom it may concern, >> I was very disappointed to learn that there are no emojis of portraying a dominant left hand. I feel this is rude, and is setting this group of people apart, and disregarding them. There are emojis of all different races of right dominant hands, yet not left dominant hands are portrayed. I hope this can be fixed, and that leftys and rightys can be equals. >> ??????????????? > >Then people with no hands will be discriminated. > >? > >Do you suggest those unable to use their hands should have their emojis with their right or left foot holding the pen ? No. Where did you get this idea from? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 11 05:30:39 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 11 Jan 2018 12:30:39 +0100 (CET) Subject: =?UTF-8?Q?Re:_Emoji=E2=80=99s?= In-Reply-To: References: Message-ID: <1258322087.13885.1515670239525@ox.hosteurope.de> jillian mestel: > > I was very disappointed to learn that there are no emojis of portraying a dominant left hand. See for the general emoji proposal process. This would actually not need a new character being assigned a code point, because existing ?? U+1F58E could be reused to contrast with ?? U+270D. It would just need the Emoji property being set, which can be done with any update to UTS#51. UTS#51 11.0 (beta) introduces ZWJ sequences with left and right arrows (?? U+2B05, ?? U+27A1) as suffixed determiners to explicitly indicate directional orientation, but this would be an inappropriate solution for this case. The custom emoji sets by Samsung and LG already include colorful graphics for U+1F58E. UTC should adopt a policy that grants any pictographic character the Emoji property if it is supported by at least two major vendors. ("Major vendor" would need a proper definition.) 
These 20 characters would be affected at the moment if my records are correct and complete: - U+2610 ?: BALLOT BOX - U+2612 ?: BALLOT BOX WITH X - U+261C ?: WHITE LEFT POINTING INDEX [L2/17-421] - U+261E ?: WHITE RIGHT POINTING INDEX [L2/17-421] - U+261F ?: WHITE DOWN POINTING INDEX [L2/17-421] - U+1F323 ??: WHITE SUN - U+1F544 ??: NOTCHED RIGHT SEMICIRCLE WITH THREE DOTS - U+1F546 ??: WHITE LATIN CROSS - U+1F547 ??: HEAVY LATIN CROSS - U+1F568 ??: RIGHT SPEAKER - U+1F569 ??: RIGHT SPEAKER WITH ONE SOUND WAVE - U+1F56A ??: RIGHT SPEAKER WITH THREE SOUND WAVES - U+1F56D ??: RINGING BELL [L2/17-240] - U+1F58E ??: LEFT WRITING HAND - U+1F591 ??: REVERSED RAISED HAND WITH FINGERS SPLAYED - U+1F592 ??: REVERSED THUMBS UP SIGN - U+1F593 ??: REVERSED THUMBS DOWN SIGN - U+1F5E2 ??: LIPS - U+1F6C6 ??: TRIANGLE WITH ROUNDED CORNERS - U+1F6C7 ??: PROHIBITED SIGN [L2/17-240]: http://www.unicode.org/L2/L2017/17240-ringing-bell-chg.pdf [L2/17-421]: http://www.unicode.org/L2/L2017/17421r-emoji-changes.pdf From unicode at unicode.org Thu Jan 11 22:53:26 2018 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Fri, 12 Jan 2018 10:23:26 +0530 Subject: =?UTF-8?B?UmU6IEVtb2pp4oCZcw==?= In-Reply-To: <1258322087.13885.1515670239525@ox.hosteurope.de> References: <1258322087.13885.1515670239525@ox.hosteurope.de> Message-ID: I submitted a proposal to emojify the left writing hand code point. -Manish On Thu, Jan 11, 2018 at 5:00 PM, Christoph P?per via Unicode < unicode at unicode.org> wrote: > jillian mestel: > > > > I was very disappointed to learn that there are no emojis of portraying > a dominant left hand. > > See for the general emoji > proposal process. This would actually not need a new character being > assigned a code point, because existing ?? U+1F58E could be reused to > contrast with ?? U+270D. It would just need the Emoji property being set > which can be done with any update to UTS#51. > > UTS#51 11.0 (beta) introduces ZWJ sequences with left and right arrows (?? > U+2B05, ?? U+27A1) as suffixed determiners to explicitly indicate > directional orientation, but this would be an inappropriate solution for > this case. > > The custom emoji sets by Samsung and LG already include colorful graphics > for U+1F58E. UTC should adopt a policy that grants any pictographic > character the Emoji property if it is supported by at least two major > vendors. ("Major vendor" would need a proper definition.) 
These 20 > characters would be affected at the moment if my records are correct and > complete: > > - U+2610 ?: BALLOT BOX > - U+2612 ?: BALLOT BOX WITH X > - U+261C ?: WHITE LEFT POINTING INDEX [L2/17-421] > - U+261E ?: WHITE RIGHT POINTING INDEX [L2/17-421] > - U+261F ?: WHITE DOWN POINTING INDEX [L2/17-421] > - U+1F323 ??: WHITE SUN > - U+1F544 ??: NOTCHED RIGHT SEMICIRCLE WITH THREE DOTS > - U+1F546 ??: WHITE LATIN CROSS > - U+1F547 ??: HEAVY LATIN CROSS > - U+1F568 ??: RIGHT SPEAKER > - U+1F569 ??: RIGHT SPEAKER WITH ONE SOUND WAVE > - U+1F56A ??: RIGHT SPEAKER WITH THREE SOUND WAVES > - U+1F56D ??: RINGING BELL [L2/17-240] > - U+1F58E ??: LEFT WRITING HAND > - U+1F591 ??: REVERSED RAISED HAND WITH FINGERS SPLAYED > - U+1F592 ??: REVERSED THUMBS UP SIGN > - U+1F593 ??: REVERSED THUMBS DOWN SIGN > - U+1F5E2 ??: LIPS > - U+1F6C6 ??: TRIANGLE WITH ROUNDED CORNERS > - U+1F6C7 ??: PROHIBITED SIGN > > [L2/17-240]: http://www.unicode.org/L2/L2017/17240-ringing-bell-chg.pdf > [L2/17-421]: http://www.unicode.org/L2/L2017/17421r-emoji-changes.pdf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 13 11:14:13 2018 From: unicode at unicode.org (Henri Sivonen via Unicode) Date: Sat, 13 Jan 2018 19:14:13 +0200 Subject: PDF restrictions on the Unicode Standard 10.0 Message-ID: I was reading https://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf on a Sony Digital Paper device and tried to scribble some notes and make highlights but I couldn't. I still couldn't after ensuring that the pen was charged and could write on other PDFs. Since Evince told me just "Security: No", since the Digital Paper's UI for designating non-editability is easy to miss and since there's no password required to open the file, it took me non-trivial time to figure out what was going on. Upon examining the PDF in Acrobat Reader, it turned out that even though the PDF can be viewed, printed and copied from without artificial restrictions, there are various restriction bits set for modifying the file. (Screenshot: https://hsivonen.fi/screen/unicode-pdf-restrictions.png ) It doesn't make sense to me that the Consortium restricts me from adding highlights or handwriting if I open the Standard on an e-Ink device even though I can do those things if I print the PDF. I'd like to request that going forward the Consortium refrain from using restriction bits or any "security" on the PDFs it publishes. -- Henri Sivonen hsivonen at hsivonen.fi https://hsivonen.fi/ From unicode at unicode.org Mon Jan 15 21:25:01 2018 From: unicode at unicode.org (Eric Muller via Unicode) Date: Mon, 15 Jan 2018 19:25:01 -0800 Subject: 0027, 02BC, 2019, or a new character? Message-ID: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> https://www.nytimes.com/2018/01/15/world/asia/kazakhstan-alphabet-nursultan-nazarbayev.html Eric. From unicode at unicode.org Mon Jan 15 21:57:13 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 15 Jan 2018 20:57:13 -0700 Subject: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)) In-Reply-To: References: Message-ID: On January 5, Mark Davis wrote: > Doug, I modified my working draft, at > https://docs.google.com/document/d/1EuNjbs0XrBwqlvCJxra44o3EVrwdBJUWsPf8Ec1fWKY > > If that looks ok, I'll submit. Sorry for the delay. The text substitutions look fine. 
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 15 22:16:21 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 15 Jan 2018 20:16:21 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: It will probably be the ASCII apostrophe. The stated intent favors the apostrophe over diacritics or special characters to ensure that the language can be input to computers with standard keyboards. From unicode at unicode.org Mon Jan 15 23:55:32 2018 From: unicode at unicode.org (Pravin Jain via Unicode) Date: Tue, 16 Jan 2018 11:25:32 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: new characters can always be left to proper input methods being available, I am not sure, but I feel over use of apostrophes can lead to ambiguity. On Tue, Jan 16, 2018 at 9:46 AM, James Kass via Unicode wrote: > It will probably be the ASCII apostrophe. The stated intent favors > the apostrophe over diacritics or special characters to ensure that > the language can be input to computers with standard keyboards. > -- Pravin Jain (M)+91-9426054269 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 16 01:40:15 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 15 Jan 2018 23:40:15 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: It's possible that the ruler of Kazakhstan, who is guiding this script change movement, is unaware of modern, proper input methods. Avoiding ambiguity was the reason given for the government's rejection of ASCII-Latin digraphs; it was thought that, for example, English language students might become confused by a phonetic difference between the same digraph as used in English versus Kazakh. On a side note, wouldn't most of the "standard keyboards" currently in Kazakhstan be labelled in Cyrillic anyway? More info on Kazakh's writing system history: http://www.omniglot.com/writing/kazakh.htm From unicode at unicode.org Tue Jan 16 02:00:19 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 Jan 2018 08:00:19 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: <20180116080019.2738554a@JRWUBU2> On Mon, 15 Jan 2018 20:16:21 -0800 James Kass via Unicode wrote: > It will probably be the ASCII apostrophe. The stated intent favors > the apostrophe over diacritics or special characters to ensure that > the language can be input to computers with standard keyboards. Typing U+0027 into a word processor takes planning. Of the three, it should obviously be the modifier letter U+02BC, but I think what gets stored will be U+0027 or the single quotation mark U+2019. However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA ABOVE RIGHT. Richard. From unicode at unicode.org Tue Jan 16 02:10:19 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Tue, 16 Jan 2018 13:40:19 +0530 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: Rejecting the digraph method (which is probably the simplest) doesn't have much meaning, because digraphs have different sounds in different languages all the time, like "ch" in English and German. Anyhow, it certainly can be difficult convincing non-technical political people. Modifier letters are more legible than modifier punctuation IMO, so that may be an option. And the labels on keycaps don't mean anything at all. We in India use the plain QWERTY keyboard all the time for our scripts. In any case, the linguistic committee should present their recommendation along with a new set of actual keycaps and an MSKLC or similar input method, to show the president that what is recommended can be input using "a standard keyboard". -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 16 02:46:19 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 16 Jan 2018 08:46:19 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: <20180116084619.58d0c5a7@JRWUBU2> On Mon, 15 Jan 2018 23:40:15 -0800 James Kass via Unicode wrote: > On a side note, wouldn't most of the "standard keyboards" currently in > Kazakhstan be labelled in Cyrillic anyway? They're probably already labelled in Cyrillic *and* printable ASCII (US QWERTY). Using the Cyrillic labels for non-ASCII Latin Kazakh would cause utter confusion. Richard. From unicode at unicode.org Wed Jan 17 07:06:26 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 17 Jan 2018 14:06:26 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: Excessive digraphs based on (non-combining) apostrophes will create numerous problems. The only case I know of that uses an apostrophe within a multigraph is the trigraph "c'h" used in Breton, where it serves to differentiate it from "ch" (though here too it would have been simpler to use another digraph, such as "sh", or a diacritic; but Bretons wanted to use only the diacritics available in French, which has no diacritic on consonants except "ç" with the cedilla, which could have been used there, and the tilde in "ñ"). The "c'h" trigraph in Breton causes fewer problems, however, because it is not final and sits within a pair where it is unlikely to mark an elision between two words. But now Kazakh will have difficulties marking elisions, and will also have problems allowing distinctive quotations. I hope they will never have cases like 's'a'n'd'' with pairs of apostrophes at the end; it would have been more readable to see '????'. Using the caron diacritic, typical of Eastern European languages, would also have done the trick over consonants, while preserving the possibility of capitalizing letters: a single diacritic is easy to map on keyboards. Adding the diaeresis or macron, or even the acute for the long vowels, would also have done the trick as the second diacritic. Kazakh has a Turkic origin, and solutions based on other Turkic alphabets could have been used, but maybe they did not like the complexity of Turkish's dotless vs. dotted "i". Still, a few diacritics could have helped without having to use custom ligatures or digraphs.
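The difference in normalization behaviour between a true precomposable diacritic and these apostrophe-like candidates is easy to check; here is a minimal Python sketch using only the standard unicodedata module (the base letter "a" is just an illustrative sample):

    import unicodedata

    # U+030C COMBINING CARON composes with the base letter under NFC:
    s1 = unicodedata.normalize("NFC", "a\u030C")
    print(len(s1), [hex(ord(c)) for c in s1])    # 1 ['0x1ce']  -> precomposed U+01CE

    # U+02BC MODIFIER LETTER APOSTROPHE and U+0315 COMBINING COMMA ABOVE RIGHT
    # have no precomposed forms, so they always remain separate code points:
    for mark in ("\u02BC", "\u0315"):
        s = unicodedata.normalize("NFC", "a" + mark)
        print(len(s), [hex(ord(c)) for c in s])  # 2 code points each time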
Now I think that these proposed non-combining apostrophes will evolve to combining acute accents (the most widely used diacritic in Latin in most languages): it will make the texts actually more readable. 2018-01-16 9:10 GMT+01:00 Shriramana Sharma via Unicode : > Rejecting the digraph method (which is probably the simplest) doesn't have > much meaning because they have different sounds in different languages all > the time like ch in English and German. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 17 16:11:09 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Wed, 17 Jan 2018 23:11:09 +0100 (CET) Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> Message-ID: <1563876561.75365.1516227071367@ox.hosteurope.de> James Kass via Unicode : > > It will probably be the ASCII apostrophe. The stated intent favors > the apostrophe over diacritics or special characters to ensure that > the language can be input to computers with standard keyboards. Yes, this can only mean U+0027, but apparently official material, in MS Word format, shows the curly apostrophe punctuation mark U+2019 instead. There is probably no doubt among list subscribers that U+02BC should be used for any apostrophe that works like a proper letter. embeds , and both are quoted in . Cyrl Latn-kz Latn ? A?/A' ?/? ? G?/G' ?/? ?/? I?/I' ?/? ? N?/N' ?/? ? O?/O' ?/? ? Y?/Y' W ? U?/U' ? ? C?/C' ?/Ch ? S?/S' ?/Sh I sympathize with the ease of input argument, but input (keys) does neither have to equate storage (characters) nor output (glyphs). Furthermore, all orthographies should (and many constructed ones don't) respect that almost all text is read more often and by more people than it is written by, thus reader experience is more important than writer experience. Whether you use - a single dead key that has to be typed before the corresponding letter without diacritic marks or - a combinator key (e.g. AltGr) that must be kept pressed while the base is typed or - a secondary selection that appears when the base letter's key is hold down longer or - separate keys for each letter outside the MRA, the best solution depends on the hardware, software and, of course, the writing system, i.e. how frequently and prominently these letters occur. From unicode at unicode.org Wed Jan 17 20:30:57 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 17 Jan 2018 21:30:57 -0500 Subject: Observations and rants Message-ID: <289f25b5-8754-4b9e-4256-51667ece2948@kli.org> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 02:21:27 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Thu, 18 Jan 2018 08:21:27 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180116080019.2738554a@JRWUBU2> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> Message-ID: <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode > wrote: On Mon, 15 Jan 2018 20:16:21 -0800 James Kass via Unicode > wrote: It will probably be the ASCII apostrophe. The stated intent favors the apostrophe over diacritics or special characters to ensure that the language can be input to computers with standard keyboards. Typing U+0027 into a word processor takes planning. 
Of the three, it should obviously be the modifier letter U+02BC, but I think what gets stored will be U+0027 or the single quotation mark U+2019. However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA ABOVE RIGHT. Richard. I have just tested twitter hashtags and as one would expect, U+02BC does not break hashtags. See twitter.com/andreschappo/status/953903964722024448 Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 05:00:35 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Thu, 18 Jan 2018 11:00:35 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> Message-ID: <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode > wrote: On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode > wrote: On Mon, 15 Jan 2018 20:16:21 -0800 James Kass via Unicode > wrote: It will probably be the ASCII apostrophe. The stated intent favors the apostrophe over diacritics or special characters to ensure that the language can be input to computers with standard keyboards. Typing U+0027 into a word processor takes planning. Of the three, it should obviously be the modifier letter U+02BC, but I think what gets stored will be U+0027 or the single quotation mark U+2019. However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA ABOVE RIGHT. Richard. I have just tested twitter hashtags and as one would expect, U+02BC does not break hashtags. See twitter.com/andreschappo/status/953903964722024448 ...and, just in case twitter.com/andreschappo/status/953944089896083456 Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 08:55:52 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Thu, 18 Jan 2018 20:25:52 +0530 Subject: Emoji for major planets at least? Message-ID: Hello people. We have sun, earth and moon emoji (3 for the earth and more for the moon's phases). But we don't have emoji for the rest of the planets. We have astrological symbols for all the planets and a few non-existent imaginary "planets" as well. Given this, would it be impractical to encode proper emoji characters for the rest of the planets, at least the major ones whose physical characteristics are well known and identifiable? I mean for example identifying Sedna and Quaoar (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not going to be practical for all those other than astronomy buffs but the physical shapes of the major planets are known to all high school students? -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Thu Jan 18 09:38:06 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Thu, 18 Jan 2018 15:38:06 +0000 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> Message-ID: On 18 Jan 2018, at 08:21, Andre Schappo via Unicode > wrote: On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode > wrote: On Mon, 15 Jan 2018 20:16:21 -0800 James Kass via Unicode > wrote: It will probably be the ASCII apostrophe. The stated intent favors the apostrophe over diacritics or special characters to ensure that the language can be input to computers with standard keyboards. Typing U+0027 into a word processor takes planning. Of the three, it should obviously be the modifier letter U+02BC, but I think what gets stored will be U+0027 or the single quotation mark U+2019. However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA ABOVE RIGHT. Richard. I have just tested twitter hashtags and as one would expect, U+02BC does not break hashtags. See twitter.com/andreschappo/status/953903964722024448 I have done a bit more investigation and as a result have written a short blog article ? schappo.blogspot.co.uk/2018/01/computer-science-internationalization_18.html Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 11:44:05 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 Jan 2018 09:44:05 -0800 Subject: Emoji for major planets at least? In-Reply-To: References: Message-ID: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 12:01:43 2018 From: unicode at unicode.org (John H. Jenkins via Unicode) Date: Thu, 18 Jan 2018 11:01:43 -0700 Subject: Emoji for major planets at least? In-Reply-To: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> Message-ID: Well, you can go with Venus = white planet, Mercury = grey planet, Uranus = greenish planet, Neptune = bluish planet, Jupiter = striped planet. As you say, though, without a context, none of them convey much and Venus, at least, would just be a circle. Plus there's the question of the context in which someone would want to send little pictures of the planets. This sounds like it would be adding emoji just because. > On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode wrote: > > On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: >> Hello people. >> >> We have sun, earth and moon emoji (3 for the earth and more for the >> moon's phases). But we don't have emoji for the rest of the planets. >> >> We have astrological symbols for all the planets and a few >> non-existent imaginary "planets" as well. >> >> Given this, would it be impractical to encode proper emoji characters >> for the rest of the planets, at least the major ones whose physical >> characteristics are well known and identifiable? >> >> I mean for example identifying Sedna and Quaoar >> (https://en.wikipedia.org/wiki/File:EightTNOs.png ) is probably not >> going to be practical for all those other than astronomy buffs but the >> physical shapes of the major planets are known to all high school >> students? >> > Earth = blue planet (with clouds) > > Mars = red planet > > Saturn = planet with rings > > I don't think any of the other ones are identifiable in a context-free setting, unless you draw a "big planet with red dot" for Jupiter. 
> > Earth would have to be depicted in a way that doesn't focus on "hemispheres", or you miss the idea of it as "planet". > > > > A./ > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 12:46:24 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 18 Jan 2018 10:46:24 -0800 Subject: Emoji for major planets at least? In-Reply-To: References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> Message-ID: <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> On 1/18/2018 10:01 AM, John H. Jenkins wrote: > Well, you can go with Venus = white planet, Mercury = grey planet, > Uranus = greenish planet, Neptune = bluish planet, Jupiter = striped > planet. > > As you say, though, without a context, none of them convey much and > Venus, at least, would just be a circle. > > Plus there's the question of the context in which someone would want > to send little pictures of the planets. This sounds like it would be > adding emoji just because. "Earth" as in "a blue ball in space" is something that reached iconic status after the famous photo taken during the early Apollo missions. I could definitely see that used in a variety of possible contexts. And the recognition value is higher than for many recent emoji. Saturn, with its rings (even though it's no longer the only one known with rings) also is iconic and highly recognizable. I lack imagination as to when someone would want to use it in communication, but I have the same issue with quite a few recent emoji, some of which are far less iconic or recognizable. I think it does lend itself to describe a "non-earth" type planet, or even the generic idea of a planet (as opposed to a star/sun). Mars and Venus have tons of connotations, which could be expressed by using an emoji (as opposed to the astrological symbol for each), but only Mars is reasonably recognizable without lots of pre-established context. That red color. In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes and red dot would more recognizable than any of the remaining planets (on par or better with many recent emoji), but I see even less scope for using it metaphorically or in extended contexts. If someone were to make a proposal, I would suggest to them to limit it to these four and to provide more of a suggestion as to how these might show up in use. A./ > >> On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode >> > wrote: >> >> On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: >>> Hello people. >>> >>> We have sun, earth and moon emoji (3 for the earth and more for the >>> moon's phases). But we don't have emoji for the rest of the planets. >>> >>> We have astrological symbols for all the planets and a few >>> non-existent imaginary "planets" as well. >>> >>> Given this, would it be impractical to encode proper emoji characters >>> for the rest of the planets, at least the major ones whose physical >>> characteristics are well known and identifiable? >>> >>> I mean for example identifying Sedna and Quaoar >>> (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not >>> going to be practical for all those other than astronomy buffs but the >>> physical shapes of the major planets are known to all high school >>> students? 
>>> >> Earth = blue planet (with clouds) >> >> Mars = red planet >> >> Saturn = planet with rings >> >> I don't think any of the other ones are identifiable in a >> context-free setting, unless you draw a "big planet with red dot" for >> Jupiter. >> >> Earth would have to be depicted in a way that doesn't focus on >> "hemispheres", or you miss the idea of it as "planet". >> >> >> A./ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 12:51:39 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 Jan 2018 10:51:39 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 13:04:09 2018 From: unicode at unicode.org (Anshuman Pandey via Unicode) Date: Thu, 18 Jan 2018 13:04:09 -0600 Subject: Emoji for major planets at least? In-Reply-To: <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> Message-ID: <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> Proposals for planet emoji were submitted in April 2017: https://www.unicode.org/L2/L2017/17100-planet-emoji-seq.pdf http://www.unicode.org/L2/L2017/17100r-planet-emoji-seq.pdf I?m not sure what the result was. Anshu > On Jan 18, 2018, at 12:46 PM, Asmus Freytag (c) via Unicode wrote: > >> On 1/18/2018 10:01 AM, John H. Jenkins wrote: >> Well, you can go with Venus = white planet, Mercury = grey planet, Uranus = greenish planet, Neptune = bluish planet, Jupiter = striped planet. >> >> As you say, though, without a context, none of them convey much and Venus, at least, would just be a circle. >> >> Plus there's the question of the context in which someone would want to send little pictures of the planets. This sounds like it would be adding emoji just because. > > "Earth" as in "a blue ball in space" is something that reached iconic status after the famous photo taken during the early Apollo missions. I could definitely see that used in a variety of possible contexts. And the recognition value is higher than for many recent emoji. > > Saturn, with its rings (even though it's no longer the only one known with rings) also is iconic and highly recognizable. I lack imagination as to when someone would want to use it in communication, but I have the same issue with quite a few recent emoji, some of which are far less iconic or recognizable. I think it does lend itself to describe a "non-earth" type planet, or even the generic idea of a planet (as opposed to a star/sun). > > Mars and Venus have tons of connotations, which could be expressed by using an emoji (as opposed to the astrological symbol for each), but only Mars is reasonably recognizable without lots of pre-established context. That red color. > > In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes and red dot would more recognizable than any of the remaining planets (on par or better with many recent emoji), but I see even less scope for using it metaphorically or in extended contexts. 
> > If someone were to make a proposal, I would suggest to them to limit it to these four and to provide more of a suggestion as to how these might show up in use. > > A./ >> >>> On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode wrote: >>> >>>> On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: >>>> Hello people. >>>> >>>> We have sun, earth and moon emoji (3 for the earth and more for the >>>> moon's phases). But we don't have emoji for the rest of the planets. >>>> >>>> We have astrological symbols for all the planets and a few >>>> non-existent imaginary "planets" as well. >>>> >>>> Given this, would it be impractical to encode proper emoji characters >>>> for the rest of the planets, at least the major ones whose physical >>>> characteristics are well known and identifiable? >>>> >>>> I mean for example identifying Sedna and Quaoar >>>> (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not >>>> going to be practical for all those other than astronomy buffs but the >>>> physical shapes of the major planets are known to all high school >>>> students? >>>> >>> Earth = blue planet (with clouds) >>> >>> Mars = red planet >>> >>> Saturn = planet with rings >>> >>> I don't think any of the other ones are identifiable in a context-free setting, unless you draw a "big planet with red dot" for Jupiter. >>> >>> Earth would have to be depicted in a way that doesn't focus on "hemispheres", or you miss the idea of it as "planet". >>> >>> >>> >>> A./ >>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 15:10:46 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 18 Jan 2018 22:10:46 +0100 Subject: Emoji for major planets at least? In-Reply-To: <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> Message-ID: Well I can think of a popular pseudo-planet, the "Death Star" or "Black Star" (for the "Star Wars" series), which is easily recognized by its color and shape (with the deep built crater, and optionally its destroyed half part) which also looks like a real planet, the Saturnian moon Mimas with its very wide crater (to avoid the copyright issue)... 2018-01-18 20:04 GMT+01:00 Anshuman Pandey via Unicode : > Proposals for planet emoji were submitted in April 2017: > > https://www.unicode.org/L2/L2017/17100-planet-emoji-seq.pdf > > http://www.unicode.org/L2/L2017/17100r-planet-emoji-seq.pdf > > I?m not sure what the result was. > > Anshu > > > On Jan 18, 2018, at 12:46 PM, Asmus Freytag (c) via Unicode < > unicode at unicode.org> wrote: > > On 1/18/2018 10:01 AM, John H. Jenkins wrote: > > Well, you can go with Venus = white planet, Mercury = grey planet, Uranus > = greenish planet, Neptune = bluish planet, Jupiter = striped planet. > > As you say, though, without a context, none of them convey much and Venus, > at least, would just be a circle. > > Plus there's the question of the context in which someone would want to > send little pictures of the planets. This sounds like it would be adding > emoji just because. > > > "Earth" as in "a blue ball in space" is something that reached iconic > status after the famous photo taken during the early Apollo missions. I > could definitely see that used in a variety of possible contexts. And the > recognition value is higher than for many recent emoji. 
> > Saturn, with its rings (even though it's no longer the only one known with > rings) also is iconic and highly recognizable. I lack imagination as to > when someone would want to use it in communication, but I have the same > issue with quite a few recent emoji, some of which are far less iconic or > recognizable. I think it does lend itself to describe a "non-earth" type > planet, or even the generic idea of a planet (as opposed to a star/sun). > > Mars and Venus have tons of connotations, which could be expressed by > using an emoji (as opposed to the astrological symbol for each), but only > Mars is reasonably recognizable without lots of pre-established context. > That red color. > > In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes > and red dot would more recognizable than any of the remaining planets (on > par or better with many recent emoji), but I see even less scope for using > it metaphorically or in extended contexts. > > If someone were to make a proposal, I would suggest to them to limit it to > these four and to provide more of a suggestion as to how these might show > up in use. > > A./ > > > On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode < > unicode at unicode.org> wrote: > > On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: > > Hello people. > > We have sun, earth and moon emoji (3 for the earth and more for the > moon's phases). But we don't have emoji for the rest of the planets. > > We have astrological symbols for all the planets and a few > non-existent imaginary "planets" as well. > > Given this, would it be impractical to encode proper emoji characters > for the rest of the planets, at least the major ones whose physical > characteristics are well known and identifiable? > > I mean for example identifying Sedna and Quaoar > (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not > going to be practical for all those other than astronomy buffs but the > physical shapes of the major planets are known to all high school > students? > > > Earth = blue planet (with clouds) > > Mars = red planet > > Saturn = planet with rings > > I don't think any of the other ones are identifiable in a context-free > setting, unless you draw a "big planet with red dot" for Jupiter. > > Earth would have to be depicted in a way that doesn't focus on > "hemispheres", or you miss the idea of it as "planet". > > > A./ > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 15:59:12 2018 From: unicode at unicode.org (Walter Tross via Unicode) Date: Thu, 18 Jan 2018 22:59:12 +0100 Subject: Emoji for major planets at least? In-Reply-To: References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> Message-ID: Sorry guys if I step in uninvited, but I must say that I had hoped that the subject of this thread was ironical. Do you guys want to have an emoji for every entry of some encyclopaedia? You need JPEG, PNG, etc., not Unicode. Sorry Walter 2018-01-18 22:10 GMT+01:00 Philippe Verdy via Unicode : > Well I can think of a popular pseudo-planet, the "Death Star" or "Black > Star" (for the "Star Wars" series), which is easily recognized by its color > and shape (with the deep built crater, and optionally its destroyed half > part) which also looks like a real planet, the Saturnian moon Mimas with > its very wide crater (to avoid the copyright issue)... 
> > 2018-01-18 20:04 GMT+01:00 Anshuman Pandey via Unicode < > unicode at unicode.org>: > >> Proposals for planet emoji were submitted in April 2017: >> >> https://www.unicode.org/L2/L2017/17100-planet-emoji-seq.pdf >> >> http://www.unicode.org/L2/L2017/17100r-planet-emoji-seq.pdf >> >> I?m not sure what the result was. >> >> Anshu >> >> >> On Jan 18, 2018, at 12:46 PM, Asmus Freytag (c) via Unicode < >> unicode at unicode.org> wrote: >> >> On 1/18/2018 10:01 AM, John H. Jenkins wrote: >> >> Well, you can go with Venus = white planet, Mercury = grey planet, Uranus >> = greenish planet, Neptune = bluish planet, Jupiter = striped planet. >> >> As you say, though, without a context, none of them convey much and >> Venus, at least, would just be a circle. >> >> Plus there's the question of the context in which someone would want to >> send little pictures of the planets. This sounds like it would be adding >> emoji just because. >> >> >> "Earth" as in "a blue ball in space" is something that reached iconic >> status after the famous photo taken during the early Apollo missions. I >> could definitely see that used in a variety of possible contexts. And the >> recognition value is higher than for many recent emoji. >> >> Saturn, with its rings (even though it's no longer the only one known >> with rings) also is iconic and highly recognizable. I lack imagination as >> to when someone would want to use it in communication, but I have the same >> issue with quite a few recent emoji, some of which are far less iconic or >> recognizable. I think it does lend itself to describe a "non-earth" type >> planet, or even the generic idea of a planet (as opposed to a star/sun). >> >> Mars and Venus have tons of connotations, which could be expressed by >> using an emoji (as opposed to the astrological symbol for each), but only >> Mars is reasonably recognizable without lots of pre-established context. >> That red color. >> >> In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes >> and red dot would more recognizable than any of the remaining planets (on >> par or better with many recent emoji), but I see even less scope for using >> it metaphorically or in extended contexts. >> >> If someone were to make a proposal, I would suggest to them to limit it >> to these four and to provide more of a suggestion as to how these might >> show up in use. >> >> A./ >> >> >> On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode < >> unicode at unicode.org> wrote: >> >> On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: >> >> Hello people. >> >> We have sun, earth and moon emoji (3 for the earth and more for the >> moon's phases). But we don't have emoji for the rest of the planets. >> >> We have astrological symbols for all the planets and a few >> non-existent imaginary "planets" as well. >> >> Given this, would it be impractical to encode proper emoji characters >> for the rest of the planets, at least the major ones whose physical >> characteristics are well known and identifiable? >> >> I mean for example identifying Sedna and Quaoar >> (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not >> going to be practical for all those other than astronomy buffs but the >> physical shapes of the major planets are known to all high school >> students? 
>> >> >> Earth = blue planet (with clouds) >> >> Mars = red planet >> >> Saturn = planet with rings >> >> I don't think any of the other ones are identifiable in a context-free >> setting, unless you draw a "big planet with red dot" for Jupiter. >> >> Earth would have to be depicted in a way that doesn't focus on >> "hemispheres", or you miss the idea of it as "planet". >> >> >> A./ >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 17:25:02 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 18 Jan 2018 15:25:02 -0800 Subject: Emoji for major planets at least? In-Reply-To: References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> Message-ID: <94ad5a07-5b6c-1582-9be8-a2ea97b58e84@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 18:19:14 2018 From: unicode at unicode.org (Aleksey Tulinov via Unicode) Date: Fri, 19 Jan 2018 02:19:14 +0200 Subject: Emoji for major planets at least? In-Reply-To: <94ad5a07-5b6c-1582-9be8-a2ea97b58e84@ix.netcom.com> References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> <94ad5a07-5b6c-1582-9be8-a2ea97b58e84@ix.netcom.com> Message-ID: Perhaps we all shall stop being ironical to each other, calm down, sit and discuss how to encode 3D animated emojies (animojies) in Unicode. Adopting something like COLLADA would be sweet. I guess COLLADA, being XML-based standard, already can be encoded by Unicode, so it shouldn't be a lot of hustle, just some paper work, right? 2018-01-19 1:25 GMT+02:00 Asmus Freytag via Unicode : > On 1/18/2018 1:59 PM, Walter Tross via Unicode wrote: > > Sorry guys if I step in uninvited, but I must say that I had hoped that > the subject of this thread was ironical. > > > Of course not, how could you think that? > > Do you guys want to have an emoji for every entry of some encyclopaedia? > You need JPEG, PNG, etc., not Unicode. > > > Clearly, the natural progression of modern communication is away from > bothersome alphabetic recordings of spoken sound to the expressive power of > picture-writing. > > You can't possibly dream of standing in the way of this evolution! > > A./ > > > Sorry > Walter > > 2018-01-18 22:10 GMT+01:00 Philippe Verdy via Unicode >: > >> Well I can think of a popular pseudo-planet, the "Death Star" or "Black >> Star" (for the "Star Wars" series), which is easily recognized by its color >> and shape (with the deep built crater, and optionally its destroyed half >> part) which also looks like a real planet, the Saturnian moon Mimas with >> its very wide crater (to avoid the copyright issue)... >> >> 2018-01-18 20:04 GMT+01:00 Anshuman Pandey via Unicode < >> unicode at unicode.org>: >> >>> Proposals for planet emoji were submitted in April 2017: >>> >>> https://www.unicode.org/L2/L2017/17100-planet-emoji-seq.pdf >>> >>> http://www.unicode.org/L2/L2017/17100r-planet-emoji-seq.pdf >>> >>> I?m not sure what the result was. >>> >>> Anshu >>> >>> >>> On Jan 18, 2018, at 12:46 PM, Asmus Freytag (c) via Unicode < >>> unicode at unicode.org> wrote: >>> >>> On 1/18/2018 10:01 AM, John H. Jenkins wrote: >>> >>> Well, you can go with Venus = white planet, Mercury = grey planet, >>> Uranus = greenish planet, Neptune = bluish planet, Jupiter = striped >>> planet. 
>>> >>> As you say, though, without a context, none of them convey much and >>> Venus, at least, would just be a circle. >>> >>> Plus there's the question of the context in which someone would want to >>> send little pictures of the planets. This sounds like it would be adding >>> emoji just because. >>> >>> >>> "Earth" as in "a blue ball in space" is something that reached iconic >>> status after the famous photo taken during the early Apollo missions. I >>> could definitely see that used in a variety of possible contexts. And the >>> recognition value is higher than for many recent emoji. >>> >>> Saturn, with its rings (even though it's no longer the only one known >>> with rings) also is iconic and highly recognizable. I lack imagination as >>> to when someone would want to use it in communication, but I have the same >>> issue with quite a few recent emoji, some of which are far less iconic or >>> recognizable. I think it does lend itself to describe a "non-earth" type >>> planet, or even the generic idea of a planet (as opposed to a star/sun). >>> >>> Mars and Venus have tons of connotations, which could be expressed by >>> using an emoji (as opposed to the astrological symbol for each), but only >>> Mars is reasonably recognizable without lots of pre-established context. >>> That red color. >>> >>> In a detailed enough rendering, Jupiter, as a shaded "ball" with stripes >>> and red dot would more recognizable than any of the remaining planets (on >>> par or better with many recent emoji), but I see even less scope for using >>> it metaphorically or in extended contexts. >>> >>> If someone were to make a proposal, I would suggest to them to limit it >>> to these four and to provide more of a suggestion as to how these might >>> show up in use. >>> >>> A./ >>> >>> >>> On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode < >>> unicode at unicode.org> wrote: >>> >>> On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote: >>> >>> Hello people. >>> >>> We have sun, earth and moon emoji (3 for the earth and more for the >>> moon's phases). But we don't have emoji for the rest of the planets. >>> >>> We have astrological symbols for all the planets and a few >>> non-existent imaginary "planets" as well. >>> >>> Given this, would it be impractical to encode proper emoji characters >>> for the rest of the planets, at least the major ones whose physical >>> characteristics are well known and identifiable? >>> >>> I mean for example identifying Sedna and Quaoar >>> (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not >>> going to be practical for all those other than astronomy buffs but the >>> physical shapes of the major planets are known to all high school >>> students? >>> >>> >>> Earth = blue planet (with clouds) >>> >>> Mars = red planet >>> >>> Saturn = planet with rings >>> >>> I don't think any of the other ones are identifiable in a context-free >>> setting, unless you draw a "big planet with red dot" for Jupiter. >>> >>> Earth would have to be depicted in a way that doesn't focus on >>> "hemispheres", or you miss the idea of it as "planet". >>> >>> >>> A./ >>> >>> >>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 18 19:12:04 2018 From: unicode at unicode.org (Pierpaolo Bernardi via Unicode) Date: Fri, 19 Jan 2018 02:12:04 +0100 Subject: Emoji for major planets at least? 
In-Reply-To: References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> <94ad5a07-5b6c-1582-9be8-a2ea97b58e84@ix.netcom.com> Message-ID: On Fri, Jan 19, 2018 at 1:19 AM, Aleksey Tulinov via Unicode wrote: > Perhaps we all shall stop being ironical to each other, calm down, sit and > discuss how to encode 3D animated emojies (animojies) in Unicode. Adopting > something like COLLADA would be sweet. I guess COLLADA, being XML-based > standard, already can be encoded by Unicode, so it shouldn't be a lot of > hustle, just some paper work, right? What??? No MPEG-4? COLLADA is a step in the right direction, but it doesn't encode sounds! From unicode at unicode.org Fri Jan 19 02:42:44 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 19 Jan 2018 08:42:44 +0000 Subject: Emoji for major planets at least? In-Reply-To: References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> <10D65683-2738-4A2A-831B-E27DF665B52A@umich.edu> <94ad5a07-5b6c-1582-9be8-a2ea97b58e84@ix.netcom.com> Message-ID: <20180119084244.7bb6e31a@JRWUBU2> On Fri, 19 Jan 2018 02:12:04 +0100 Pierpaolo Bernardi via Unicode wrote: > On Fri, Jan 19, 2018 at 1:19 AM, Aleksey Tulinov via Unicode > wrote: > > Perhaps we all shall stop being ironical to each other, calm down, > > sit and discuss how to encode 3D animated emojies (animojies) in > > Unicode. Adopting something like COLLADA would be sweet. I guess > > COLLADA, being XML-based standard, already can be encoded by > > Unicode, so it shouldn't be a lot of hustle, just some paper work, > > right? > > What??? No MPEG-4? > > COLLADA is a step in the right direction, but it doesn't encode > sounds! Isn't there the issue that Unicode is supposed to encode writing? I only see two secure precedents for the encoding of multimedia emoji: 1) Korean compatibility ideographs which differ only in pronunciation 2) Character-level mark-up by CGJ for collation to distinguish German umlaut and diaeresis (if they are truly indistinguishable in all styles) and by CGJ to distinguish strings for the purposes of collation. (Soft hyphens are also usable in this r?le.) Of course, multimedia *glyphs* are permitted. Of course, it's rather tricky to print animated glyphs using muggle inks. Richard. From unicode at unicode.org Fri Jan 19 03:16:25 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Fri, 19 Jan 2018 14:46:25 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: Wow. Somebody really needs to convey this to the Kazhaks. Else a short-sighted decision would ruin their chances at native IDNs. Any Kazhaks on this list? On 19-Jan-2018 00:23, "Asmus Freytag via Unicode" wrote: > Top level IDN domain names can not contain 02BC, nor 0027 or 2019. > > (RFC 6912 gives the rationale and RZ-LGR the implementation, see MSR-3 > ) > > A./ > > On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: > > > > On 18 Jan 2018, at 08:21, Andre Schappo via Unicode > wrote: > > > > On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > On Mon, 15 Jan 2018 20:16:21 -0800 > James Kass via Unicode wrote: > > It will probably be the ASCII apostrophe. 
The stated intent favors > the apostrophe over diacritics or special characters to ensure that > the language can be input to computers with standard keyboards. > > > Typing U+0027 into a word processor takes planning. Of the three, it > should obviously be the modifier letter U+02BC, but I think what gets > stored will be U+0027 or the single quotation mark U+2019. > > However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA > ABOVE RIGHT. > > Richard. > > > I have just tested twitter hashtags and as one would expect, U+02BC does > not break hashtags. See twitter.com/andreschappo/status/953903964722024448 > > > ...and, just in case twitter.com/andreschappo/status/953944089896083456 > > > Andr? Schappo > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 03:39:11 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Fri, 19 Jan 2018 09:39:11 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: On 19 January 2018 at 09:16, Shriramana Sharma via Unicode wrote: > Wow. Somebody really needs to convey this to the Kazhaks. Else a > short-sighted decision would ruin their chances at native IDNs. Any Kazhaks > on this list? There's only one Kazakh who counts, and I'm pretty sure he's not on this list. Andrew From unicode at unicode.org Fri Jan 19 07:19:53 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 19 Jan 2018 13:19:53 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: I?d go talk with him :-) I published Alice in Kazakh. He might like that. Michael > On 19 Jan 2018, at 09:39, Andrew West via Unicode wrote: > > On 19 January 2018 at 09:16, Shriramana Sharma via Unicode > wrote: >> Wow. Somebody really needs to convey this to the Kazhaks. Else a >> short-sighted decision would ruin their chances at native IDNs. Any Kazhaks >> on this list? > > There's only one Kazakh who counts, and I'm pretty sure he's not on this list. > > Andrew From unicode at unicode.org Fri Jan 19 07:35:23 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Fri, 19 Jan 2018 19:05:23 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: You can just mail him or Skype-call him no? ?? On 19-Jan-2018 18:53, "Michael Everson via Unicode" wrote: > I?d go talk with him :-) I published Alice in Kazakh. He might like that. > > Michael > > > On 19 Jan 2018, at 09:39, Andrew West via Unicode > wrote: > > > > On 19 January 2018 at 09:16, Shriramana Sharma via Unicode > > wrote: > >> Wow. Somebody really needs to convey this to the Kazhaks. Else a > >> short-sighted decision would ruin their chances at native IDNs. Any > Kazhaks > >> on this list? > > > > There's only one Kazakh who counts, and I'm pretty sure he's not on this > list. > > > > Andrew > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Jan 19 07:37:40 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 19 Jan 2018 14:37:40 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: May be the IDN could accept a new combining diacritic (sort of right-side acute accent). After all the Kazakh intent is not to define a new separate character but a modification of base letter to create a single letter in their alphabet. So a proposal for COMBINING APOSTROPHE (whose spacing non-combining version is 02BC), so that SPACE+COMBINING APOSTROPHE will render exactly like 02BC. 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode : > Top level IDN domain names can not contain 02BC, nor 0027 or 2019. > > (RFC 6912 gives the rationale and RZ-LGR the implementation, see MSR-3 > ) > > A./ > > > On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: > > > > On 18 Jan 2018, at 08:21, Andre Schappo via Unicode > wrote: > > > > On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > On Mon, 15 Jan 2018 20:16:21 -0800 > James Kass via Unicode wrote: > > It will probably be the ASCII apostrophe. The stated intent favors > the apostrophe over diacritics or special characters to ensure that > the language can be input to computers with standard keyboards. > > > Typing U+0027 into a word processor takes planning. Of the three, it > should obviously be the modifier letter U+02BC, but I think what gets > stored will be U+0027 or the single quotation mark U+2019. > > However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA > ABOVE RIGHT. > > Richard. > > > I have just tested twitter hashtags and as one would expect, U+02BC does > not break hashtags. See twitter.com/andreschappo/status/953903964722024448 > > > ...and, just in case twitter.com/andreschappo/status/953944089896083456 > > > Andr? Schappo > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 07:42:52 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 19 Jan 2018 14:42:52 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: Hmmm.... that character exists already at 0+0315 (a combining comma above right). It would work for the new Kazah orthographic system, including for collation purpose. I don't think IDN rejects this combining version. 2018-01-19 14:37 GMT+01:00 Philippe Verdy : > May be the IDN could accept a new combining diacritic (sort of right-side > acute accent). After all the Kazakh intent is not to define a new separate > character but a modification of base letter to create a single letter in > their alphabet. > So a proposal for COMBINING APOSTROPHE (whose spacing non-combining > version is 02BC), so that SPACE+COMBINING APOSTROPHE will render exactly > like 02BC > > 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode > : > >> Top level IDN domain names can not contain 02BC, nor 0027 or 2019. 
>> >> (RFC 6912 gives the rationale and RZ-LGR the implementation, see MSR-3 >> ) >> >> A./ >> >> >> On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: >> >> >> >> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode >> wrote: >> >> >> >> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode < >> unicode at unicode.org> wrote: >> >> On Mon, 15 Jan 2018 20:16:21 -0800 >> James Kass via Unicode wrote: >> >> It will probably be the ASCII apostrophe. The stated intent favors >> the apostrophe over diacritics or special characters to ensure that >> the language can be input to computers with standard keyboards. >> >> >> Typing U+0027 into a word processor takes planning. Of the three, it >> should obviously be the modifier letter U+02BC, but I think what gets >> stored will be U+0027 or the single quotation mark U+2019. >> >> However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA >> ABOVE RIGHT. >> >> Richard. >> >> >> I have just tested twitter hashtags and as one would expect, U+02BC does >> not break hashtags. See twitter.com/andreschappo/s >> tatus/953903964722024448 >> >> >> ...and, just in case twitter.com/andreschappo/status/953944089896083456 >> >> >> Andr? Schappo >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 07:47:43 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 19 Jan 2018 13:47:43 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: <3BF8C43A-297D-4E3F-82E2-B585614B3788@evertype.com> There?s no redeeming this orthography. > On 19 Jan 2018, at 13:42, Philippe Verdy via Unicode wrote: > > Hmmm.... that character exists already at 0+0315 (a combining comma above right). It would work for the new Kazah orthographic system, including for collation purpose. I don't think IDN rejects this combining version. > > > 2018-01-19 14:37 GMT+01:00 Philippe Verdy : > May be the IDN could accept a new combining diacritic (sort of right-side acute accent). After all the Kazakh intent is not to define a new separate character but a modification of base letter to create a single letter in their alphabet. > So a proposal for COMBINING APOSTROPHE (whose spacing non-combining version is 02BC), so that SPACE+COMBINING APOSTROPHE will render exactly like 02BC > > 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode : > Top level IDN domain names can not contain 02BC, nor 0027 or 2019. > > (RFC 6912 gives the rationale and RZ-LGR the implementation, see MSR-3) > > A./ > > > On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: >> >> >>> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode wrote: >>> >>> >>> >>>> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode wrote: >>>> >>>> On Mon, 15 Jan 2018 20:16:21 -0800 >>>> James Kass via Unicode wrote: >>>> >>>>> It will probably be the ASCII apostrophe. The stated intent favors >>>>> the apostrophe over diacritics or special characters to ensure that >>>>> the language can be input to computers with standard keyboards. >>>> >>>> Typing U+0027 into a word processor takes planning. Of the three, it >>>> should obviously be the modifier letter U+02BC, but I think what gets >>>> stored will be U+0027 or the single quotation mark U+2019. 
>>>> >>>> However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA >>>> ABOVE RIGHT. >>>> >>>> Richard. >>> >>> I have just tested twitter hashtags and as one would expect, U+02BC does not break hashtags. See twitter.com/andreschappo/status/953903964722024448 >>> >> >> ...and, just in case twitter.com/andreschappo/status/953944089896083456 >> >> Andr? Schappo >> > > > From unicode at unicode.org Fri Jan 19 07:51:35 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 19 Jan 2018 14:51:35 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: Also U+0315 is not part of any decomposition for canonical normalization purpose, so it would remain encoded separately (only subject to possible reordering if there are other diacritics) 2018-01-19 14:37 GMT+01:00 Philippe Verdy : > May be the IDN could accept a new combining diacritic (sort of right-side > acute accent). After all the Kazakh intent is not to define a new separate > character but a modification of base letter to create a single letter in > their alphabet. > So a proposal for COMBINING APOSTROPHE (whose spacing non-combining > version is 02BC), so that SPACE+COMBINING APOSTROPHE will render exactly > like 02BC. > > 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode > : > >> Top level IDN domain names can not contain 02BC, nor 0027 or 2019. >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 07:51:43 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Fri, 19 Jan 2018 13:51:43 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: On 19 January 2018 at 13:19, Michael Everson via Unicode wrote: > > I?d go talk with him :-) I published Alice in Kazakh. He might like that. Damn, you'll have to reprint it with apostrophes now. Andrew From unicode at unicode.org Fri Jan 19 07:56:48 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Fri, 19 Jan 2018 14:56:48 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <3BF8C43A-297D-4E3F-82E2-B585614B3788@evertype.com> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <3BF8C43A-297D-4E3F-82E2-B585614B3788@evertype.com> Message-ID: 2018-01-19 14:47 GMT+01:00 Michael Everson via Unicode : > There?s no redeeming this orthography. This is not a redeeming, the Kazakh government currently has not made any assesment of how to encode their proposed system. Who said that was was proposed by them was an "apostrophe" ? May be they jsut wanted to use the ASCII apostrophe for compatibility with their legacy systems (but it's like the hack used in legacy ASCII-only system to represent [?] as [e'] : it's a workaround but this caused enough serious problems that we then all used the correct encoding of an acute accent, as a separate combining character or precombined with letters). And here we were suggesting several other characters. 
For me U+0315 is the best match for what they propose. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 08:16:05 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Fri, 19 Jan 2018 15:16:05 +0100 (CET) Subject: Emoji for major planets at least? In-Reply-To: <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> References: <292f1f76-d469-04cf-6cca-26325084e68a@ix.netcom.com> <9c059394-b594-82c1-1c24-a68bf890ed5f@ix.netcom.com> Message-ID: <1796514713.89209.1516371365598@ox.hosteurope.de> Asmus Freytag: > > Saturn, with its rings (even though it's no longer the only one known > with rings) also is iconic and highly recognizable. I lack imagination > as to when someone would want to use it in communication, but I have the > same issue with quite a few recent emoji, some of which are far less > iconic or recognizable. I think it does lend itself to describe a > "non-earth" type planet, or even the generic idea of a planet (as > opposed to a star/sun). For what it's worth, the Sky Web logo was a planet with a ring or orbit and it was included in the J-Phone, later Vodafone then SoftBank, emoji set at position F-75 (next to the paperplane for their Skywalker service). As a proprietary logo, it was not included in the final proposal emerging from the emoji4unicode project, but it was documented as e-E78, EMOJI COMPATIBILITY SYMBOL-58. The image was animated where possible. -------------- next part -------------- A non-text attachment was scrubbed... Name: F75.gif Type: image/gif Size: 296 bytes Desc: not available URL: From unicode at unicode.org Fri Jan 19 08:23:29 2018 From: unicode at unicode.org (Michael Everson via Unicode) Date: Fri, 19 Jan 2018 14:23:29 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: I won?t. > On 19 Jan 2018, at 13:51, Andrew West via Unicode wrote: > > On 19 January 2018 at 13:19, Michael Everson via Unicode > wrote: >> >> I?d go talk with him :-) I published Alice in Kazakh. He might like that. > > Damn, you'll have to reprint it with apostrophes now. > > Andrew > From unicode at unicode.org Fri Jan 19 10:42:07 2018 From: unicode at unicode.org (Rick McGowan via Unicode) Date: Fri, 19 Jan 2018 08:42:07 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: <5A621FDF.5070503@unicode.org> Michael - Lemme know when you're ready to print. I have a huge bag of leftover apostrophes I can send you. On 1/19/2018 5:51 AM, Andrew West via Unicode wrote: > On 19 January 2018 at 13:19, Michael Everson via Unicode > wrote: >> I?d go talk with him :-) I published Alice in Kazakh. He might like that. > Damn, you'll have to reprint it with apostrophes now. > > Andrew > > From unicode at unicode.org Fri Jan 19 14:08:23 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 19 Jan 2018 12:08:23 -0800 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: <162106ac-1a6e-5770-3e03-b81d763382bd@ix.netcom.com> On 1/19/2018 5:37 AM, Philippe Verdy wrote: > May be the IDN could accept a new combining diacritic (sort of > right-side acute accent). After all the Kazakh intent is not to define > a new separate character but a modification of base letter to create a > single letter in their alphabet. > So a proposal for COMBINING APOSTROPHE (whose spacing non-combining > version is 02BC), so that SPACE+COMBINING APOSTROPHE will render > exactly like 02BC. > In the case of TLD IDNs what is at issue is the fact that it "renders exactly like" 02BC (which renders exactly like 2019). You can see the issue when you look at Andre's twitter tags: you can create two strings that look the same, but the part that is a hashtag is different. That is deemed an unacceptable security risk for TLD IDNs. If you encoded such a combining character, it would also not be eligible for TLD IDNs. A./ > 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode > >: > > Top level IDN domain names can not contain 02BC, nor 0027 or 2019. > > (RFC 6912 gives the rationale and RZ-LGR the implementation, see > MSR-3 ) > > A./ > > > On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: >> >> >>> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode >>> > wrote: >>> >>> >>> >>>> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode >>>> > wrote: >>>> >>>> On Mon, 15 Jan 2018 20:16:21 -0800 >>>> James Kass via Unicode >>> > wrote: >>>> >>>>> It will probably be the ASCII apostrophe. The stated intent favors >>>>> the apostrophe over diacritics or special characters to ensure >>>>> that >>>>> the language can be input to computers with standard keyboards. >>>> >>>> Typing U+0027 into a word processor takes planning.? Of the >>>> three, it >>>> should obviously be the modifier letter U+02BC, but I think >>>> what gets >>>> stored will be U+0027 or the single quotation mark U+2019. >>>> >>>> However, we shouldn't overlook the diacritic mark U+0315 >>>> COMBINING COMMA >>>> ABOVE RIGHT. >>>> >>>> Richard. >>> >>> I have just tested twitter hashtags and as one would expect, >>> U+02BC does not break hashtags. See >>> twitter.com/andreschappo/status/953903964722024448 >>> >>> >> >> ...and, just in case >> twitter.com/andreschappo/status/953944089896083456 >> >> >> >> Andr? Schappo >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 14:10:33 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Fri, 19 Jan 2018 12:10:33 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> On 1/19/2018 5:42 AM, Philippe Verdy wrote: > Hmmm.... that character exists already at 0+0315 (a combining comma > above right). It would work for the new Kazah?orthographic system, > including for collation purpose.? I don't think IDN rejects this > combining version. This is also ineligible for the Root Zone. 
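The hashtag behaviour Asmus describes can be reproduced with a Unicode-aware word pattern. A rough sketch in Python (the sample tag text is invented, and a word-character matcher is only an approximation of how real hashtag tokenizers behave, not a statement about Twitter's actual implementation):

    import re

    # U+02BC MODIFIER LETTER APOSTROPHE is a letter (gc=Lm); U+2019 and U+0027
    # are punctuation, so a naive '#\w+' matcher treats the three differently.
    variants = {
        'U+02BC': '#qazaqs\u02BCa',
        'U+2019': '#qazaqs\u2019a',
        'U+0027': "#qazaqs'a",
    }
    for label, text in variants.items():
        print(label, repr(re.search(r'#\w+', text).group()))
    # U+02BC '#qazaqsʼa'   (whole tag matched)
    # U+2019 '#qazaqs'     (tag stops before the quotation mark)
    # U+0027 '#qazaqs'     (same for the ASCII apostrophe)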
A./ > > > 2018-01-19 14:37 GMT+01:00 Philippe Verdy >: > > May be the IDN could accept a new combining diacritic (sort of > right-side acute accent). After all the Kazakh intent is not to > define a new separate character but a modification of base letter > to create a single letter in their alphabet. > So a proposal for COMBINING APOSTROPHE (whose spacing > non-combining version is 02BC), so that SPACE+COMBINING APOSTROPHE > will render exactly like 02BC > > 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode > >: > > Top level IDN domain names can not contain 02BC, nor 0027 or > 2019. > > (RFC 6912 gives the rationale and RZ-LGR the implementation, > see MSR-3 > ) > > A./ > > > On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: >> >> >>> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode >>> > wrote: >>> >>> >>> >>>> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode >>>> > wrote: >>>> >>>> On Mon, 15 Jan 2018 20:16:21 -0800 >>>> James Kass via Unicode >>> > wrote: >>>> >>>>> It will probably be the ASCII apostrophe.? The stated >>>>> intent favors >>>>> the apostrophe over diacritics or special characters to >>>>> ensure that >>>>> the language can be input to computers with standard >>>>> keyboards. >>>> >>>> Typing U+0027 into a word processor takes planning.? Of the >>>> three, it >>>> should obviously be the modifier letter U+02BC, but I think >>>> what gets >>>> stored will be U+0027 or the single quotation mark U+2019. >>>> >>>> However, we shouldn't overlook the diacritic mark U+0315 >>>> COMBINING COMMA >>>> ABOVE RIGHT. >>>> >>>> Richard. >>> >>> I have just tested twitter hashtags and as one would expect, >>> U+02BC does not break hashtags. See >>> twitter.com/andreschappo/status/953903964722024448 >>> >>> >> >> ...and, just in case >> twitter.com/andreschappo/status/953944089896083456 >> >> >> >> Andr? Schappo >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 18:41:54 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 20 Jan 2018 01:41:54 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> Message-ID: For the root zone may be, but not formally rejected by IDN, and the Kazakh zone could accept it without problem. It also has the advantage of allowing cleaner collation and contextual text extraction, and it also allows better placement of the combining character with its base in some dedicated pairs that will be suitable for Kazakh. But technically it would have been preferable to use the acute accent. Note that the acute accent is also looking much as apostrophes in Greek with capitals, and still it is not rejected ! (of course in IDN, case does not matter and there's a visual distinction in lowercase: but what would prohibit Kazakh pairs to have similar dictinctions at least for lowercase letters?). 
I don't understand the rationale: a combining accent (acute above, comma above, or comma above right) will always be better than the proposed reuse of punctuation apostrophes, and its intended semantic for Kazakh is certainly not a punctuation or elision mark, which will also occur in borrowed foreign names really using elision apostrophes or trigrams, and also not a separate modifier letter). Sometimes I think we should also discuss about the Breton trigram "c'h" which is not well represented with the legacy apostrophe, and a combining acute above or comma above or above right would be better : Breton keyboards or input methods can be improved to select the correct character to encode). 2018-01-19 21:10 GMT+01:00 Asmus Freytag (c) : > On 1/19/2018 5:42 AM, Philippe Verdy wrote: > > Hmmm.... that character exists already at 0+0315 (a combining comma above > right). It would work for the new Kazah orthographic system, including for > collation purpose. I don't think IDN rejects this combining version. > > > This is also ineligible for the Root Zone. > A./ > > > > 2018-01-19 14:37 GMT+01:00 Philippe Verdy : > >> May be the IDN could accept a new combining diacritic (sort of right-side >> acute accent). After all the Kazakh intent is not to define a new separate >> character but a modification of base letter to create a single letter in >> their alphabet. >> So a proposal for COMBINING APOSTROPHE (whose spacing non-combining >> version is 02BC), so that SPACE+COMBINING APOSTROPHE will render exactly >> like 02BC >> >> 2018-01-18 19:51 GMT+01:00 Asmus Freytag via Unicode > >: >> >>> Top level IDN domain names can not contain 02BC, nor 0027 or 2019. >>> >>> (RFC 6912 gives the rationale and RZ-LGR the implementation, see MSR-3 >>> ) >>> >>> A./ >>> >>> >>> On 1/18/2018 3:00 AM, Andre Schappo via Unicode wrote: >>> >>> >>> >>> On 18 Jan 2018, at 08:21, Andre Schappo via Unicode >>> wrote: >>> >>> >>> >>> On 16 Jan 2018, at 08:00, Richard Wordingham via Unicode < >>> unicode at unicode.org> wrote: >>> >>> On Mon, 15 Jan 2018 20:16:21 -0800 >>> James Kass via Unicode wrote: >>> >>> It will probably be the ASCII apostrophe. The stated intent favors >>> the apostrophe over diacritics or special characters to ensure that >>> the language can be input to computers with standard keyboards. >>> >>> >>> Typing U+0027 into a word processor takes planning. Of the three, it >>> should obviously be the modifier letter U+02BC, but I think what gets >>> stored will be U+0027 or the single quotation mark U+2019. >>> >>> However, we shouldn't overlook the diacritic mark U+0315 COMBINING COMMA >>> ABOVE RIGHT. >>> >>> Richard. >>> >>> >>> I have just tested twitter hashtags and as one would expect, U+02BC does >>> not break hashtags. See twitter.com/andreschappo/s >>> tatus/953903964722024448 >>> >>> >>> ...and, just in case twitter.com/andreschappo/status/953944089896083456 >>> >>> >>> Andr? Schappo >>> >>> >>> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 19 20:59:53 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 19 Jan 2018 18:59:53 -0800 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> Message-ID: Philippe Verdy wrote, > I don't understand the rationale: ... Maybe there isn't any. As Shriramana Sharma wrote earlier, >> Anyhow, it certainly can be difficult convincing >> non technical political people. And that's an understatement. This article... https://boingboing.net/2018/01/17/the-war-over-apostrophes-in-ka.html ... may offer additional insight. From unicode at unicode.org Fri Jan 19 22:24:04 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Sat, 20 Jan 2018 09:54:04 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> Message-ID: Announcing: Much ado about apostrophes A Play By William Codesphere Coming soon to a theatre near you... ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 20 00:45:18 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 19 Jan 2018 22:45:18 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <33fa628d-7f2d-dfc0-a84f-d9b61afd4ffc@ix.netcom.com> Message-ID: "Much ado about apostrophes" If the apostrophe thing doesn't work out, we might also look forward to "The Shaming of the Crew", a play in which the advisory panel gets blamed for not pointing out what they were pointing out all along. From unicode at unicode.org Sat Jan 20 14:04:49 2018 From: unicode at unicode.org (Simon Montagu via Unicode) Date: Sat, 20 Jan 2018 22:04:49 +0200 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> Message-ID: <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> On 19/01/18 15:37, Philippe Verdy via Unicode wrote: > May be the IDN could accept a new combining diacritic (sort of > right-side acute accent). After all the Kazakh intent is not to define a > new separate character but a modification of base letter to create a > single letter in their alphabet. Hardly. If they insist on using a modifier character available on "standard" keyboards instead of already-encoded letters and/or diacritics, they are unlikely to be interested in new characters. From unicode at unicode.org Sun Jan 21 06:49:46 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 21 Jan 2018 13:49:46 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> Message-ID: But there's NO standard keyboard in Kazakhstan with the Latin alphabet. 
Those you'll find are cyrillic keyboards with a way to type basic Latin. Or keyboards made for other countries. So this is not a good reason at all. In fact Kazakstan would have to create a keyboard standard for the Latin orthography, and there's no reason to not place diacritics or precombined common letters in their new alphabet. For now it looks they just decided to do nothing at all, meaning that people will use various foreign Latin layouts, and to be cost effective, they'll look at those keyboards already used in nearby countries using Latin orthographies (Romania, Moldavia, Poland, Turkey). I can understand they don't want the technical trick and difficulties of dotted vs undotted I used in Turkish, but a single diacritic would have solved it (and it is part of their solution which uses an apostrophe but could as well have been an acute), a diacritic which is present in precombined letters on many European keyboards). In my opinion they should still have based a Kazakh keyboard that uses the same location as existing Cyrillic letters, it would have sommethed the transition. Now typing separate apostrophes in Latin Kazakh will just slow down the input and strain a single finger too frequently to the same small key on the top keys row... People won't like it at all... they'll stick on using the Cyrillic keyboards, and only an IME will convert what they type to transliterate it to Latin: if ther's an IME, the fact it will be using apostrophes or diacritics will be equivalent (but corrections of text in Kazakh will be less problematic for what is really perceived as a single letter but now being two separate characters (press ony one key, generate two base characters, but need to press backspace twice for correction, and added complication when selecting text because now you have separate grapheme clusters and the apostrophe can play different roles as a modifier where it should form a cluster, or as an elision mark where it is a placeholder for separate letters, or as a separate punctuation sign for quotation...) 2018-01-20 21:04 GMT+01:00 Simon Montagu via Unicode : > On 19/01/18 15:37, Philippe Verdy via Unicode wrote: > > May be the IDN could accept a new combining diacritic (sort of > > right-side acute accent). After all the Kazakh intent is not to define a > > new separate character but a modification of base letter to create a > > single letter in their alphabet. > > Hardly. If they insist on using a modifier character available on > "standard" keyboards instead of already-encoded letters and/or > diacritics, they are unlikely to be interested in new characters. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 21 00:14:15 2018 From: unicode at unicode.org (David Melik via Unicode) Date: Sat, 20 Jan 2018 22:14:15 -0800 Subject: superscripts & subscripts for science/mathematics? Message-ID: <95c3220d-85e2-96f9-0fde-e2f553bb4c22@gmail.com> I don't know if this was discussed, but it'd help scientists/mathematicians if all Greek and Hebrew were available as superscript & subscript.? Mathematicians use certain such letters in standard notation of important expressions/formulae (superscript ? in Euler's Identity, subscript base ?, superscript ? in cardinality of real numbers, etc.)... actually we use all Greek letters, and since a few Hebrew (since 1800s) have standard mathematical meanings, more are used for variables.? 
After any such alphabets' letters are used, the rest are considered normal/standard to use in standard script, superscript, and subscript, for any educational usage, and future standard notation. From unicode at unicode.org Sun Jan 21 00:15:35 2018 From: unicode at unicode.org (David Melik via Unicode) Date: Sat, 20 Jan 2018 22:15:35 -0800 Subject: superscripts & subscripts for science/mathematics? Message-ID: <9e90ac88-6a49-f00d-7f71-216a4fce023b@gmail.com> I don't know if this was discussed, but it'd help scientists/mathematicians if all Greek and Hebrew were available as superscript & subscript.? Mathematicians use certain such letters in standard notation of important expressions/formulae (superscript ? in Euler's Identity, subscript base ?, superscript ? in cardinality of real numbers, etc.)... actually we use all Greek letters, and since a few Hebrew (since 1800s) have standard mathematical meanings, more are used for variables.? After any such alphabets' letters are used, the rest are considered normal/standard to use in standard script, superscript, and subscript, for any educational usage, and future standard notation. From unicode at unicode.org Sun Jan 21 12:49:45 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 21 Jan 2018 18:49:45 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> Message-ID: <20180121184945.2659a1ab@JRWUBU2> On Sun, 21 Jan 2018 13:49:46 +0100 Philippe Verdy via Unicode wrote: > But there's NO standard keyboard in Kazakhstan with the Latin > alphabet. Those you'll find are cyrillic keyboards with a way to type > basic Latin. Or keyboards made for other countries. I believe we're talking about physical keyboards here. From the Wikipedia web page https://kk.wikipedia.org/wiki/%D0%9F%D0%B5%D1%80%D0%BD%D0%B5%D1%82%D0%B0%D2%9B%D1%82%D0%B0 and the only credible pictures I can find - https://sabaqtar.kz/informatika/8876-pernetata-pernetatamen-tanysu.html (tolerable) and https://kaz.tengrinews.kz/gadgets/kazaksha-klaviatura-100-mektepte-syinaktan-ott-255562/ (poor) - I beg to differ. It seems that the available keyboards are labelled in Kazakh Cyrillic and US QWERTY. There is a different layout tagged as 'Kazakh national layout' at http://aitaber.kz/blog/komputer/3991.html - and again the keys are labelled for both writing systems. On-screen keyboards should not be an issue at all. So, what devices are you talking about? Richard. From unicode at unicode.org Sun Jan 21 21:35:16 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Mon, 22 Jan 2018 11:35:16 +0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180121184945.2659a1ab@JRWUBU2> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: It's probably still too difficult to input a character with umlaut for general people in 2018, like the official Chinese romanization system used the character "?", but because it's so hard to be input or process many people in many occasion just use "v" instead and more recently standarised "yu" as a replacement for the character. 
There are language-dependent keyboards for French or German with special keys or deadkeys that help input these umlauts, but they are language dependent and it is not possible for e.g. a regular American user using Windows to simply type them out, at least not without prior knowledge about these umlauts. 2018-01-22 2:49 GMT+08:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Sun, 21 Jan 2018 13:49:46 +0100 > Philippe Verdy via Unicode wrote: > > > But there's NO standard keyboard in Kazakhstan with the Latin > > alphabet. Those you'll find are cyrillic keyboards with a way to type > > basic Latin. Or keyboards made for other countries. > > I believe we're talking about physical keyboards here. From the > Wikipedia web page > https://kk.wikipedia.org/wiki/%D0%9F%D0%B5%D1%80%D0%BD%D0%B5 > %D1%82%D0%B0%D2%9B%D1%82%D0%B0 > and the only credible pictures I can find - > https://sabaqtar.kz/informatika/8876-pernetata-pernetatamen-tanysu.html > (tolerable) and > https://kaz.tengrinews.kz/gadgets/kazaksha-klaviatura-100- > mektepte-syinaktan-ott-255562/ > (poor) > - I beg to differ. It seems that the available keyboards are labelled > in Kazakh Cyrillic and US QWERTY. > > There is a different layout tagged as 'Kazakh national layout' at > http://aitaber.kz/blog/komputer/3991.html - and again the keys are > labelled for both writing systems. > > On-screen keyboards should not be an issue at all. > > So, what devices are you talking about? > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 22 00:34:12 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 21 Jan 2018 22:34:12 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171211101631.44155a27@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171211101631.44155a27@JRWUBU2> Message-ID: I was looking the feedback in http://www.unicode.org/review/pri355/, and didn't see yours there. Could you please file your feedback there? (Nothing on this list is tracked by the committee...) FYI, I'm thinking now that the change should be: GB9c: (Virama | ZWJ ) ? LinkingConsonant => GB9c: (Virama ViramaExtend* | ZWJ ) ? LinkingConsonant where ViramaExtend = [Extend - Virama - \p{ccc=0}] (This is pre-partitioning.) That is close to your formulation, but for for canonical equivalence, there shouldn't need to allow the ViramaExtend after ZWJ, because the ZWJ has ccc=0, and thus nothing reorders around it. Cibu also pointed out on a different thread that for Malayalam we need to consider a couple of other forms: ... Following contexts should be allowed for requesting reformed or traditional conjuncts as per Unicode10.0.0/ch12 page 505. ... /$L ZWNJ $V $L/ /$L ZWJ $V $L/ The ZWJ Virama sequence is already provided for by the combination of GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would mean the addition of something like: GB9d: ? (ZWNJ ViramaExtend* Virama) Cibu also wrote: Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences involving legacy chillus. That is, for example, is a valid sequence (Examples in Unicode10.0.0/ch12 Table 12.36). It's legacy equivalent would be . It might be OK to disallow this; but, we should be mindful of this side effect. ?To account for the legacy cases, the simplest approach might be to add some characters to GCB= LinkingConsonant ? 
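To make the effect of the proposed rule concrete, here is a deliberately tiny sketch in Python. It is not the UAX #29 rule set: GB9/GB9a are collapsed into "any combining mark extends the cluster", ZWJ/ZWNJ and the ViramaExtend refinement are ignored, and the Virama and LinkingConsonant sets are hand-picked Devanagari stand-ins used only for this example:

    import unicodedata

    VIRAMA = {'\u094D'}                                   # DEVANAGARI SIGN VIRAMA
    CONSONANT = {chr(c) for c in range(0x0915, 0x093A)}   # KA..HA, a LinkingConsonant stand-in

    def toy_clusters(text):
        out = []
        for ch in text:
            extend = unicodedata.category(ch).startswith('M')               # crude GB9/GB9a
            joins = bool(out) and ch in CONSONANT and out[-1][-1] in VIRAMA  # proposed GB9c
            if out and (extend or joins):
                out[-1] += ch
            else:
                out.append(ch)
        return out

    # KA, VIRAMA, SSA, VOWEL SIGN I: two extended grapheme clusters today,
    # but a single unit once Virama x LinkingConsonant is added.
    print(toy_clusters('\u0915\u094D\u0937\u093F'))   # ['क्षि']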
Note: ?The final date for deciding exactly what to do with #29 will be in April, so there is some more time to discuss this. But we have to have a pretty solid proposal going into that April meeting. ? The only test files that we have gotten from India so far include Devanagari, Malayalam and Bengali. I suspect that the UTC is likely to be conservative, and limit the GCB=Virama category to just those scripts that we have test files for ?, and that look complete.? ? Mark On Mon, Dec 11, 2017 at 2:16 AM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Sun, 10 Dec 2017 21:14:18 -0800 > Manish Goregaokar via Unicode wrote: > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > > You can also explicitly request ligatureification with a ZWJ, so > > perhaps this rule should be something like > > > > (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant > > > > -Manish > > > > On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ?? via Unicode < > > unicode at unicode.org> wrote: > > > > > 1. You make a good point about the GB9c. It should probably instead > > > be something like: > > > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > This change is unnecessary. If we start from Draft 1 where there are: > > GB9: ? (Extend | ZWJ | Virama) > GB9c: (Virama | ZWJ ) ? LinkingConsonant > > If the classes used in the rules are to be disjoint, we then have to > split Extend into something like ViramaExtend and OtherExtend to allow > normalised (NFC/NFD) text, at which point we may as well continue to > have rules that work without any normalisation. Informally, > > ViramaExtend = Extend and ccc ? 0. > > OtherExtend = Extend and ccc = 0. > > (We might need to put additional characters in ViramaExtend.) > > This gives us rules: > > GB9': ? (OtherExtend | ViramaExtend | ZWJ | Virama) > > GB9c': (Virama | ZWJ ) ViramaExtend* ? LinkingConsonant > > So, for a sequence , GB9' gives us > > virama ? ZWJ ? nukta LinkingConsonant > > and GB9c' gives us > > virama ? ZWJ ? nukta ? LinkingConsonant > > --- > In Rule GB9c, what examples justify including ZWJ? Are they just the C1 > half-forms? My knowledge suggests that > > GB9c'': Virama (ZWJ | ViramaExtend)* ? LinkingConsonant > > might be more appropriate. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 22 10:28:58 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 22 Jan 2018 09:28:58 -0700 Subject: SignWriting in U+40000 block Message-ID: <20180122092858.665a7a7059d7ee80bb4d670165c8327d.9b69fbafa0.wbe@email03.godaddy.com> The IETF is noting the progress of an updated draft: Formal SignWriting draft-slevinski-formal-signwriting-04 https://tools.ietf.org/html/draft-slevinski-formal-signwriting-04.html which continues to describe an implementation of SignWriting in the as-yet unassigned Plane 4, including a detailed breakdown of blocks for different types of characters. I know the struggle between Slevinski and Unicode is long and contentious, with Slevinski arguing for years that the Unicode encoding of SignWriting is useless because it doesn't encode position, and vowing that no implementation (under his aegis) will ever use it). Nevertheless, I wonder if it would be appropriate for Unicode or WG2, in some capacity, to protest in some formal way against this recommendation to arrogate an unassigned plane instead of using the PUA, which is the correct place for unassigned characters. 
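For what it is worth, the difference between squatting on an unassigned plane and using the PUA is visible to any conformant process. A two-line check with Python's unicodedata (whose character database reflects whichever Unicode version ships with the interpreter):

    import unicodedata

    print(unicodedata.category(chr(0x40000)))   # 'Cn' -- unassigned (Plane 4)
    print(unicodedata.category(chr(0xF0000)))   # 'Co' -- private use (Plane 15)
    print(unicodedata.category('\uE000'))       # 'Co' -- private use (BMP PUA block)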
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 22 10:39:57 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Mon, 22 Jan 2018 16:39:57 +0000 Subject: Internationalised Computer Science Exercises Message-ID: I continue my endeavours to get Unicode and Internationalisation into/onto (I am not sure which is correct) University and School Curricula. Here is another of my endeavours?? Yesterday, I drafted a final year student project specification for the 2018/2019 academic year. These projects will start in October but students will be choosing their project some time around June. The project involves producing a set of internationalised Computer Science exercises for both educators and students. Details at schappo.blogspot.co.uk/2018/01/computer-science-internationalization_21.html I am confident that more than one student will choose this project. By way of example, one programming challenge I set to students a couple of weeks ago involves diacritics. Please see jsfiddle.net/coas/wda45gLp There is huge potential for some really interesting and challenging Unicode exercises. If you have any suggestions for such exercises they would be most welcome. Email me direct or share on this list. TIA Andr? Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 22 11:55:16 2018 From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode) Date: Mon, 22 Jan 2018 18:55:16 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: References: Message-ID: <3f9e887d-fcfa-f99a-f3ec-a92ab342fd30@gmail.com> Le 22/01/2018 ? 17:39, Andre Schappo via Unicode a ?crit?: > > By way of example, one programming challenge I set to students a > couple of weeks ago involves diacritics. Please see > jsfiddle.net/coas/wda45gLp > > There is huge potential for some really interesting and challenging > Unicode exercises. If you have any suggestions for such exercises they > would be most welcome. Email me direct or share on this list. A simple challenge is to write a function which localize numbers in a script having decimal digits or parse them (i.e. which have characters with property Numeric_Type=Decimal, as explained in ?4.6 of the Unicode 10 standard). The list of these scripts is specified in table 22-3. There is usually a most one set of digits/script (with the exception of Arabic, Myanmar and Tai Tham). Then, of course, one can look at other numeral systems (CJK, Ethiopic, Roman, to name a few in contemporaneous use). The section 22.3 of the Unicode standard is an interesting starting point for these. A internationalised exercise which doesn?t (always) use unicode is the localization of separators in numbers: 2??+? = 1,027.14 in US and 1 027,14 in France. One also should not forget that half a million is 5,00,000 in India. These simple things can be very surprising the first time you meet them. ? Fr?d?ric From unicode at unicode.org Mon Jan 22 16:08:55 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 Jan 2018 22:08:55 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: Message-ID: <20180122220855.7b929272@JRWUBU2> On Mon, 22 Jan 2018 16:39:57 +0000 Andre Schappo via Unicode wrote: > By way of example, one programming challenge I set to students a > couple of weeks ago involves diacritics. 
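Relating to the number-localization exercise sketched above, a minimal illustration in Python (the choice of Devanagari and Arabic-Indic digits is arbitrary, it assumes the target digit set is encoded as a contiguous 0..9 run, which holds for the Numeric_Type=Decimal sets of Table 22-3, and it ignores grouping separators entirely):

    import unicodedata

    DEVANAGARI_ZERO = 0x0966      # U+0966 DEVANAGARI DIGIT ZERO
    ARABIC_INDIC_ZERO = 0x0660    # U+0660 ARABIC-INDIC DIGIT ZERO

    def localize_digits(s, zero):
        """Map ASCII digits onto a script's decimal digits, given its ZERO."""
        return ''.join(chr(zero + int(c)) if '0' <= c <= '9' else c for c in s)

    def parse_decimal_digits(s):
        """Parse a run of Numeric_Type=Decimal digits from any script."""
        return int(''.join(str(unicodedata.decimal(c)) for c in s))

    print(localize_digits('1027', DEVANAGARI_ZERO))     # १०२७
    print(localize_digits('1027', ARABIC_INDIC_ZERO))   # ١٠٢٧
    print(parse_decimal_digits('١٠٢٧'))                 # 1027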
Please see > jsfiddle.net/coas/wda45gLp Did any of them come up with the idea of using traces instead of strings? Richard. From unicode at unicode.org Mon Jan 22 17:02:42 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 Jan 2018 23:02:42 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: <3f9e887d-fcfa-f99a-f3ec-a92ab342fd30@gmail.com> References: <3f9e887d-fcfa-f99a-f3ec-a92ab342fd30@gmail.com> Message-ID: <20180122230242.1ddd9954@JRWUBU2> On Mon, 22 Jan 2018 18:55:16 +0100 Fr?d?ric Grosshans via Unicode wrote: > A simple challenge is to write a function which localize numbers in a > script having decimal digits or parse them (i.e. which have > characters with property Numeric_Type=Decimal, as explained in ?4.6 > of the Unicode 10 standard). The list of these scripts is specified > in table 22-3. There is usually a most one set of digits/script (with > the exception of Arabic, Myanmar and Tai Tham). Presumably you specify the task by defining the digit for zero. Would you expect them to successfully parse '10?2' (with diagonal in middle digit) as opposed to '102'? Do you expect them to get the New Tai Lue form for the number '1' correct - it's U+19DA rather than U+19D1! Richard. From unicode at unicode.org Mon Jan 22 17:26:38 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 22 Jan 2018 23:26:38 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: <20180122232638.66ef51b7@JRWUBU2> On Mon, 22 Jan 2018 11:35:16 +0800 Phake Nick via Unicode wrote: > There > are language-dependent keyboards for French or German with special > keys or deadkeys that help input these umlauts, but they are language > dependent and it is not possible for e.g. a regular American user > using Windows to simply type them out, at least not without prior > knowledge about these umlauts. I found the Windows 'US International' keyboard layout highly intuitive for accented Latin-1 characters. Richard. From unicode at unicode.org Mon Jan 22 18:55:45 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 22 Jan 2018 16:55:45 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: Phake Nick wrote, > ... and it is not possible for e.g. a regular American > user using Windows to simply type them out, at least not > without prior knowledge about these umlauts. Regular American users simply don't type umlauts, period. Eccentric American users needing umlauts, such as foreign language students or heavy metal enthusiasts, generally find an easy way. Practically everybody knows how to search the web. Earlier in this thread, Shriramana Sharma wrote, > Rejecting the digraph method (which is probably the > simplest) doesn't have much meaning because they have > different sounds in different languages all the time > like ch in English and German. 
Any Kazakh/Qazaq student ambitious enough to study a foreign language such as English is already sophisticated enough to easily distinguish differing digraph values between the two languages. English speakers face distinctions such as the difference between the "ch" in "chigger" versus "chiffon" daily without any apparent danger of confusion. With so much push-back, along with technical objections, hopefully the government will reconsider the apostrophe situation and go with digraphs or diacritics. From unicode at unicode.org Mon Jan 22 19:52:05 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 23 Jan 2018 10:52:05 +0900 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> On 2018/01/23 09:55, James Kass via Unicode wrote: > Any Kazakh/Qazaq student ambitious enough to study a foreign language > such as English is already sophisticated enough to easily distinguish > differing digraph values between the two languages. English speakers > face distinctions such as the difference between the "ch" in "chigger" > versus "chiffon" daily without any apparent danger of confusion. Well, there are many many easier orthographies than English, so I'd understand if the Kazakh don't want to take English as an example. > With > so much push-back, along with technical objections, hopefully the > government will reconsider the apostrophe situation and go with > digraphs or diacritics. I very much hope so too. One way to avoid confusion is to use one specific letter only as the second letter in digraphs. With the current orthography, they don't use w and x, so they could use one of these. But personally, I'd find accents more visually pleasing. Regards, Martin. From unicode at unicode.org Mon Jan 22 20:34:29 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 Jan 2018 02:34:29 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171211101631.44155a27@JRWUBU2> Message-ID: <20180123023429.24130691@JRWUBU2> On Sun, 21 Jan 2018 22:34:12 -0800 Mark Davis ?? via Unicode wrote: > FYI, I'm thinking now that the change should be: > > GB9c: (Virama | ZWJ ) ? LinkingConsonant > => > GB9c: (Virama ViramaExtend* | ZWJ ) ? LinkingConsonant > > where ViramaExtend = [Extend - Virama - \p{ccc=0}] > (This is pre-partitioning.) > > That is close to your formulation, but for for canonical equivalence, > there shouldn't need to allow the ViramaExtend after ZWJ, because the > ZWJ has ccc=0, and thus nothing reorders around it. These look fine. > Cibu also pointed out on a different thread that for Malayalam we > need to consider a couple of other forms: > > ... Following contexts should be allowed for requesting reformed or > traditional conjuncts as per Unicode10.0.0/ch12 page 505. ... > > /$L ZWNJ $V $L/ > /$L ZWJ $V $L/ > > The ZWJ Virama sequence is already provided for by the combination of > GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would > mean the addition of something like: > > GB9d: ? (ZWNJ ViramaExtend* Virama) This is OK by me for aksharas. 
It might make sense for Tai Tham as well, where various degrees of binding are attested in what you can think of as D.DH (as in 'buddha'). If the font formally ligates them but does not always ligate subscript 'DHA' (i.e. U+1A35 TAI THAM LETTER LOW THA), would provide the unligated form. Note than in Tai Tham, SAKOT primarily affects the C2 consonant. > > Cibu also wrote: > > > Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences > involving legacy chillus. That is, for example, E> is a valid sequence (Examples in Unicode10.0.0/ch12 Table 12.36). > E> It's legacy > equivalent would be . It might be OK to > disallow this; but, we should be mindful of this side effect. I see no problem here. By GB9, we get NA ? VIRAMA ? ZWJ SIGN_E By GB9a, we then get NA ? VIRAMA ? ZWJ ? SIGN_E Have I missed something? Do you want me to try to formally submit my comments from this post? I will be going to bed as soon as I've finished extract comments from this thread. Richard. From unicode at unicode.org Mon Jan 22 20:41:00 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 Jan 2018 02:41:00 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171211101631.44155a27@JRWUBU2> Message-ID: <20180123024100.4c38773a@JRWUBU2> On Sun, 21 Jan 2018 22:34:12 -0800 Mark Davis ?? via Unicode wrote: > The ZWJ Virama sequence is already provided for by the combination of > GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would > mean the addition of something like: > > GB9d: ? (ZWNJ ViramaExtend* Virama) I don't think we need ViramaExtend* here. The seqeunce should be followed by a base consonant, so there's no way for another mark to sneak in. Incidentally, I think ViramaExtend would be better named as NSExtend, with 'NS' for 'non-starter'. Richard. From unicode at unicode.org Mon Jan 22 21:06:04 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 23 Jan 2018 03:06:04 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171211101631.44155a27@JRWUBU2> Message-ID: <20180123030604.49dc6915@JRWUBU2> On Sun, 21 Jan 2018 22:34:12 -0800 Mark Davis ?? via Unicode wrote: > I was looking the feedback in http://www.unicode.org/review/pri355/, > and didn't see yours there. Could you please file your feedback > there? (Nothing on this list is tracked by the committee...) This is the submission I have just made: The major principled issue I have is that UAX#29 can no longer claim to have a sound definition of the concept of a 'user-perceived character'. Perhaps it never did. Some of the claims would be better if there were evidence to back them up. For example, this evening I did a quick bit of research and asked the Korean owner of the local Korean restaurant how many letters there were in the hangul spelling of 'Gangnam'. She traced out the spelling of the word (??) and came back with the answer '6'. UAX#29 claims it has 2 user-perceived characters. You might also argue that she has spent too long in England to be a useful informant. The following old paragraph causes grief for me: "As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. 
Grapheme clusters commonly behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. For example, when a grapheme cluster is represented internally by a character sequence consisting of base character + accents, then using the right arrow key would skip from the start of the base character to the end of the last accent." The problem is that many editors read this as saying that the arrow keys should move by whole characters. The result of this is that in many applications, to replace the first character of a grapheme cluster one must retype the entire grapheme cluster. With a grapheme cluster of three characters, as is common in Thai and Korean, this is irritating. With a grapheme cluster of four or five characters, as is common in Northern Thai, it is annoying. The prospect of the grapheme cluster being extended to include a whole akshara fills me with dismay. Consider the Northern Thai word ??????? /m??/ 'scrumptious'. At present, this 7 character word is split into three grapheme clusters, of lengths 2, 4 and 1. However, it is clearly a single akshara. To change the first character, I would have to also retype the other 6 characters. My first thought that changing software that way would breach the UK's Equality Act 2010, by further restricting the ability of Northern Thai users to do character by character editing. (My wife's protected characteristic extends to me for the purposes of the Act.) However, there may be a get-out in the form of Schedule 3 Section 30 (https://www.legislation.gov.uk/ukpga/2010/15/schedule/3/paragraph/30). The supplier of the service can claim that they only supply a character by character editing facility to the ethnic groups using simple scripts, and that they are under no obligation to supply the service to members of other ethnic groups. - "If a service is generally provided only for persons who share a protected characteristic, a person (A) who normally provides the service for persons who share that characteristic does not contravene section 29(1) or (2)? (a)by insisting on providing the service in the way A normally provides it, or (b)if A reasonably thinks it is impracticable to provide the service to persons who do not share that characteristic, by refusing to provide the service." But what an embarrassing defence to offer! However, there is another reason for rejecting the extension of grapheme clusters to whole aksharas. Currently, U+1A63 TAI THAM VOWEL SIGN AA starts a grapheme cluster. However, for non-defective text, it is part of the same akshara as the preceding grapheme cluster. Now, the decision to make U+1A63 start a new grapheme cluster is intrinsically reasonable. It can have its own stack with a subscript consonant and even a vowel, and it is not difficult to find manuscripts showing a line break before it, e.g. L2/07-007 Figure 9b Leaf 2 lines 2/3, ????????-?????. I believe that the akshara should be a level of text above the grapheme cluster. Ideally, it would be below the level of a word, but of course in Sanskrit, word boundaries readily occur within present day grapheme clusters. (I made this recommendation in L2/17-122.) Further comments apply to the definition of akshara boundaries, regardless of whether they are to coincide with the boundaries of grapheme clusters. These rules do not work well where virama may fall back to visible virama. This is particularly the case with Tamil, where conjuncts are restricted to K.SSA and SH.RII. Johny Cibu provided an example where the title ??????? 
is broken as [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed algorithm it would be: [ta-u, ka-virama-lla, ka-virama] http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg For native intuition, I would cite the Tamil letter-counting account at https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf. What the author counts is not spacing glyphs, but vowel letters and consonant characters, with two significant modifications. Firstly, K.SSA counts as just one consonant, and SH.R.II is also counted as containing a single consonant. In other words, the Tamil virama character works as a pure killer except in those two environments. This is also the story the TUNE protagonists tell us. It will be an inelegant rule for UAX#29, but, unfortunately, reality is messy. To quote Johny Cibu further: "Malayalam could be a similar story. In case of Malayalam, it can be font specific because of the existence of traditional and reformed writing styles. A conjunct might be a ligature in traditional; and it might get displayed with explicit virama in the reformed style. For example see the poster with word ??????? broken as [u, sa-virama, ta-aa, da-virama] - as it is written in the reformed style. As per the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. These breaks would be used by the traditional style of writing. https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg I believe there is a problem with the first two examples in Table 12-33. If one suffixed to the first two examples, yielding *??????? and *????????, one would have three Malayalam aksharas, not two extended grapheme clusters as the proposed rules would say. From unicode at unicode.org Mon Jan 22 21:34:38 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 22 Jan 2018 19:34:38 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20180123024100.4c38773a@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171211101631.44155a27@JRWUBU2> <20180123024100.4c38773a@JRWUBU2> Message-ID: Good point, thanks Mark On Mon, Jan 22, 2018 at 6:41 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Sun, 21 Jan 2018 22:34:12 -0800 > Mark Davis ?? via Unicode wrote: > > > The ZWJ Virama sequence is already provided for by the combination of > > GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would > > mean the addition of something like: > > > > GB9d: ? (ZWNJ ViramaExtend* Virama) > > I don't think we need ViramaExtend* here. The seqeunce should be > followed by a base consonant, so there's no way for another mark to > sneak in. > > Incidentally, I think ViramaExtend would be better named as NSExtend, > with 'NS' for 'non-starter'. > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 22 21:43:34 2018 From: unicode at unicode.org (David Melik via Unicode) Date: Mon, 22 Jan 2018 19:43:34 -0800 Subject: superscripts & subscripts for science/mathematics? Message-ID: On 01/21/2018 02:27 PM, Fr?d?ric Grosshans wrote: > Le 21/01/2018 ? 07:15, David Melik via Unicode a ?crit : >> I don't know if this was discussed, but it'd help >>scientists/mathematicians if all Greek and Hebrew were available as >>superscript & subscript. 
Mathematicians use certain such letters in >>standard notation of important expressions/formulae (superscript ? in >>Euler's Identity, subscript base ?, superscript ? in cardinality of >>real numbers, etc.)... actually we use all Greek letters, and since a >>few Hebrew (since 1800s) have standard mathematical meanings, more are >>used for variables. After any such alphabets' letters are used, the >>rest are considered normal/standard to use in standard script, >>superscript, and subscript, for any educational usage, and future >>standard notation. > >> Mathematics superscript and substript are supposed to be rich text, >not plain text. Furthermore, ?completing the set of mathematical >superscripts? is an impossible task, since one would need double >superscripts for e^(-x?) and even more exotic combinations for stuff as >common as e^x? On 01/22/2018 01:20 PM, Murray Sargent wrote: > Subscripts and superscripts are more complicated in mathematics than >in ordinary text in that they can be nested and can include arbitrary >operators, e.g., a superscripted superscript as in e^(-x^2). >Accordingly, encoding more Unicode subscripts and superscripts for >mathematics isn't general enough to be worthwhile and it can complicate >math input methods. In plain text, one can use a linear format such as >LaTeX or UnicodeMath. Ideally these formats can drive math display >engines that display elegant mathematical typography with arbitrary >combinations of subscripts, superscripts and other mathematical >constructs. ?The intended use was to allow chemical and algebra formulas to be written without markup?--https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts. Unless wrong, apart from disagreement, it's clear mathematics word processing software is useful, but not a reason to not finish almost-complete set of basic superscripts & subscripts ((super|sub)scripts) for relevant alphabets used (English, Greek, perhaps Hebrew, latter two which were in my original post subject line, but I likely accidentally used link I received to delete pre-moderated post.) Before rich-text, people used plain-text centuries, still do, such as plain-text files that may be about simpler topics, or informal/notes, and Internet areas predating all websites, such as standard (such as this non-HTML) email, NNTP/Usenet (still hundreds of mathematics posts/day,) Internet Relay Chat (IRC, still dozens of science & mathematics rooms, one math one with around 1000 people, busy all the time) etc., but the latter at least has Unicode (none are rich-text.) This shows how much of English has superscripts, and which letter doesn't: ????????????????q?????????. However, there are simple mathematics situations people use any/every letter, lowercase & uppercase, superscript & subscript (not sure about ?overscript.?) It's up to each science/math fan, student, writer, instructor what type of text they want (not just what you say is ?supposed to be,? ?can complicate.?) I never said make Unicode like super-complicated stuff math formatting software... only a small percentage of where people write math, which of course, writing isn't just advanced books, but also simple & informal/notes, and plain-text isn't just in text-editors, but also graphics editors. If not clearer now, all I was requesting was adding/completing Greek (super|sub)scripts, though had forgotten not all English ones exist, so those too, and I was suggesting Hebrew (super|sub)scripts... 
never mentioned supersuperscripts & subsubscripts, etc., which one of you showed then argued against (doesn't refute what I actually said.) I'm just talking about completing relevant alphabets for usage described ?chemical and algebra formulas,? which as I took algebra before high school, wasn't seeing super-complicated stuff that may or not be in college/university algebra texts, or are in derived fields with some algebra-type formulae. I'm only talking about simple, one-level (super|sub)scripts for largest variety of simple formulae, not ?completing the set? (in relation to all math) nor (super|sub)(super|sub)scripts as in replies with mixed style. The biggest problem for me is Euler's Formula & Identity, which through high school math of analysis/calculus (and on through several years to applied & abstract analysis) are usually considered the most important & beautiful formula & identity in mathematics (the formula modelling basis of all current physics, and the identity having the most important numbers, symbols/operations in math.) It's easy to write his formula plain-text, as below. e??=cos x+i sin x Almost every day in my plain-text notes/to-do-list, I read these, and discuss most weeks in math discussion areas (as mentioned) and ?in real life,? so thanks for i,x superscripts. However, writing his identity has a problem: must say it definitively has ? but am replacing with English letter that came from ? and is equivalent, p, as below. e??+1=0 So, I can only write that in standard from with a word processor, TeX, MathML, or (technical jargon, even ambiguous to many computer programmers/scientists, CS) graphing calculator notiation, ?e^(i?)+1=0.? :( It's not just for mathematics research (what one of you were talking about,) but (in)formal use by math fans, students, instructors when they use plain-text (which I and many I know use.) For example, For years I couldn't write Euler's Identity into graphics programs such as the Free/Libre Software (FL/S) one called GIMP (which doesn't have (super|sub)scripts (don't know about proprietary Adobe Photoshop graphics,) though it finally worked with Inkscape FL/S. So, that's another problem... people using the most widely-used graphics FL/S need these in plain-text, otherwise may learn a trick to make (super|sub)scripts in GIMP by moving (not resizing) the text (personally, I spent days reading about standard (super|sub)script sizes and more hours making three text layers) so if 100 people each make an image about Euler for educational uses, or for a t-shirt, in GIMP, all 100 are going to have varying, non-standard text appearance... unless the proper (super|sub)scripts are added. Of course, I've been aware of graphing calculator notation you used above like e^x, x_1, but punctuation there mean different things in much of CS (many CS whom didn't use graphing calculators, forgot, or started in school with newer math software) so it'd be helpful to have the proper superscripts for .TXT, email (and NNTP/Usenet?,) IRC. That's all I'm saying... not additional (super|sub)(super|sub)script levels... just, if you have some (super|sub)scripts in an alphabet, have a *basic* set of all (or all considered important)... particularly ?, maybe ?, ... Also, the problem isn't just these classic formats (.TXT, NNTP, IRC) not all you necessarily use (many scientists do)... still, some web-forums are plain-text-only (or have text forum formatting code, some which has major (super|sub)script bugs... 
others allow in HTML, which seems to have disadvantage of increasing line height.) Many have been ?dying? while sites like Facebook & Twitter are growing... which also have... plain-text. So, try to write about or discuss easy-to-read science/mathematics (in short posts) where most people are on the Internet, and you still run into the problem of Unicode currently being inadequate. On those biggest sites, they can still post various emojis about poop and ideological cults... maybe that's going to help people discuss ideas how to advance science for a better world? Not as much, I think... Sincerely, David From unicode at unicode.org Mon Jan 22 22:31:57 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 22 Jan 2018 20:31:57 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: Martin J. D?rst wrote, > ... One way to avoid confusion is to use one specific > letter only as the second letter in digraphs. With the current orthography, > they don't use w and x, so they could use one of these. But personally, I'd > find accents more visually pleasing. Me too: (bottle, east, skier, crucial, cherry) s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e From unicode at unicode.org Tue Jan 23 00:45:29 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 22 Jan 2018 22:45:29 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: For me, having to go around justifying my whims would probably take some of the fun out of being an authoritarian ruler. Which suggests that the apostrophe decision can be revised with no explanation expected, even though a simple explanation exists. Changing from the apostrophe to the combining acute accent above is, after all, essentially turning the apostrophe at a slight angle and writing it above the letter it modifies. This would not represent a reversed decision, simply a change in the style in which the already selected modifier is written. > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e From unicode at unicode.org Tue Jan 23 05:51:17 2018 From: unicode at unicode.org (Khaled Hosny via Unicode) Date: Tue, 23 Jan 2018 13:51:17 +0200 Subject: superscripts & subscripts for science/mathematics? In-Reply-To: References: Message-ID: <20180123115117.GF1155@macbook.localdomain> On Mon, Jan 22, 2018 at 07:43:34PM -0800, David Melik via Unicode wrote: > ?The intended use was to allow chemical and algebra formulas to be written > without > markup?--https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts. 
> Unless wrong, apart from disagreement, it's clear mathematics word > processing software is useful, but not a reason to not finish > almost-complete set of basic superscripts & subscripts ((super|sub)scripts) > for relevant alphabets used (English, Greek, perhaps Hebrew, latter two > which were in my original post subject line, but I likely accidentally used > link I received to delete pre-moderated post.) Mathematics written in Arabic notation use Arabic-Indic numbers and Arabic letters and they can occur in superscripts and subscripts as well. Regards, Khaled From unicode at unicode.org Tue Jan 23 08:23:54 2018 From: unicode at unicode.org (philip chastney via Unicode) Date: Tue, 23 Jan 2018 14:23:54 +0000 (UTC) Subject: superscripts & subscripts for science/mathematics? References: <648389554.3309279.1516717434103.ref@mail.yahoo.com> Message-ID: <648389554.3309279.1516717434103@mail.yahoo.com> . . . and do Russians still do mathematics? I guess not, since there is no Cyrillic counterpart to the AMS extensions also, chemists sometimes like to put a superscript over a subscript will that still have to be done using rich text? or maybe we need another extension . . . ? /phil -------------------------------------------- On Tue, 23/1/18, Khaled Hosny via Unicode wrote: Subject: Re: superscripts & subscripts for science/mathematics? To: "David Melik" Cc: unicode at unicode.org Date: Tuesday, 23 January, 2018, 11:51 AM On Mon, Jan 22, 2018 at 07:43:34PM -0800, David Melik via Unicode wrote: > ?The intended use was to allow chemical and algebra formulas to be written > without > markup?--https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts. > Unless wrong, apart from? disagreement, it's clear mathematics word > processing software is useful, but not a reason to not finish > almost-complete set of basic superscripts & subscripts ((super|sub)scripts) > for relevant alphabets used (English, Greek, perhaps Hebrew, latter two > which were in my original post subject line, but I likely accidentally used > link I received to delete pre-moderated post.) Mathematics written in Arabic notation use Arabic-Indic numbers and Arabic letters and they can occur in superscripts and subscripts as well. Regards, Khaled -----Inline Attachment Follows----- From unicode at unicode.org Tue Jan 23 09:18:11 2018 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Tue, 23 Jan 2018 16:18:11 +0100 (CET) Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: <194592430.121188.1516720691578@ox.hosteurope.de> James Kass: > > (bottle, east, skier, crucial, cherry) > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e [Esperanto orthography] provides the option to either choose x-digraphs, h-digraphs or caron diacritics (i.e. circumflex on consonants and breve on vowels) and there are some alternative proposals, e.g. substituting '?' by unused 'w'. 
No naturally evolved orthography, as far as I know, mixes consonant and vowel letters in digraphs, but 'j' and 'w' or 'v' can be both and 'h' is a special case. Using 'x' after consonants and 'w' after vowels would therefore make some sense, although it still looks strange to people used to natural graphotactics. Readability may be improved if not all diacritics are put above the base letter. s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e sxiwsxa, sxygxys, sxanxgxysxy, sxesxuwsxi, sxiwiwe shijsha, shyghys, shanhghyshy, sheshuwshi, shijije ?i??a, ?y?ys, ?a??y?y, ?e?u??i, ?i?i?e ?i??a, ?y?ys, ?a??y?y, ?e?u??i, ?i?i?e [Esperanto orthography]: https://en.wikipedia.org/wiki/Esperanto_orthography From unicode at unicode.org Tue Jan 23 11:58:12 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 23 Jan 2018 18:58:12 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <194592430.121188.1516720691578@ox.hosteurope.de> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> <194592430.121188.1516720691578@ox.hosteurope.de> Message-ID: Ukainian should follow the romanisation model used by Serbian which is clear for them and coherent with other uses in Eastern Europe: carons for modified consonnants, and acute accents (sometimes double acute in Hungarian) for vowels. Even if they want support with a legacy 8-bit charset, ISO 3166-2 or windows codepage 1250 would work for them without much complication. But there it's like if they wanted to do like German ignoring umlauts completely and using digrams with "e" instead, and ignore ess-tsetts and use digrams "ss" everywhere. This looks like a big return backward to the old age of ASCII-only "typography" of the 1960's (when also French or Italian were sometimes represented with their acute accents replaced by ugly digrams with apostrophes...) 2018-01-23 16:18 GMT+01:00 Christoph P?per via Unicode : > James Kass: > > > > (bottle, east, skier, crucial, cherry) > > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > > sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe > > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > > [Esperanto orthography] provides the option to either choose x-digraphs, > h-digraphs or caron diacritics (i.e. circumflex on consonants and breve on > vowels) and there are some alternative proposals, e.g. substituting '?' by > unused 'w'. No naturally evolved orthography, as far as I know, mixes > consonant and vowel letters in digraphs, but 'j' and 'w' or 'v' can be both > and 'h' is a special case. Using 'x' after consonants and 'w' after vowels > would therefore make some sense, although it still looks strange to people > used to natural graphotactics. Readability may be improved if not all > diacritics are put above the base letter. > > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > sxiwsxa, sxygxys, sxanxgxysxy, sxesxuwsxi, sxiwiwe > shijsha, shyghys, shanhghyshy, sheshuwshi, shijije > ?i??a, ?y?ys, ?a??y?y, ?e?u??i, ?i?i?e > ?i??a, ?y?ys, ?a??y?y, ?e?u??i, ?i?i?e > > [Esperanto orthography]: https://en.wikipedia.org/wiki/ > Esperanto_orthography > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 23 12:51:42 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 23 Jan 2018 11:51:42 -0700 Subject: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F?= Message-ID: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> I think it's so cute that some of us think we can advise Nazarbayev on whether to use straight or curly apostrophes or accents or x's or whatever. Like he would listen to a bunch of Western technocrats. An explicitly stated goal of the new orthography was to enable typing Kazakh on a "standard keyboard," meaning an English-language one. Nazarbayev may ultimately be persuaded to embrace ASCII digraphs, which also meet this goal, but this talk about U+2019 and U+02BC will make exactly zero difference in Kazakh policy. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 23 13:22:37 2018 From: unicode at unicode.org (Phake Nick via Unicode) Date: Wed, 24 Jan 2018 03:22:37 +0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: >I found the Windows 'US International' keyboard layout highly intuitive >for accented Latin-1 characters. How common is the US International keyboard in real life..? Users would still need to manually add them in Windows, or in other computing tools vendors would need to add support for "US International" before they can be used > Regular American users simply don't type umlauts, period. Eccentric Which is exactly why they aren't using unlauts. > American users needing umlauts, such as foreign language students or > heavy metal enthusiasts, generally find an easy way. Practically > everybody knows how to search the web. How about, for example, a random tourist looking for info of random Kazakhstan city? Will they know how to type umlaut in a city's name? Most likely they'll simply type it without any umlaut and lost the distinction -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 23 13:33:49 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 23 Jan 2018 11:33:49 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: Doug Ewell wrote, "I think it's so cute that some of us think we can advise Nazarbayev on whether to use straight or curly apostrophes or accents or x's or whatever. Like he would listen to a bunch of Western technocrats." Heh. We are offering sound advice. If people fail to heed it, that's too bad. From unicode at unicode.org Tue Jan 23 13:35:51 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 23 Jan 2018 19:35:51 +0000 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com>
References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com>
Message-ID: 

On Tue, Jan 23, 2018 at 10:55 AM Doug Ewell via Unicode wrote:

> I think it's so cute that some of us think we can advise Nazarbayev on
> whether to use straight or curly apostrophes or accents or x's or
> whatever. Like he would listen to a bunch of Western technocrats.

Kazakh has a perfectly serviceable alphabet right now, and they probably have plenty of keyboards that work for it. And I'm sure there's some Turkish firm that would be happy to deliver Turkish keyboards in bulk at quite reasonable prices. There are reasons why they're changing to an ASCII Latin script, and they're connected to the reasons he might listen to Western technocrats.

-------------- next part -------------- An HTML attachment was scrubbed... URL: 

From unicode at unicode.org Tue Jan 23 14:53:25 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 23 Jan 2018 21:53:25 +0100
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: 
References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com>
Message-ID: 

The best thing they could have done is to keep their existing keyboard layout, containing both the Cyrillic letters and the Latin QWERTY letters printed on the keys, but operating in two modes (depending on OS preferences) that swap the two layouts without changing the keystrokes. It would just have needed one Latin letter or modified Latin letter per Cyrillic letter, so that it was simply a 1-to-1 transliteration. No need to type an extra apostrophe.

No extra dead key was needed to remap to Latin the few Kazakh Cyrillic letters that are already typed with AltGr (the labels for them are on the bottom right of the key, with no character labeled above it, so it was also possible to use the upper position to indicate the associated Latin letter: when turning the keyboard to Latin mode instead of the current Cyrillic default, the positions of the letters do not change, the labels are still valid where they are, but AltGr produces the Latin letter labeled in the upper position). All existing keyboards would remain usable as is. Users would then choose the Latin or Cyrillic layout as they want and could still switch from one to the other.

Note that the placement of Cyrillic letters on the QWERTY layout of Latin letters is not a direct transliteration: the paired letters do not match, but that does not matter (mapping keys on keyboards is not necessarily a transliteration); the existing Basic Latin letters A-Z should remain where they are on the QWERTY layout. The other Cyrillic letters are on keys that won't move but that will carry the additional Latin letters needed for the language.

Note also that there are two Kazakh-Cyrillic layouts, including one where the most common punctuation (:,;.) is on two keys in the middle of the 1st row (digits are typed using AltGr or with the numeric keypad): this layout also should not change, and the same two keys will keep these punctuation marks. But here again there's a single keystroke for each Cyrillic letter on the other keys, which will also keep the QWERTY layout of the Latin letters in their alternate mode. Only the Cyrillic letters on other keys of the 2nd, 3rd and 4th rows will need to map the missing non-basic Latin letters, also typed with a single keystroke when the keyboard is turned to Latin mode. 
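[A quick aside to make the 1-to-1 transliteration idea above concrete. This is only a sketch: the Latin target letters below are made up for the demo, since which letters to pick is exactly what is being debated here, and this is not the actual Kazakh proposal.

    # Hypothetical 1:1 Cyrillic-to-Latin table; the Latin side is illustrative only.
    KK_CYR_TO_LAT = {
        "а": "a", "б": "b", "с": "s",
        "ш": "š",            # caron chosen arbitrarily for this demo
        "ғ": "ǵ", "ң": "ń", "ү": "ü",
    }
    # A mapping is reversible only if it is strictly 1:1.
    KK_LAT_TO_CYR = {lat: cyr for cyr, lat in KK_CYR_TO_LAT.items()}
    assert len(KK_LAT_TO_CYR) == len(KK_CYR_TO_LAT)

    def to_latin(text: str) -> str:
        # One code point in, one code point out: no digraphs, no apostrophes.
        return text.translate(str.maketrans(KK_CYR_TO_LAT))

    def to_cyrillic(text: str) -> str:
        return text.translate(str.maketrans(KK_LAT_TO_CYR))

Round-tripping any string built from the mapped letters gives back the original, which is what makes such a scheme lossless.]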
In all cases, the Latin keystrokes should generate only one character, not digraphs. And it's possible to use "correct" extended letters even if this requires some minor adaptation of existing transliterations (which use more complex rules with digraphs); it was perfectly possible to use Latin letters with acute accents for vowels, or with carons for consonants, or possibly the cedilla below s or c, for every Kazakh Cyrillic letter, to reach that goal without difficulty: a non-ambiguous, simple 1-to-1 transliteration, fully reversible, also allowing all historic texts in Cyrillic to be transliterated instantly without loss, while still allowing clear reading of the Latin text and easy composition. Unicode (but also the legacy ISO 8859 and Windows or MacOS codepages for Eastern European languages written in Latin script) already supports all the needed extended Latin characters.

-------------- next part -------------- An HTML attachment was scrubbed... URL: 

From unicode at unicode.org Tue Jan 23 15:52:46 2018
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 23 Jan 2018 21:52:46 +0000
Subject: 0027, 02BC, 2019, or a new character?
In-Reply-To: 
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2>
Message-ID: <20180123215246.56e459f0@JRWUBU2>

On Wed, 24 Jan 2018 03:22:37 +0800 Phake Nick via Unicode wrote:

> >I found the Windows 'US International' keyboard layout highly
> >intuitive for accented Latin-1 characters.
> How common is the US International keyboard in real life..?

I thought it was two copies per new Windows PC - one for 32- and the other for 64-bit code. I was talking about the *layout*. The apostrophe, quote, grave and circumflex on the usual US keyboard are good enough labels for the acute, umlaut, grave and circumflex dead keys. (Now, '?' is a problem.)

> Users would still need to manually add them in Windows, or in other
> computing tools vendors would need to add support for "US
> International" before they can be used

Select them, you mean. It's only a problem if the computer's owner has stopped users from selecting keyboards. I thought Windows penetration was better than 50%.

> How about, for example, a random tourist looking for info of random
> Kazakhstan city? Will they know how to type umlaut in a city's name?
> Most likely they'll simply type it without any umlaut and lost the
> distinction

Possibly. From a US* keyboard on a PC in England, I enter "Munchen" in a Google search and get entries for München. I even get a reply panel headed "Things to do in Munich". The English Wikipedia redirects me from Munchen to Munich. Umlaut is simply not a problem.

Richard.

*Technically, it's a Thai keyboard, for when I type Tai Tham. I have trouble remembering where each digit key is. 
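[Richard's point that "Munchen" still finds München holds because search engines typically fold away accents before matching. A minimal sketch of that kind of folding, using Python's standard unicodedata module (an assumption about how an engine might do it, not a description of Google's actual pipeline):

    import unicodedata

    def strip_marks(text: str) -> str:
        # Decompose precomposed letters (ü -> u + combining diaeresis),
        # drop the combining marks (general category Mn), then recompose the rest.
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
        return unicodedata.normalize("NFC", kept)

    assert strip_marks("München") == "Munchen"
    assert strip_marks("Almaty") == "Almaty"   # unaccented text passes through unchanged

So a query typed without the umlaut can still match the accented form on the indexing side.]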
From unicode at unicode.org Tue Jan 23 16:24:05 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 23 Jan 2018 15:24:05 -0700 Subject: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F?= Message-ID: <20180123152405.665a7a7059d7ee80bb4d670165c8327d.e0f502fa27.wbe@email03.godaddy.com> Philippe Verdy wrote: > The best they should have done is instead keeping their existing > keyboard layout, continaing both the Cyrillic letters and Latin QWERTY > printed on them, but operating in two modes (depending on OS > preferences) to invert the two layouts but without changing the > keystrokes. It would just have needed one Latin letter or modified > Latin letter so that it was simply a 1 to 1 transliteration. The objective apparently was to be able use a U.S. English keyboard layout, AS IS, to type Kazakh-in-Latin. Adding new characters to the layout would defeat this purpose. Again, this may not be how you or I would solve the problem, and it may not be how the Kazakhs would solve the problem if there were no installed base (i.e. existing Latin-script keyboards with which compatibility was desired). As they say, the reason God was able to create the heavens and the earth in only 6 days was that there was no installed base to worry about. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 23 19:28:54 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 24 Jan 2018 01:28:54 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: <20180124012854.4d01b312@JRWUBU2> On Tue, 23 Jan 2018 11:51:42 -0700 Doug Ewell via Unicode wrote: > An explicitly stated goal of the new orthography was to enable typing > Kazakh on a "standard keyboard," meaning an English-language one. > Nazarbayev may ultimately be persuaded to embrace ASCII digraphs, > which also meet this goal, but this talk about U+2019 and U+02BC will > make exactly zero difference in Kazakh policy. Is it only in English then that typing an apostrophe key after a letter can't be relied UPON to yield U+0027 rather than U+2019? Richard. From unicode at unicode.org Wed Jan 24 03:45:03 2018 From: unicode at unicode.org (philip chastney via Unicode) Date: Wed, 24 Jan 2018 09:45:03 +0000 (UTC) Subject: 0027, 02BC, 2019, or a new character? References: <1514708770.3990073.1516787103300.ref@mail.yahoo.com> Message-ID: <1514708770.3990073.1516787103300@mail.yahoo.com> OK, he's no technocrat, but try googling "tony blair kazakhstan" and in case anybody's wondering what Nazarbayev got for his five million pounds, for a partial explanation, check out https://www.rt.com/uk/340035-blair-strike-kazakhstan-massacre/ it is not known if Blair profferred any advice on keyboard design, though, so this may be off-topic /phil -------------------------------------------- On Tue, 23/1/18, Doug Ewell via Unicode wrote: Subject: Re: 0027, 02BC, 2019, or a new character? To: "Unicode Mailing List" Date: Tuesday, 23 January, 2018, 6:51 PM I think it's so cute that some of us think we can advise Nazarbayev on whether to use straight or curly apostrophes or accents or x's or whatever. Like he would listen to a bunch of Western technocrats. 
From unicode at unicode.org Wed Jan 24 07:27:04 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Wed, 24 Jan 2018 13:27:04 +0000 Subject: Internationalization & Unicode Conference 2018 Message-ID: <2940A15B-0DF5-4643-855D-646A94BBE541@lboro.ac.uk> I am thinking that people at Internationalization & Unicode Conference 2018 may well be interested in my story and, at times difficult, journey. It has been a long journey. Title of my presentation would be "How I Internationalized my Computer Science Teaching". Would any organisation on this list be willing to fund my attendance: travel from England, accommodation ...etc... Alternatively, can you please point me to a funding body to which I can apply. Thank you Andr? Schappo From unicode at unicode.org Wed Jan 24 10:11:25 2018 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Wed, 24 Jan 2018 08:11:25 -0800 Subject: Internationalization & Unicode Conference 2018 In-Reply-To: <2940A15B-0DF5-4643-855D-646A94BBE541@lboro.ac.uk> References: <2940A15B-0DF5-4643-855D-646A94BBE541@lboro.ac.uk> Message-ID: If your presentation is accepted for the conference, you should get a hotel discount. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 16:19:07 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 24 Jan 2018 15:19:07 -0700 Subject: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F?= Message-ID: <20180124151907.665a7a7059d7ee80bb4d670165c8327d.1f48e0f353.wbe@email03.godaddy.com> James Kass wrote: > Heh. We are offering sound advice. If people fail to heed it, that's > too bad. We're offering excellent advice, very well informed. But the leadership has made the decision that it has made. All the news stories say that linguistic experts in Kazakhstan offered similar good advice, and were disheartened to learn it was ignored completely. Richard Wordingham wrote: > Is it only in English then that typing an apostrophe key after a > letter can't be relied UPON to yield U+0027 rather than U+2019? Um, I always get U+0027 when I expect it. Oh wait, you must be talking about AutoCorrect on Microsoft Word. Just visit AutoCorrect Options and turn off that particular "replace as you type" option, and be done with it. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Wed Jan 24 17:55:34 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 25 Jan 2018 00:55:34 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180124151907.665a7a7059d7ee80bb4d670165c8327d.1f48e0f353.wbe@email03.godaddy.com> References: <20180124151907.665a7a7059d7ee80bb4d670165c8327d.1f48e0f353.wbe@email03.godaddy.com> Message-ID: So there will be a new administrative jargon in Kazakhstan that people won't like, and outside the government, they'll continue using their exiosting keyboards, and will only trnasliterate to Latin using a simple 1-t-to-1 mapping without the ugly apostrophes (most probably acute accents on vowels, or carons like in Serbian, notably on 'c' and 's' where acute accents are rarely found in many fonts : there's already a wide support Latin alphabets of Serbian, Hungarian, Slovakian, Polish ; and the special case for i can still avoid the computer nightmare of dotless vs. dotted versions used in Turkish, by using acute accents instead of these damned apostrophes...) 
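[To make the "accents instead of apostrophes" argument above concrete: an apostrophe inside a word looks like punctuation to most text processing, while a precomposed accented letter is just another letter. A small sketch, in which the accented spelling is hypothetical (one plausible acute-accent rendering of the word ja'ne from the apostrophe-orthography sample quoted later in this thread), not an official orthography:

    import re

    apostrophe_spelling = "ja'ne"   # from the apostrophe-laden sample text
    accented_spelling = "jáne"      # hypothetical accent-based spelling, for comparison only

    # A naive Unicode-aware tokenizer: the apostrophe splits the word, the accent does not.
    print(re.findall(r"\w+", apostrophe_spelling))   # ['ja', 'ne']
    print(re.findall(r"\w+", accented_spelling))     # ['jáne']

The same effect shows up in spell checkers, word counts, double-click selection and quoted string literals.]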
Newspapers and books will continue for a wihile being published in Cyrillic (unless the Kazakh autority requires them to ban Cyrillic, but it will likely occur first on TV). Soon they will realize that this is not sustainable and that their decision causes many more problems with international documents, and will finally adopt the accents that will really promote their language to the web instead of freezing it in the Dark Age of ambiguous ASCII used in the early 1960's (when even the Cyrillic alphabet was not supported)... 2018-01-24 23:19 GMT+01:00 Doug Ewell via Unicode : > James Kass wrote: > > > Heh. We are offering sound advice. If people fail to heed it, that's > > too bad. > > We're offering excellent advice, very well informed. But the leadership > has made the decision that it has made. All the news stories say that > linguistic experts in Kazakhstan offered similar good advice, and were > disheartened to learn it was ignored completely. > > Richard Wordingham wrote: > > > Is it only in English then that typing an apostrophe key after a > > letter can't be relied UPON to yield U+0027 rather than U+2019? > > Um, I always get U+0027 when I expect it. > > Oh wait, you must be talking about AutoCorrect on Microsoft Word. Just > visit AutoCorrect Options and turn off that particular "replace as you > type" option, and be done with it. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 20:29:08 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Thu, 25 Jan 2018 07:59:08 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: On 23-Jan-2018 10:03, "James Kass via Unicode" wrote: (bottle, east, skier, crucial, cherry) s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e Last one most readable of the lot IMO and it's close enough to the apostrophe option. IIANM the apostrophe is used as a dead key for the acute accent in some common international keyboard layouts already? I retract my earlier statement about digraphs probably being the best option. It was made without looking at the actual requirement. For such heavy usage, it would simply make things horrible. Acute accent for the win! ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 20:29:11 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Thu, 25 Jan 2018 07:59:11 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: On 24-Jan-2018 00:25, "Doug Ewell via Unicode" wrote: I think it's so cute that some of us think we can advise Nazarbayev on whether to use straight or curly apostrophes or accents or x's or whatever. Like he would listen to a bunch of Western technocrats. 
Sir why this assumption that everyone here is "western"? I'm situated at an even more eastern longitude than Kazakhstan. An explicitly stated goal of the new orthography was to enable typing Kazakh on a "standard keyboard," meaning an English-language one. IMO it's hardly clear that that is or in fact *what* is meant by a standard keyboard. It meeely seems to me loose political speak to make it appear as if they are trying to make things simpler for the people. Nazarbayev may ultimately be persuaded to embrace ASCII digraphs, which also meet this goal, but this talk about U+2019 and U+02BC will make exactly zero difference in Kazakh policy. It shouldn't. At least the technical advisors should be monitoring this discussion if not participate in it. I know that Govt of India people do, at least on UnicoRe. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 20:49:24 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Thu, 25 Jan 2018 02:49:24 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: On Wed, Jan 24, 2018 at 6:31 PM Shriramana Sharma via Unicode < unicode at unicode.org> wrote: > > On 23-Jan-2018 10:03, "James Kass via Unicode" > wrote: > > (bottle, east, skier, crucial, cherry) > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > > [...] > > I retract my earlier statement about digraphs probably being the best > option. It was made without looking at the actual requirement. For such > heavy usage, it would simply make things horrible. > I'd say that the words chosen for this discussion have been specifically chosen for their heavy usage. Wikipedia has a translation of "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.", in what I believe in the new apostrophe-laden orthography: Barlyq adamdar tu'masynan azat ja'ne qadyr-qasi'eti men quqtary ten' bolyp du'ni'ege keledi. Adamdarg'a aqyl-parasat, ar-ojdan berilgen, sondyqtan olar bir-birimen tu'ystyq, bau'yrmaldyq qarym-qatynas jasau'lary ti'is. It's not that bad, though apostrophes still aren't a orthographic win. I'm voting for the Uniform Turkic Alphabet, for the grand total of zero my vote is worth. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 21:27:34 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Wed, 24 Jan 2018 22:27:34 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: On 01/24/2018 09:29 PM, Shriramana Sharma via Unicode wrote: > On 24-Jan-2018 00:25, "Doug Ewell via Unicode" > wrote: > > I think it's so cute that some of us think we can advise Nazarbayev on > whether to use straight or curly apostrophes or accents or x's or > whatever. Like he would listen to a bunch of Western technocrats. 
> > > Sir why this assumption that everyone here is "western"? I'm situated > at an even more eastern longitude than Kazakhstan. It hardly matters. As the intent here is to comment on Nazarbayev's putative view of these discussions, it's quite likely he would write the whole lot of us off as "Western technocrats" no matter what our longitudes. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 21:30:52 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 25 Jan 2018 04:30:52 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <6c1f3146-e912-7bf7-c6a1-841352a8fd5f@it.aoyama.ac.jp> Message-ID: I agree, and still you won't necessarily have to press a dead key to have these characters, if you map one key where the Cyrillic letter was producing directly the character with its accent. No surprise for user, fast to type, easy to learn, typographically correct, preserves the etymologies and allows preservation of culture with a basic 1:1 transliterator between the two scripts. However, if you can type one key to produce one latin letter with its accent, I don't see why it could not use the caron instead of the acute above s and c, so that it is also immediately readable in other Eastern European languages. In addition they'll get better font support for x and c with caron than for s and c with acute and easy mappings from more softwares that handle only 8 bit charsets. The ISO 8859-2 subset (or Windows 1250) is the way to go if they don't want the complexity of the dotless i from other Turkic Latin alphabets. 2018-01-25 3:29 GMT+01:00 Shriramana Sharma via Unicode : > > > On 23-Jan-2018 10:03, "James Kass via Unicode" > wrote: > > (bottle, east, skier, crucial, cherry) > s'i's'a, s'yg'ys, s'an'g'ys'y, s'es'u's'i, s'i'i'e > sxixsxa, sxygxys, sxanxgxysxy, sxesxuxsxi, sxixixe > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > s?i?s?a, s?yg?ys, s?an?g?ys?y, s?es?u?s?i, s?i?i?e > > Last one most readable of the lot IMO and it's close enough to the > apostrophe option. IIANM the apostrophe is used as a dead key for the acute > accent in some common international keyboard layouts already? > > I retract my earlier statement about digraphs probably being the best > option. It was made without looking at the actual requirement. For such > heavy usage, it would simply make things horrible. > > Acute accent for the win! ?? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 24 21:41:02 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 25 Jan 2018 04:41:02 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: Great but then why sticking on a pure western subset (ASCII is mostly for US only). If he wants to be eastern, so choose ISO 8859-2. 
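[For what it's worth, that coverage claim is easy to check against Python's built-in codecs for the two legacy encodings named above. The candidate letters here are an illustrative grab bag, not any official alphabet:

    # Which candidate accented letters survive the 8-bit encodings mentioned above?
    candidates = "áéíóúü śćń šžč ǵğı"
    for enc in ("iso8859-2", "cp1250"):
        fits = "".join(c for c in candidates if not c.isspace() and c.encode(enc, "ignore"))
        gaps = "".join(c for c in candidates if not c.isspace() and not c.encode(enc, "ignore"))
        print(enc, "has:", fits, "lacks:", gaps)

Both Latin-2 and Windows-1250 cover the acute vowels and the acute/caron consonants shown, but neither has ǵ, ğ or dotless ı, so the exact letter choices still matter for legacy-charset compatibility.]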
As a bonus, banning the apostrophe from the alphabet will have be security improvement (thing about the many cases where ASCII apostrophes are used as string delimiters in various programming and markup languages, and how frequently text variables get simply surrounded by ASCII quotes as if the text did not contain them: less frequent problems if the natural orthography avoids it. Less problems for processing texts internationally (think about technical documents, and air navigation, where local place names are inserted; even if these systems use UTF-8, the quotes will still need escaping and escaping mechanisms are not so universal...). 2018-01-25 4:27 GMT+01:00 Mark E. Shoulson via Unicode : > On 01/24/2018 09:29 PM, Shriramana Sharma via Unicode wrote: > > On 24-Jan-2018 00:25, "Doug Ewell via Unicode" > wrote: > > I think it's so cute that some of us think we can advise Nazarbayev on > whether to use straight or curly apostrophes or accents or x's or > whatever. Like he would listen to a bunch of Western technocrats. > > > Sir why this assumption that everyone here is "western"? I'm situated at > an even more eastern longitude than Kazakhstan. > > It hardly matters. As the intent here is to comment on Nazarbayev's > putative view of these discussions, it's quite likely he would write the > whole lot of us off as "Western technocrats" no matter what our longitudes. > > ~mark > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 25 00:51:33 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 25 Jan 2018 06:51:33 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <20180123115142.665a7a7059d7ee80bb4d670165c8327d.faa0083c34.wbe@email03.godaddy.com> Message-ID: <20180125065133.6c4a7a4e@JRWUBU2> On Thu, 25 Jan 2018 07:59:11 +0530 Shriramana Sharma via Unicode wrote: > IMO it's hardly clear that that is or in fact *what* is meant by a > standard keyboard. It meeely seems to me loose political speak to > make it appear as if they are trying to make things simpler for the > people. >From what I could find on the web, it seems that desktop keyboards in Kazakhstan are normally labelled with Kazakh Cyrillic and printable ASCII. The ASCII is arranged as US QWERTY. > It shouldn't. At least the technical advisors should be monitoring > this discussion if not participate in it. I know that Govt of India > people do, at least on UnicoRe. The Indian Government is an institutional member of Unicode, with a UTC vote when they attend regularly. Richard. From unicode at unicode.org Thu Jan 25 06:15:18 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Thu, 25 Jan 2018 12:15:18 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: On 23 January 2018 at 00:55, James Kass via Unicode wrote: > > Regular American users simply don't type umlauts, period. 
Not even the president of the Unicode Consortium when referring to Christoph P?per: http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf Andrew From unicode at unicode.org Thu Jan 25 08:14:46 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 25 Jan 2018 07:14:46 -0700 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <20180124151907.665a7a7059d7ee80bb4d670165c8327d.1f48e0f353.wbe@email03.godaddy.com> Message-ID: <4C8A3B8D15F147D98EF87C7CDDFE5F1D@DougEwell> Philippe Verdy wrote: > So there will be a new administrative jargon in Kazakhstan that people > won't like, and outside the government, they'll continue using their > exiosting keyboards [...] > > Newspapers and books will continue for a wihile being published in > Cyrillic [...] Yes, it will be a mess. I think we can agree on that. > Soon they will realize that this is not sustainable And that, only that, is what will cause them to change it. Shriramana Sharma wrote: > Sir why this assumption that everyone here is "western"? I'm situated > at an even more eastern longitude than Kazakhstan. Most of the participants in this "apostrophe" thread appeared to be from North America and Western Europe; I think you're the only one who expanded that. I wasn't referring to the geographical or cultural makeup of the list as a whole. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Thu Jan 25 09:40:44 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 25 Jan 2018 16:40:44 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: Such example shows that ignoring umlauts makes the document counterintuitive. Nobody is able to infer that "Paper" refers to a person here or if he actually meant a paper sheet/article... At least he should have written "Paeper" which would be more correct (if "Christoph P?per" is German, the umlaut is equivalent to a following "e"), or even "Christoph Paper". Apply that tot the Kazakh language, and attempt to drop the apostrophes (because they very commonly cause various technical issues in softwares), I'm sure you'll see problems of interpretation or too many synonyms, that the use of acute instead would have avoided All softwares today are "8-bit" clean and support at least ISO 8859-1 or windows 1252, if they don't support multibyte UTF-8; the time of 7-bit ASCII is ended now since long, except in very old systems, that were anyway not used at all for Kazakh in Cyrillic; so acute accents are more likely than ASCII apostrophes to survive the technical software constraints, notably if Latin letters with accents come from the ISO 8859-1 subset which is also 8-bit in Unicode. Even with UTF-8, these Latin letters with accents (from any ISO 8859-* subset) will be 2-byte wide, so exactly the same encoding size as basic letter+ASCII quote and the encoding size is definitely not an issue anywhere (all existing Kazakh Cyrillic letters are already using 2-byte encoding in UTF-8, as all their assigned code points values were higher than 0x7F but lower than 0x800) Choosing the ASCII quote for this "apostrophe" will not save anything ; but the regular Unicode apostrophe U+2019 would need... 
3 bytes after the 1-byte basic Latin letter from ASCII (so it is worse !). Choosing the acute accent above Latin letters from ISO 8859-* would avoid this issue, because they are precombined, and in UTF-8 the usual prefered representation is in NFC form using a single code points. Javascript, Java, or C/C++ "wide string" types will handle these characters also with a single code unit (so the measured string "length" matches the number of letters). You will avoid all problems of SQL code injection on web sites if you have to allow the ASCII quotes unfiltered in data input forms to represent the proposed Kazakh orthography: with the acute, you can still continue to reject all ASCII quotes from software input forms and people won't be forced to use the alternate U+2019, not found on their basic keyboards, or will not substitute it by an hyphen or space or will not drop it completely; they'll just type letters with acute accents with a single keystroke on their Latinized keyboard. 2018-01-25 13:15 GMT+01:00 Andrew West via Unicode : > On 23 January 2018 at 00:55, James Kass via Unicode > wrote: > > > > Regular American users simply don't type umlauts, period. > > Not even the president of the Unicode Consortium when referring to > Christoph P?per: > > http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf > > Andrew > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 25 09:48:42 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 25 Jan 2018 16:48:42 +0100 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: Just a remark for fun: - You'll also note that this talk is all about the apostrophe, and if Kazakhstan wants to introduce it in 2019, that year will match exactly the code point U+2019 [ ? ]... - This year 2018 is also the year to discuss and reverse the apostrophe decision, and it matches the codepoint U+2018 [ ? ] for the reversed apostrophe. Happy new years to ?Kazakhstan? ! But now we have a new way to memoize the code point value for these apostrophes ! 2018-01-25 16:40 GMT+01:00 Philippe Verdy : > Such example shows that ignoring umlauts makes the document > counterintuitive. Nobody is able to infer that "Paper" refers to a person > here or if he actually meant a paper sheet/article... > At least he should have written "Paeper" which would be more correct (if > "Christoph P?per" is German, the umlaut is equivalent to a following "e"), > or even "Christoph Paper". 
> > Apply that tot the Kazakh language, and attempt to drop the apostrophes > (because they very commonly cause various technical issues in softwares), > I'm sure you'll see problems of interpretation or too many synonyms, that > the use of acute instead would have avoided > > All softwares today are "8-bit" clean and support at least ISO 8859-1 or > windows 1252, if they don't support multibyte UTF-8; the time of 7-bit > ASCII is ended now since long, except in very old systems, that were anyway > not used at all for Kazakh in Cyrillic; so acute accents are more likely than > ASCII apostrophes to survive the technical software constraints, notably > if Latin letters with accents come from the ISO 8859-1 subset which is also > 8-bit in Unicode. Even with UTF-8, these Latin letters with accents (from > any ISO 8859-* subset) will be 2-byte wide, so exactly the same encoding > size as basic letter+ASCII quote and the encoding size is definitely not an > issue anywhere (all existing Kazakh Cyrillic letters are already using > 2-byte encoding in UTF-8, as all their assigned code points values were > higher than 0x7F but lower than 0x800) > > Choosing the ASCII quote for this "apostrophe" will not save anything ; > but the regular Unicode apostrophe U+2019 would need... 3 bytes after the > 1-byte basic Latin letter from ASCII (so it is worse !). > > Choosing the acute accent above Latin letters from ISO 8859-* would avoid > this issue, because they are precombined, and in UTF-8 the usual prefered > representation is in NFC form using a single code points. Javascript, Java, > or C/C++ "wide string" types will handle these characters also with a > single code unit (so the measured string "length" matches the number of > letters). You will avoid all problems of SQL code injection on web sites if > you have to allow the ASCII quotes unfiltered in data input forms to > represent the proposed Kazakh orthography: with the acute, you can still > continue to reject all ASCII quotes from software input forms and people > won't be forced to use the alternate U+2019, not found on their basic > keyboards, or will not substitute it by an hyphen or space or will not drop > it completely; they'll just type letters with acute accents with a single > keystroke on their Latinized keyboard. > > > 2018-01-25 13:15 GMT+01:00 Andrew West via Unicode : > >> On 23 January 2018 at 00:55, James Kass via Unicode >> wrote: >> > >> > Regular American users simply don't type umlauts, period. >> >> Not even the president of the Unicode Consortium when referring to >> Christoph P?per: >> >> http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf >> >> Andrew >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 25 12:48:59 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 25 Jan 2018 10:48:59 -0800 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: My apologies for the typo. There's no excuse for misspelling someone's name (especially since I live in Switzerland, and type German every day). Thanks for calling my attention to it: the doc has been updated. 
Mark Mark On Thu, Jan 25, 2018 at 4:15 AM, Andrew West via Unicode < unicode at unicode.org> wrote: > On 23 January 2018 at 00:55, James Kass via Unicode > wrote: > > > > Regular American users simply don't type umlauts, period. > > Not even the president of the Unicode Consortium when referring to > Christoph P?per: > > http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf > > Andrew > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 25 13:34:16 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 25 Jan 2018 12:34:16 -0700 Subject: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F?= Message-ID: <20180125123416.665a7a7059d7ee80bb4d670165c8327d.cb0cedf332.wbe@email03.godaddy.com> Philippe Verdy wrote: > I agree, and still you won't necessarily have to press a dead key to > have these characters, if you map one key where the Cyrillic letter > was > producing directly the character with its accent. [...] > > However, if you can type one key to produce one latin letter with its > accent, I don't see why it could not use the caron instead of the > acute above s and c, so that it is also immediately readable in other > Eastern European languages. [...] I think it is very likely the Kazakhs, like most people who are not experts on computers or Unicode, did not consider the distinction between the physical keyboard (hardware) and the driver that maps keystrokes to characters (software). And they might consider replacing software drivers nationwide to be as unfeasible as replacing physical keyboards. Remember the government of Kazakhstan is probably not composed of computer experts. > As a bonus, banning the apostrophe from the alphabet will have be > security improvement (thing about the many cases where ASCII > apostrophes are used as string delimiters in various programming and > markup languages Another fact that they really did not seem to take into account. The advisers and linguists might have considered this, but not the decision-maker(s). > the time of 7-bit ASCII is ended now since long, except in very old > systems, And on U.S. English keyboards. (It's true, as Sharma says, that they didn't specify exactly what they meant by a "standard keyboard," but they did banish all diacritical marks, so...) > Even with UTF-8, these Latin letters with accents (from any ISO 8859-* > subset) will be 2-byte wide, so exactly the same encoding size as > basic letter+ASCII quote and the encoding size is definitely not an > issue anywhere (all existing Kazakh Cyrillic letters are already using > 2-byte encoding in UTF-8, as all their assigned code points values > were higher than 0x7F but lower than 0x800) [...] > > Choosing the ASCII quote for this "apostrophe" will not save > anything ; but the regular Unicode apostrophe U+2019 would need... 3 > bytes after the 1-byte basic Latin letter from ASCII (so it is > worse !). I did not see any evidence that this was something they ever considered or cared about. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Fri Jan 26 02:25:10 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Fri, 26 Jan 2018 08:25:10 +0000 Subject: 0027, 02BC, 2019, or a new character? 
In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> Message-ID: <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> Talking of typing names correctly. Few people bother to type the acute accent in Andr?. This academic year, for the first time ever, I gave the following challenges to my web programming class of 143 students. I gave these challenges in the first lecture. ? learn how to write my name correctly on your desktop computers and mobile phones ? whenever you email me, ensure you write my name correctly I am pleased to report that the majority of this class now do type my name correctly when emailing me ?? Andr? Schappo On 25 Jan 2018, at 18:48, Mark Davis ?? via Unicode > wrote: My apologies for the typo. There's no excuse for misspelling someone's name (especially since I live in Switzerland, and type German every day). Thanks for calling my attention to it: the doc has been updated. Mark Mark On Thu, Jan 25, 2018 at 4:15 AM, Andrew West via Unicode > wrote: On 23 January 2018 at 00:55, James Kass via Unicode > wrote: > > Regular American users simply don't type umlauts, period. Not even the president of the Unicode Consortium when referring to Christoph P?per: http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf Andrew ?? ?? ?? Andr? Schappo schappo.blogspot.co.uk twitter.com/andreschappo weibo.com/andreschappo groups.google.com/forum/#!forum/computer-science-curriculum-internationalization -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 26 02:49:55 2018 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Fri, 26 Jan 2018 14:19:55 +0530 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> Message-ID: But your outgoing "From" address doesn't seem to have an accent!? On 26-Jan-2018 13:58, "Andre Schappo via Unicode" wrote: > > Talking of typing names correctly. Few people bother to type the acute > accent in Andr?. > > This academic year, for the first time ever, I gave the following > challenges to my web programming class of 143 students. I gave these > challenges in the first lecture. > > ? learn how to write my name correctly on your desktop computers and > mobile phones > ? whenever you email me, ensure you write my name correctly > > I am pleased to report that the majority of this class now do type my name > correctly when emailing me ?? > > Andr? Schappo > > On 25 Jan 2018, at 18:48, Mark Davis ?? via Unicode > wrote: > > My apologies for the typo. There's no excuse for misspelling someone's > name (especially since I live in Switzerland, and type German every day). > > Thanks for calling my attention to it: the doc has been updated. 
> > Mark > > Mark > > On Thu, Jan 25, 2018 at 4:15 AM, Andrew West via Unicode < > unicode at unicode.org> wrote: > >> On 23 January 2018 at 00:55, James Kass via Unicode >> wrote: >> > >> > Regular American users simply don't type umlauts, period. >> >> Not even the president of the Unicode Consortium when referring to >> Christoph P?per: >> >> http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf >> >> Andrew >> >> > > ?? ?? ?? > Andr? Schappo > schappo.blogspot.co.uk > twitter.com/andreschappo > weibo.com/andreschappo > groups.google.com/forum/#!forum/computer-science-curriculum- > internationalization > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 26 03:08:51 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Fri, 26 Jan 2018 09:08:51 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> Message-ID: <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> Ah! Yes?? That is a battle I gave up a long time ago. The database here can only handle ASCII. I have stopped trying to get the systems people here to convert the database to UTF-8. A few days ago I asked the systems people if they were going upgrade their MS mail server to handle non ASCII email addresses such as my Chinese email address. I will not go into details but basically they have no plans to support non ASCII email addresses. Further to my challenge: Before I set the below challenges to the students I described a possible scenario. Imagine you are responsible for a website with a backend database. This website provides financial management for a number of extremely wealthy clients. These clients are from many different parts of the world. If you cannot be bothered to get their names correct you could easily offend and hence lose clients. Just losing one client will be a huge loss in revenue for your company. My advice is: Learn the correct forms of their names in both the Latin script and the native script. Store both forms in your backend database. Andr? Schappo On 26 Jan 2018, at 08:49, Shriramana Sharma > wrote: But your outgoing "From" address doesn't seem to have an accent!? On 26-Jan-2018 13:58, "Andre Schappo via Unicode" > wrote: Talking of typing names correctly. Few people bother to type the acute accent in Andr?. This academic year, for the first time ever, I gave the following challenges to my web programming class of 143 students. I gave these challenges in the first lecture. ? learn how to write my name correctly on your desktop computers and mobile phones ? whenever you email me, ensure you write my name correctly I am pleased to report that the majority of this class now do type my name correctly when emailing me ?? Andr? Schappo On 25 Jan 2018, at 18:48, Mark Davis ?? via Unicode > wrote: My apologies for the typo. There's no excuse for misspelling someone's name (especially since I live in Switzerland, and type German every day). Thanks for calling my attention to it: the doc has been updated. 
Mark Mark On Thu, Jan 25, 2018 at 4:15 AM, Andrew West via Unicode > wrote: On 23 January 2018 at 00:55, James Kass via Unicode > wrote: > > Regular American users simply don't type umlauts, period. Not even the president of the Unicode Consortium when referring to Christoph P?per: http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf Andrew ?? ?? ?? Andr? Schappo schappo.blogspot.co.uk twitter.com/andreschappo weibo.com/andreschappo groups.google.com/forum/#!forum/computer-science-curriculum-internationalization ?? ?? ?? Andr? Schappo schappo.blogspot.co.uk twitter.com/andreschappo weibo.com/andreschappo groups.google.com/forum/#!forum/computer-science-curriculum-internationalization -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 26 10:47:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 26 Jan 2018 16:47:49 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> Message-ID: <20180126164749.3e9ab2e7@JRWUBU2> On Fri, 26 Jan 2018 09:08:51 +0000 Andre Schappo via Unicode wrote: > Ah! Yes?? That is a battle I gave up a long time ago. The database > here can only handle ASCII. I have stopped trying to get the systems > people here to convert the database to UTF-8. Some systems (or admins) have been totally defeated by even the ASCII version of ?O?Sullivan?. That bodes ill for Kazakhs. Richard. From unicode at unicode.org Fri Jan 26 15:14:22 2018 From: unicode at unicode.org (John W Kennedy via Unicode) Date: Fri, 26 Jan 2018 16:14:22 -0500 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180126164749.3e9ab2e7@JRWUBU2> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> <20180126164749.3e9ab2e7@JRWUBU2> Message-ID: <2AC5B8FB-747F-4016-827C-9B0582CCE27A@gmail.com> In cold-metal days, many were driven to resort to ?M?Donald? for lack of a superscript ?c?. > On Jan 26, 2018, at 11:47 AM, Richard Wordingham via Unicode wrote: > > On Fri, 26 Jan 2018 09:08:51 +0000 > Andre Schappo via Unicode wrote: > >> Ah! Yes?? That is a battle I gave up a long time ago. The database >> here can only handle ASCII. I have stopped trying to get the systems >> people here to convert the database to UTF-8. > > Some systems (or admins) have been totally defeated by even the ASCII > version of ?O?Sullivan?. That bodes ill for Kazakhs. > > Richard. > From unicode at unicode.org Sat Jan 27 03:29:03 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Sat, 27 Jan 2018 09:29:03 +0000 (GMT) Subject: 0027, 02BC, 2019, or a new character? 
References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> <20180126164749.3e9ab2e7@JRWUBU2> Message-ID: On 2018-01-26, Richard Wordingham via Unicode wrote: > Some systems (or admins) have been totally defeated by even the ASCII > version of ?O?Sullivan?. That bodes ill for Kazakhs. The head (about to be ex-head) of my university is Sir Timothy O'Shea. On the student record system, it is impossible to search for students called O'Shea (I have one). I suppose it doesn't sanitize correctly - I haven't tried looking for little Bobby Tables yet. It hadn't occurred to me to check, but of course searching for O?Shea doesn't work either, as they usually enter their own names into the initial record, and use 0027. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Sat Jan 27 04:22:22 2018 From: unicode at unicode.org (Denis Jacquerye via Unicode) Date: Sat, 27 Jan 2018 10:22:22 +0000 Subject: In the mean time, in France (was Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <0CF721F4-95DF-4820-BDA8-A4E1CE0C49C4@lboro.ac.uk> <71A09394-76E6-40E0-BF04-856C19B8119C@lboro.ac.uk> <20180126164749.3e9ab2e7@JRWUBU2> Message-ID: In the mean time, in France, a municipality is refusing to let a baby be registered with an apostrophe in his Breton name while several babies have had apostrophes in their names in recent years : 2017 N'n?n? (F), 2017 Tu'iuvea (M), 2016 D'jessy (M), 2015 N'Guessan (F), 2015 Chem's (M), 2014 N'Khany (M) 2012 Manec'h (M). https://www.connexionfrance.com/French-news/Rennes-mayor-to-challenge-ban-on-Breton-first-names If only someone had told them it?s not necessarily an apostrophe but can be U+02BC or U+02BB in some of these. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 27 09:31:56 2018 From: unicode at unicode.org (Stephane Bortzmeyer via Unicode) Date: Sat, 27 Jan 2018 16:31:56 +0100 Subject: [HUMOR] Proof that emojis are useful Message-ID: <20180127153156.wmatbq6bkpzzp2ea@sources.org> Nice scientific info, and with emojis : https://twitter.com/biolojical/status/956953421130514432 From unicode at unicode.org Sat Jan 27 11:45:30 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sat, 27 Jan 2018 09:45:30 -0800 Subject: [HUMOR] Proof that emojis are useful In-Reply-To: <20180127153156.wmatbq6bkpzzp2ea@sources.org> References: <20180127153156.wmatbq6bkpzzp2ea@sources.org> Message-ID: Nice, thanks! Mark On Sat, Jan 27, 2018 at 7:31 AM, Stephane Bortzmeyer via Unicode < unicode at unicode.org> wrote: > Nice scientific info, and with emojis : > > https://twitter.com/biolojical/status/956953421130514432 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 27 13:40:49 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Jan 2018 20:40:49 +0100 Subject: TIRONIAN SIGN ET Message-ID: <86tvv7ozsu.fsf@mimuw.edu.pl> Hi! I try to find in UTC Document Register the proposals for characters which interest me for some reasons. I'm usually rather successful, but I'm unable to find the proposal for TIRONIAN SIGN ET. Any hints? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From unicode at unicode.org Sat Jan 27 13:53:08 2018 From: unicode at unicode.org (Rick McGowan via Unicode) Date: Sat, 27 Jan 2018 11:53:08 -0800 Subject: TIRONIAN SIGN ET In-Reply-To: <86tvv7ozsu.fsf@mimuw.edu.pl> References: <86tvv7ozsu.fsf@mimuw.edu.pl> Message-ID: <5A6CD8A4.2000909@unicode.org> Hello Janusz -- Try this: http://www.unicode.org/L2/L2017/17300-n4841-tironian-et.pdf Regards, On 1/27/2018 11:40 AM, Janusz S. Bie? via Unicode wrote: > Hi! > > I try to find in UTC Document Register the proposals for characters > which interest me for some reasons. I'm usually rather successful, but > I'm unable to find the proposal for TIRONIAN SIGN ET. > > Any hints? > > Best regards > > Janusz > From unicode at unicode.org Sat Jan 27 14:17:18 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sat, 27 Jan 2018 21:17:18 +0100 Subject: TIRONIAN SIGN ET In-Reply-To: <5A6CD8A4.2000909@unicode.org> (Rick McGowan's message of "Sat, 27 Jan 2018 11:53:08 -0800") References: <86tvv7ozsu.fsf@mimuw.edu.pl> <5A6CD8A4.2000909@unicode.org> Message-ID: <86po5voy41.fsf@mimuw.edu.pl> On Sat, Jan 27 2018 at 20:53 CET, rick at unicode.org writes: > Hello Janusz -- > > Try this: http://www.unicode.org/L2/L2017/17300-n4841-tironian-et.pdf > > Regards, > > On 1/27/2018 11:40 AM, Janusz S. Bie? via Unicode wrote: >> Hi! >> >> I try to find in UTC Document Register the proposals for characters >> which interest me for some reasons. I'm usually rather successful, but >> I'm unable to find the proposal for TIRONIAN SIGN ET. I've seen this document, but I'm looking for an earlier one. The character was introduced in Unicode 3.0 in 1999, cf. e.g. http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML015/0250.html Regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From unicode at unicode.org Sat Jan 27 15:54:57 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 27 Jan 2018 22:54:57 +0100 (CET) Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180123215246.56e459f0@JRWUBU2> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <20180123215246.56e459f0@JRWUBU2> Message-ID: <2017846446.13404.1517090097895.JavaMail.www@wwinf1e23> On Tue, 23 Jan 2018 21:52:46 +0000, Richard Wordingham wrote: > > On Wed, 24 Jan 2018 03:22:37 +0800 > Phake Nick via Unicode wrote: > > > >I found the Windows 'US International' keyboard layout highly > > >intuitive for accented Latin-1 characters. 
> > How common is the US International keyboard in real life..? > > I thought it was two copies per new Windows PC - one for 32- and the > other for 64-bit code. I was talking about the *layout*. [...] The US-Intl is so weird “you can’t just leave it on all the time” as reported in: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html Now that CLDR is sorting out how to improve keyboard layouts, hopefully something falls off to replace the *legacy* US-Intl. As for how common the new one will become, I guess it depends on whether it gets less weird than the old one, and to what extent. Regards, Marcel From unicode at unicode.org Sat Jan 27 16:13:40 2018 From: unicode at unicode.org (Shervin Afshar via Unicode) Date: Sat, 27 Jan 2018 14:13:40 -0800 Subject: Internationalised Computer Science Exercises In-Reply-To: <20180122220855.7b929272@JRWUBU2> References: <20180122220855.7b929272@JRWUBU2> Message-ID: On Mon, Jan 22, 2018 at 2:08 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Mon, 22 Jan 2018 at 16:39:57, Andre Schappo via Unicode < > unicode at unicode.org> wrote: > > By way of example, one programming challenge I set to students a > > couple of weeks ago involves diacritics. Please see > > jsfiddle.net/coas/wda45gLp > > Did any of them come up with the idea of using traces instead of > strings? > Care to elaborate? Are you referring to sequence alignment methods? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Jan 27 22:12:30 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 28 Jan 2018 04:12:30 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> Message-ID: <20180128041230.26b34022@JRWUBU2> On Sat, 27 Jan 2018 14:13:40 -0800 Shervin Afshar wrote: > On Mon, Jan 22, 2018 at 2:08 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > On Mon, 22 Jan 2018 at 16:39:57, Andre Schappo via Unicode < > > unicode at unicode.org> wrote: > > > By way of example, one programming challenge I set to students a > > > couple of weeks ago involves diacritics. Please see > > > jsfiddle.net/coas/wda45gLp > > Did any of them come up with the idea of using traces instead of > > strings? > Care to elaborate? Are you referring to sequence alignment methods? No, I'm thinking of the trace monoid (see e.g. https://en.wikipedia.org/wiki/Trace_monoid). One way of thinking of strings is as concatenations of the NFD decompositions of their constituent characters. Then the canonical equivalence classes of these strings form the trace monoid of indecomposable characters. The theory of regular expressions (though you may not think that mathematical regular expressions matter) extends to trace monoids, with the disturbing exception that the Kleene star of a regular language is not necessarily regular. (The prototypical example is sequences (xy)^n where x and y are distinct and commute, i.e. xy and yx are canonically equivalent in Unicode terms. A Unicode example is the set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that will recognise canonically equivalent strings). One consequence of this view is that one has to think of U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW (ậ) being both composed of the Vietnamese vowel letter U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â)
and tone mark U+0323 COMBINING DOT BELOW and also composed of, in the spirit of Thai ISO 11940 transliteration, of the transliterated Thai vowel U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW (?), corresponding to U+0E31 THAI CHARACTER MAI HAN-AKAT, and the tone mark U+0302 COMBINING CIRCUMFLEX ACCENT, corresponding to U+0E49 THAI CHARACTER MAI THO. (In ISO 11940 as specified, the tone mark is actually written on the immediately preceding consonant, not on the vowel.) Richard. From unicode at unicode.org Sat Jan 27 23:02:47 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 28 Jan 2018 05:02:47 +0000 Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <2017846446.13404.1517090097895.JavaMail.www@wwinf1e23> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <20180123215246.56e459f0@JRWUBU2> <2017846446.13404.1517090097895.JavaMail.www@wwinf1e23> Message-ID: <20180128050247.51aa6c8d@JRWUBU2> On Sat, 27 Jan 2018 22:54:57 +0100 (CET) Marcel Schneider via Unicode wrote: > The US-Intl is so weird ?you can?t just leave it on all the time? as > reported in: > > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html I did (except when I was using a totally different writing system). One just has to remember that those punctuation marks need two key strokes, the first being the space key. Mark Davis's problem seems to be that he was using an Apple half the time. Richard. From unicode at unicode.org Sun Jan 28 01:12:45 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sun, 28 Jan 2018 08:12:45 +0100 (CET) Subject: 0027, 02BC, 2019, or a new character? In-Reply-To: <20180128050247.51aa6c8d@JRWUBU2> References: <175e07ea-9092-6c22-9bb4-3d817fa37dbe@efele.net> <20180116080019.2738554a@JRWUBU2> <57DC0C82-2C14-43B3-BED7-5C5C03F0FCAA@lboro.ac.uk> <713142F1-22AF-479B-9DD8-9A317EBD608B@lboro.ac.uk> <08e2fee8-d911-6063-e5fc-1bf1dca07ae6@smontagu.org> <20180121184945.2659a1ab@JRWUBU2> <20180123215246.56e459f0@JRWUBU2> <2017846446.13404.1517090097895.JavaMail.www@wwinf1e23> <20180128050247.51aa6c8d@JRWUBU2> Message-ID: <743636785.171.1517123565285.JavaMail.www@wwinf1h34> On Sun, 28 Jan 2018 05:02:47 +0000, Richard Wordingham via Unicode wrote: > > On Sat, 27 Jan 2018 22:54:57 +0100 (CET) > Marcel Schneider via Unicode wrote: > > > The US-Intl is so weird ?you can?t just leave it on all the time? as > > reported in: > > > > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html > > I did (except when I was using a totally different writing system). > One just has to remember that those punctuation marks need two key > strokes, the first being the space key. Mark Davis's problem seems to > be that he was using an Apple half the time. Indeed, Apple?s US-extended has lots of dead keys on Option level, so that Base level ASCII symbols are left alone. Some of these are hijacked on Windows? US-international for five deadkeys only (likewise, French hijacks two), to disrupt UX wrt macOS, impacting those using both. And developers don?t like to remember hitting space before a vowel to get the (single/double/reverse) quote, or tilde or caret. On any layout, such a complication is inacceptable to most coders. But US-Intl isn?t the only case. 
The Canadian Standard layout too is cheered on Apple and disliked on Windows, obviously because beyond the first two levels, there are many many differences. That cannot really be a matter of conformance to the CAN specs, as the Windows implementation leaves out the '?' character, beside of messing up the group modifier. We can only hope that now, CLDR is thoroughly re-engineering the way international or otherwise extended keyboards are mapped. Regards, Marcel From unicode at unicode.org Sun Jan 28 01:23:15 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Sun, 28 Jan 2018 08:23:15 +0100 Subject: TIRONIAN SIGN ET In-Reply-To: <99E5E583-F751-4B5C-BE3F-6596D813690B@yahoo.ca> (David Faulks's message of "Sat, 27 Jan 2018 15:59:12 -0500") References: <86tvv7ozsu.fsf@mimuw.edu.pl> <5A6CD8A4.2000909@unicode.org> <86po5voy41.fsf@mimuw.edu.pl> <99E5E583-F751-4B5C-BE3F-6596D813690B@yahoo.ca> Message-ID: <86lggiphuk.fsf@mimuw.edu.pl> On Sat, Jan 27 2018 at 21:59 CET, davidj_faulks at yahoo.ca writes: [...] > As far as I can tell, it was originally proposed in the document n1747 > 'Contraction mark characters for the UCS? by Everson. However, I > cannot find that document anywhere. Thank you very much for the reference. On the page http://www.evertype.com/formal.html there is the link http://unicode.org/wg2/docs/n1747.pdf but it does not work. However the page http://www.unicode.org/wg2/WG2-registry.html states The archival document directory for WG2 is accessible here: http://std.dkuug.dk/jtc1/sc2/wg2/ The archives contain all available documents through 2014 and the document is at ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n1747.pdf Actually the character is "inherited" from ISO 5426-2:1996 Information and documentation -- Extension of the Latin alphabet coded character set for bibliographic information interchange -- Part 2: Latin characters used in minor European languages and obsolete typography Hence my curiosity is fully satisfied :-) Thanks again! Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From unicode at unicode.org Sun Jan 28 13:29:28 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Jan 2018 20:29:28 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: <20180128041230.26b34022@JRWUBU2> References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> Message-ID: 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Sat, 27 Jan 2018 14:13:40 -0800The theory > of regular expressions (though you may not think that mathematical > regular expressions matter) extends to trace monoids, with the > disturbing exception that the Kleene star of a regular language is not > necessarily regular. (The prototypical example is sequences (xy)^n > where x and y are distinct and commute, i.e. xy and yx are canonically > equivalent in Unicode terms. A Unicode example is the set of strings > composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that > will recognise canonically equivalent strings). > I don't see why you can't write this as the regular expression: (x | y)* For the Unicode canonical equivalences, this applies to distinct characters that have distinct non-zero combining classes. 
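This is easy to check concretely; the following is a minimal sketch (assuming Python 3 and only its standard unicodedata module; it is purely illustrative and not part of any proposal in this thread):

    import unicodedata as ud

    a, dot_below, circumflex = "a", "\u0323", "\u0302"   # ccc = 0, 220, 230

    s1 = a + dot_below + circumflex   # <a, COMBINING DOT BELOW, COMBINING CIRCUMFLEX>
    s2 = a + circumflex + dot_below   # the same two marks, swapped

    # Marks with distinct non-zero combining classes commute: both orderings
    # have the same canonical form, so the strings are canonically equivalent.
    print(ud.combining(dot_below), ud.combining(circumflex))   # 220 230
    print(ud.normalize("NFD", s1) == ud.normalize("NFD", s2))  # True
    print(ud.normalize("NFC", s1) == "\u1EAD")                 # True (single code point)

    # Marks with the SAME combining class do not commute: the order is significant.
    t1 = a + "\u0301" + "\u0308"   # acute then diaeresis, both ccc = 230
    t2 = a + "\u0308" + "\u0301"
    print(ud.normalize("NFD", t1) == ud.normalize("NFD", t2))  # False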
But of course searching for or requires transforming it to NFD first as: so thet the regexp transforms to: [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * ( * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * | * [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> Note that the "complex" set of characters used three times above is finite, it contains all combining characters of Unicode that have a non-zero combining class except above and below, i.e. all Unicode characters whose combining class is not 0, 220 (below) or 230 (above). However, It is too simplified, because the allowed combining classes must occur at most once in each possible non-zero combining class and not arbitrary numbers of them: these allowed combining classs currently are in {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, 222, 224, 226, 228, 232..234, 240} whose most member elements are used for very few combining characters (the above and below combining classes are the most populated ones but we exclude them, all the others have 1 to 9 combining characters assigned to them, or 22 characters with cc=7 (nukta), or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). Once again we can refine them also as a regexp, but this is combinatorial because we have 52 combining classes (so we would need to consider the 52! (factorial) alternates). But given the maximum length of what this can match (0 to 52 combining characters: yes it is finite), this is best done by not rewriting this as a regexp but by replacing the final "*" by {1,52}, and then just check each returned match of [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} with a simple scan of these short strings to see that they all have distinct combining classes (this just requires 52 booleans, easily stored in a single 64 bit integer initialized to 0 prior to scan the scan of these small strings). But the theory does not prevent writing it as a regexp (even if it would be extremely long). So a Kleene Star closure is possible and can be efficiently implemented (all depends on the way you represent the "current state" in the FSM: a single integer representing a single node number in the traversal graph is not the best way to do that. This is a valid regexp, the finite state machine DOES have a finite lookahead (the full regexp above will match AT MOST 107 characters including the combining marks, where 107=3+2*52), but this is general to regexps that generally cannot be transformed directly into a FSM with finite lookahead, but a FSM is possible: the regexp first transforms into a simple graph of transitions with a finite number of node (this number is bound to the length of the regexp itself) where there can be multiple states active simultaneously; then a basic transform converts this small graph by transforming nodes into new nodes representing the finite set of the combinations of active states in the first graph : There will be many more nodes, and generally this explodes in size because the transform is combinatorial, and such size explosion has worst perfomance (explosion of the memory needed to representing the new graph with a single state active). 
So regexp engines use the alternative by representing the current state of traversal of the first simple graph using a stack of active states and transiting them all separately (this can be implemented with a "bitset" whose size in bits is the number of states in the first simple graph, or by using an associative array (dictionnary of boolean properties whose keys are state numbers in the first graph, which can be set or removed: this requires much less memory and it is relatively fast, even if the full current state is not just a single small integer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 28 13:30:44 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Jan 2018 20:30:44 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> Message-ID: Typo, the full regexp has undesired asterisks: [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> 2018-01-28 20:29 GMT+01:00 Philippe Verdy : > > > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > >> On Sat, 27 Jan 2018 14:13:40 -0800The theory >> of regular expressions (though you may not think that mathematical >> regular expressions matter) extends to trace monoids, with the >> disturbing exception that the Kleene star of a regular language is not >> necessarily regular. (The prototypical example is sequences (xy)^n >> where x and y are distinct and commute, i.e. xy and yx are canonically >> equivalent in Unicode terms. A Unicode example is the set of strings >> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that >> will recognise canonically equivalent strings). >> > > I don't see why you can't write this as the regular expression: > (x | y)* > For the Unicode canonical equivalences, this applies to distinct > characters that have distinct non-zero combining classes. > > But of course searching for > > or > > > requires transforming it to NFD first as: > > > > so thet the regexp transforms to: > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > ( * [[ [^[[:cc=0:]]] - > [[:cc=above:][:cc=below:]] ]] * > | * [[ [^[[:cc=0:]]] - > [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> > > Note that the "complex" set of characters used three times above is > finite, it contains all combining characters of Unicode that have a > non-zero combining class except above and below, i.e. all Unicode > characters whose combining class is not 0, 220 (below) or 230 (above). > > However, It is too simplified, because the allowed combining classes must > occur at most once in each possible non-zero combining class and not > arbitrary numbers of them: these allowed combining classs currently are in > {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, > 222, 224, 226, 228, 232..234, 240} whose most member elements are used for > very few combining characters (the above and below combining classes are > the most populated ones but we exclude them, all the others have 1 to 9 > combining characters assigned to them, or 22 characters with cc=7 (nukta), > or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). 
> > Once again we can refine them also as a regexp, but this is combinatorial > because we have 52 combining classes (so we would need to consider the 52! > (factorial) alternates). But given the maximum length of what this can > match (0 to 52 combining characters: yes it is finite), this is best done > by not rewriting this as a regexp but by replacing the final "*" by {1,52}, > and then just check each returned match of > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} > > with a simple scan of these short strings to see that they all have > distinct combining classes (this just requires 52 booleans, easily stored > in a single 64 bit integer initialized to 0 prior to scan the scan of these > small strings). But the theory does not prevent writing it as a regexp > (even if it would be extremely long). So a Kleene Star closure is > possible and can be efficiently implemented (all depends on the way you > represent the "current state" in the FSM: a single integer representing a > single node number in the traversal graph is not the best way to do that. > > This is a valid regexp, the finite state machine DOES have a finite > lookahead (the full regexp above will match AT MOST 107 characters > including the combining marks, where 107=3+2*52), but this is general to > regexps that generally cannot be transformed directly into a FSM with > finite lookahead, but a FSM is possible: the regexp first transforms into a > simple graph of transitions with a finite number of node (this number is > bound to the length of the regexp itself) where there can be multiple > states active simultaneously; then a basic transform converts this small > graph by transforming nodes into new nodes representing the finite set of > the combinations of active states in the first graph : > > There will be many more nodes, and generally this explodes in size because > the transform is combinatorial, and such size explosion has worst > perfomance (explosion of the memory needed to representing the new graph > with a single state active). So regexp engines use the alternative by > representing the current state of traversal of the first simple graph using > a stack of active states and transiting them all separately (this can be > implemented with a "bitset" whose size in bits is the number of states in > the first simple graph, or by using an associative array (dictionnary of > boolean properties whose keys are state numbers in the first graph, which > can be set or removed: this requires much less memory and it is relatively > fast, even if the full current state is not just a single small integer. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 28 13:45:36 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sun, 28 Jan 2018 20:45:36 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> Message-ID: Note that for finding occurence of simpler combining sequences such as finding the regexp is simpler: [[ [^[[:cc=0:]]] - [[:cc=above:]] ]] * The central character class allows 53 distinct combining classes, and the maximum match length is 2+53=55 characters. If Unicode assigns new combining classes for new combining characters, the maximum match length will increase by 1 character for this regexp, and by 2 characters in the previous example. 
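Which non-zero combining classes are actually assigned, and how many characters each one carries, can be recomputed for any UCD version; here is a small sketch (again assuming Python 3 with the standard unicodedata module, which reports whatever UCD version the interpreter was built with):

    import unicodedata as ud
    from collections import Counter

    # Tally assigned characters per non-zero canonical combining class.
    per_class = Counter()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue          # skip surrogate code points
        ccc = ud.combining(chr(cp))
        if ccc:
            per_class[ccc] += 1

    print("UCD version:", ud.unidata_version)
    print("distinct non-zero combining classes:", len(per_class))
    for ccc, n in sorted(per_class.items()):
        print(f"ccc={ccc:3}: {n} characters")

    # A bound in the style quoted above: at most one mark of each *other*
    # non-zero class between the base letter and the searched combining mark.
    print("bound for <base, mark> in NFD text:", 2 + (len(per_class) - 1))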
As there can be at most 255 non-zero combining classes (due to current stability rules), finding will match at most 1+253+1 = 255 characters in any future version of Unicode, and finding will match at most 1+252+1+252+1 = 507 characters. This is still finite, small enough to be implementable with a deterministic FSM, using no more than 1 codepoint of lookahead, without using any backtrailing. 2018-01-28 20:30 GMT+01:00 Philippe Verdy : > Typo, the full regexp has undesired asterisks: > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] > ]] * > | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] > ]] * < COMBINING CIRCUMFLEX> > > > > 2018-01-28 20:29 GMT+01:00 Philippe Verdy : > >> >> >> 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < >> unicode at unicode.org>: >> >>> On Sat, 27 Jan 2018 14:13:40 -0800The theory >>> of regular expressions (though you may not think that mathematical >>> regular expressions matter) extends to trace monoids, with the >>> disturbing exception that the Kleene star of a regular language is not >>> necessarily regular. (The prototypical example is sequences (xy)^n >>> where x and y are distinct and commute, i.e. xy and yx are canonically >>> equivalent in Unicode terms. A Unicode example is the set of strings >>> composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that >>> will recognise canonically equivalent strings). >>> >> >> I don't see why you can't write this as the regular expression: >> (x | y)* >> For the Unicode canonical equivalences, this applies to distinct >> characters that have distinct non-zero combining classes. >> >> But of course searching for >> >> or >> >> >> requires transforming it to NFD first as: >> >> >> >> so thet the regexp transforms to: >> >> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] >> * >> ( * [[ [^[[:cc=0:]]] - >> [[:cc=above:][:cc=below:]] ]] * >> | * [[ [^[[:cc=0:]]] - >> [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> >> >> Note that the "complex" set of characters used three times above is >> finite, it contains all combining characters of Unicode that have a >> non-zero combining class except above and below, i.e. all Unicode >> characters whose combining class is not 0, 220 (below) or 230 (above). >> >> However, It is too simplified, because the allowed combining classes must >> occur at most once in each possible non-zero combining class and not >> arbitrary numbers of them: these allowed combining classs currently are in >> {1, 7..36, 84, 91, 103, 107, 118, 122, 129, 130, 132, 202, 214, 216, 218, >> 222, 224, 226, 228, 232..234, 240} whose most member elements are used for >> very few combining characters (the above and below combining classes are >> the most populated ones but we exclude them, all the others have 1 to 9 >> combining characters assigned to them, or 22 characters with cc=7 (nukta), >> or 32 characters with cc=1 (overlay), or 47 characters with cc=9 (virama). >> >> Once again we can refine them also as a regexp, but this is combinatorial >> because we have 52 combining classes (so we would need to consider the 52! >> (factorial) alternates). 
But given the maximum length of what this can >> match (0 to 52 combining characters: yes it is finite), this is best done >> by not rewriting this as a regexp but by replacing the final "*" by {1,52}, >> and then just check each returned match of >> >> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]{0,52} >> >> with a simple scan of these short strings to see that they all have >> distinct combining classes (this just requires 52 booleans, easily stored >> in a single 64 bit integer initialized to 0 prior to scan the scan of these >> small strings). But the theory does not prevent writing it as a regexp >> (even if it would be extremely long). So a Kleene Star closure is >> possible and can be efficiently implemented (all depends on the way you >> represent the "current state" in the FSM: a single integer representing a >> single node number in the traversal graph is not the best way to do that. >> >> This is a valid regexp, the finite state machine DOES have a finite >> lookahead (the full regexp above will match AT MOST 107 characters >> including the combining marks, where 107=3+2*52), but this is general to >> regexps that generally cannot be transformed directly into a FSM with >> finite lookahead, but a FSM is possible: the regexp first transforms into a >> simple graph of transitions with a finite number of node (this number is >> bound to the length of the regexp itself) where there can be multiple >> states active simultaneously; then a basic transform converts this small >> graph by transforming nodes into new nodes representing the finite set of >> the combinations of active states in the first graph : >> >> There will be many more nodes, and generally this explodes in size >> because the transform is combinatorial, and such size explosion has worst >> perfomance (explosion of the memory needed to representing the new graph >> with a single state active). So regexp engines use the alternative by >> representing the current state of traversal of the first simple graph using >> a stack of active states and transiting them all separately (this can be >> implemented with a "bitset" whose size in bits is the number of states in >> the first simple graph, or by using an associative array (dictionnary of >> boolean properties whose keys are state numbers in the first graph, which >> can be set or removed: this requires much less memory and it is relatively >> fast, even if the full current state is not just a single small integer. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 28 15:11:06 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sun, 28 Jan 2018 14:11:06 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: Message-ID: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Marcel Schneider wrote: > We can only hope that now, CLDR is thoroughly re-engineering the way > international or otherwise extended keyboards are mapped. I suspect you already know this and just misspoke, but CLDR doesn't prescribe any vendor's keyboard layouts. CLDR mappings reflect what vendors have released. 
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sun Jan 28 16:44:56 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 28 Jan 2018 22:44:56 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> Message-ID: <20180128224456.2a93f2a1@JRWUBU2> On Sun, 28 Jan 2018 20:29:28 +0100 Philippe Verdy via Unicode wrote: > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > > > On Sat, 27 Jan 2018 14:13:40 -0800The theory > > of regular expressions (though you may not think that mathematical > > regular expressions matter) extends to trace monoids, with the > > disturbing exception that the Kleene star of a regular language is > > not necessarily regular. (The prototypical example is sequences > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are > > canonically equivalent in Unicode terms. A Unicode example is the > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II - > > there is no FSM that will recognise canonically equivalent strings). > > > > I don't see why you can't write this as the regular expression: > (x | y)* Because xx does not match. In principle, it can be done iteratively thus: 1) Look for sequences of x's and y's - your (x | y) * 2) Discard matches from (1) where the number of x's and y's are equal. However, the second step cannot be implemented by a *finite* state machine. > For the Unicode canonical equivalences, this applies to distinct > characters that have distinct non-zero combining classes. Those of us who've looked at optimising collation by reducing normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in theory, a source of many problems. > But of course searching for > > or > > > requires transforming it to NFD first as: That wasn't what I had in mind. What I had in mind was accepting the propositions that the string <LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW. > > > so thet the regexp transforms to: > > [[ [^[[:cc=0:]]] - > [[:cc=above:][:cc=below:]] ]] * ( * > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * BELOW> | * [[ [^[[:cc=0:]]] - > BELOW> [[:cc=above:][:cc=below:]] > ]] * < COMBINING CIRCUMFLEX> If everything is converted to NFD, the regular expressions using traces can be converted to frequently unintelligible regexes on strings; the behaviour of the converted regex when faced with an unnormalised string is of course irrelevant. In the search you have in mind, the converted regex for use with NFD strings is actually intelligible and simple: <LATIN SMALL LETTER A> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING DOT BELOW> [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * <COMBINING CIRCUMFLEX> Informal notation can simplify the regex still further. There is no upper bound to the length of a string matching that regex, though examples in correctly spelt natural languages are quite limited in length. Of course, what one is interested in is the input form of the match. That can be in three parts, and some of the parts may contain parts of composed characters. There isn't a widely used notation for such discontiguous, character-splitting substrings. What can get nasty is NFD regexes for things like [[:InPC=Top:]] [[:InPC=Bottom:]] You don't want to craft these by hand. You just want it to match and its canonical equivalents, including the NFD form: This is a bottom vowel followed by a top vowel. Richard.
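For anyone who wants to experiment with this kind of search under canonical equivalence without a trace-aware regex engine, a rough hand-rolled sketch follows (Python 3, standard unicodedata only; the function name and its simplifications, a single base character plus marks and greedy scanning, are only illustrative and are not taken from the thread):

    import unicodedata as ud

    def contains_equivalent(text, pattern):
        """Does NFD(pattern), a base character plus combining marks, occur in
        text under canonical equivalence?  Marks of other combining classes may
        intervene; a mark of a class used by the pattern blocks the match,
        since marks of equal class do not commute.  (Illustrative sketch only.)"""
        t = ud.normalize("NFD", text)
        p = ud.normalize("NFD", pattern)
        base, marks = p[0], list(p[1:])
        blocking = {ud.combining(m) for m in marks}
        i = t.find(base)
        while i != -1:
            wanted = marks.copy()
            j = i + 1
            while j < len(t) and ud.combining(t[j]) != 0:
                if wanted and t[j] == wanted[0]:
                    wanted.pop(0)          # next mark of the pattern, in NFD order
                elif ud.combining(t[j]) in blocking:
                    break                  # a competing mark of the same class
                j += 1
            if not wanted:
                return True
            i = t.find(base, i + 1)
        return False

    # <a, COMBINING DOT BELOW, COMBINING CIRCUMFLEX> contains both a-circumflex
    # (U+00E2) and a-dot-below (U+1EA1) under canonical equivalence:
    s = "a\u0323\u0302"
    print(contains_equivalent(s, "\u00E2"))   # True
    print(contains_equivalent(s, "\u1EA1"))   # True
    print(contains_equivalent(s, "\u00E4"))   # False (no diaeresis present)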
From unicode at unicode.org Sun Jan 28 17:04:37 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 28 Jan 2018 15:04:37 -0800 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: One addition: with the expansion of keyboards in http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html we are looking to expand the repository to not merely represent those, but to also serve as a resource that vendors can draw on. Mark On Sun, Jan 28, 2018 at 1:11 PM, Doug Ewell via Unicode wrote: > Marcel Schneider wrote: > > We can only hope that now, CLDR is thoroughly re-engineering the way >> international or otherwise extended keyboards are mapped. >> > > I suspect you already know this and just misspoke, but CLDR doesn't > prescribe any vendor's keyboard layouts. CLDR mappings reflect what vendors > have released. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Jan 28 17:20:16 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sun, 28 Jan 2018 16:20:16 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: Mark Davis wrote: > One addition: with the expansion of keyboards in > http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html > we are looking to expand the repository to not merely represent those, > but to also serve as a resource that vendors can draw on. Would you say, then, that Marcel's statements: "Now that CLDR is sorting out how to improve keyboard layouts, hopefully something falls off to replace the *legacy* US-Intl." and: "We can only hope that now, CLDR is thoroughly re-engineering the way international or otherwise extended keyboards are mapped." reflect the situation accurately? Nothing in the PRI #367 blog post or background document communicated to me that CLDR was going to try to influence vendors to retire these keyboard layouts and replace them with those. I thought it was just about providing a richer CLDR format and syntax to better "support keyboard layouts from all major providers." Please point me to the part I missed. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sun Jan 28 20:17:34 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 29 Jan 2018 03:17:34 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: <160863352.2.1517192254836.JavaMail.www@wwinf1p18> On Sun, 28 Jan 2018 16:20:16 -0700, Doug Ewell wrote: > > Mark Davis wrote: > > > One addition: with the expansion of keyboards in > > http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html > > we are looking to expand the repository to not merely represent those, > > but to also serve as a resource that vendors can draw on. > > Would you say, then, that Marcel's statements: > > "Now that CLDR is sorting out how to improve keyboard layouts, hopefully > something falls off to replace the *legacy* US-Intl." > > and: > > "We can only hope that now, CLDR is thoroughly re-engineering the way > international or otherwise extended keyboards are mapped." > > reflect the situation accurately? 
> > Nothing in the PRI #367 blog post or background document communicated to > me that CLDR was going to try to influence vendors to retire these > keyboard layouts and replace them with those. I thought it was just > about providing a richer CLDR format and syntax to better "support > keyboard layouts from all major providers." Please point me to the part > I missed. A replacement candidate for US-International would only be a handy fall-off, and it is up to MIcrosoft to decide whether it has the potential to enhance UX. It all started up when Mark accepted to embrace the idea of adding a Numbers modifier and a Programmer toggle after submission of CLDR ticket #10851: unicode.org/cldr/trac/ticket/10851 I figure out that the working group is on a proof of concept. I?m trying to make up some additions by the deadline of PRI #367. Regards, Marcel From unicode at unicode.org Sun Jan 28 21:55:22 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 29 Jan 2018 04:55:22 +0100 (CET) Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: <1014996228.85.1517198122492.JavaMail.www@wwinf1p18> On Sun, 28 Jan 2018 14:11:06 -0700, Doug Ewell wrote: > > Marcel Schneider wrote: > > > We can only hope that now, CLDR is thoroughly re-engineering the way > > international or otherwise extended keyboards are mapped. > > I suspect you already know this and just misspoke, but CLDR doesn't > prescribe any vendor's keyboard layouts. CLDR mappings reflect what > vendors have released. Sorry I didn?t see the thread until I replied at the point where it is. But looking harder I can see that what I meant when trying to input my concern into the project, is already implied by the wording of the initial blog post (Mark has shared the link of: http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html ) when it comes to a detailed overview of the goals: ?As a part of this work, keyboards [?] provide better layouts overall.? E.g. a Numbers modifier is required for locales using U+202F NARROW NO-BREAK SPACE as a thousands separator (and is useful for all others), while a Programmer toggle is required on keyboards using the upper row for special letters lower-and uppercase, and is handy for all those that have dead keys in the righthand part. Windows Vietnamese is one example, and Michael Kaplan wrote a series of blog posts about it, that you know well: http://archives.miloush.net/michkap/archive/2005/08/27/457224.html http://archives.miloush.net/michkap/archive/2005/11/11/491349.html http://archives.miloush.net/michkap/archive/2007/01/31/1564299.html I was aware that CLDR is a repository, and now I?m amazed how things go on. Regards, Marcel From unicode at unicode.org Sun Jan 28 23:56:25 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 28 Jan 2018 21:56:25 -0800 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: On Sun, Jan 28, 2018 at 3:20 PM, Doug Ewell wrote: > Mark Davis wrote: > > One addition: with the expansion of keyboards in >> http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html >> we are looking to expand the repository to not merely represent those, >> but to also serve as a resource that vendors can draw on. 
>> > > Would you say, then, that Marcel's statements: > > "Now that CLDR is sorting out how to improve keyboard layouts, hopefully > something falls off to replace the *legacy* US-Intl." > > and: > > "We can only hope that now, CLDR is thoroughly re-engineering the way > international or otherwise extended keyboards are mapped." > > reflect the situation accurately? > > Nothing in the PRI #367 blog post or background document communicated to > me that CLDR was going to try to influence vendors to retire these keyboard > layouts and replace them with those. I thought it was just about providing > a richer CLDR format and syntax to better "support keyboard layouts from > all major providers." Please point me to the part I missed. Your message didn't quote ?the part about ?replace the *legacy* US-Intl."? The PRI blog post is talking about the technical changes, not process. The goal there is to be able to represent keyboard structures and data in a "lingua franca", and to expand the features needed to cover more languages and more vendor requirements. Of course, more extensions will be needed in the future, as well. As far as process goes, we foresee (a) continuing to reflect what is being used in practice, and (b) extending to a repository for keyboards for languages that are not represented by current vendors. That is to enable vendors to easily add keyboards for support of additional languages, if they want. It is not a goal to get "vendors to retire these keyboard layouts and replace them" ? that's not our role. (And I'm sure that a lot of people like and would continue to use the Windows Intl keyboard.) It's more about making it easier to have more choice available for users: more languages, and more choice of layouts within a language that meet people's needs. > > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 29 00:16:04 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 29 Jan 2018 07:16:04 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: <20180128224456.2a93f2a1@JRWUBU2> References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> Message-ID: 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Sun, 28 Jan 2018 20:29:28 +0100 > Philippe Verdy via Unicode wrote: > > > 2018-01-28 5:12 GMT+01:00 Richard Wordingham via Unicode < > > unicode at unicode.org>: > > > > > On Sat, 27 Jan 2018 14:13:40 -0800The theory > > > of regular expressions (though you may not think that mathematical > > > regular expressions matter) extends to trace monoids, with the > > > disturbing exception that the Kleene star of a regular language is > > > not necessarily regular. (The prototypical example is sequences > > > (xy)^n where x and y are distinct and commute, i.e. xy and yx are > > > canonically equivalent in Unicode terms. A Unicode example is the > > > set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II - > > > there is no FSM that will recognise canonically equivalent strings). > > > > > > > I don't see why you can't write this as the regular expression: > > (x | y)* > > Because xx does not match. > > In principle, it can be done iteratively thus: > > 1) Look for sequences of x's and y's - your (x | y) * > 2) Discard matches from (1) where the number of x's and y's are equal. 
> > However, the second step cannot be implemented by a *finite* state > machine. > > > For the Unicode canonical equivalences, this applies to distinct > > characters that have distinct non-zero combining classes. > > Those of us who've looked at optimising collation by reducing > normalisation will recognise U+0F73 TIBETAN VOWEL SIGN II as, in > theory, a source of many problems. > > > But of course searching for > > > > or > > > > > > requires transforming it to NFD first as: > > That wasn't I had in mind. What I had in mind was accepting the > propositions that the string BELOW, COMBINING CIRCUMFLEX> contains both LATIN SMALL LETTER A WITH > CIRCUMFLEX and LATIN SMALL LETTER A WITH DOT BELOW. > > > > > > > so thet the regexp transforms to: > > > > [[ [^[[:cc=0:]]] - > > [[:cc=above:][:cc=below:]] ]] * ( * > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > BELOW> | * [[ [^[[:cc=0:]]] - > > BELOW> [[:cc=above:][:cc=below:]] > > ]] * < COMBINING CIRCUMFLEX> > > If everything is converted to NFD, the regular expressions using traces > can be converted to frequently unintelligible regexes on strings; the > behaviour of the converted regex when faced with an unnormalised string > is of course irrelevant. > > In the search you have in mind, the converted regex for use with NFD > strings is actually intelligible and simple: > > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > > Informal notation can simplify the regex still further. > > There is no upper bound to the length of a string matching that regex, > Wrong, you've not read what followed immediately that commented it already: it IS bound exactly because you cannot duplicate the same combining class, and there's a known finite number of them for acceptable cases: if there's any repetition, it will always be within that bound. But it not necessay to expand all the combinations of combining classes to all their possible ordering of occurence (something that a classic regexp normally requires by expecting a specific order). One way to solve it would have to have (generic) regexp extension allowing to specify a combination of one or more of several items in a choice list in any order, but never more than one occurence of each of item. This is possible using a rule with boolean flags of presence, one boolean for each item in the choice list. Something like {?a|b|c|d} matching zero or more (or all of them) of a,b,c,d (these can be subregexps) in any order, and {?+a|b|c|d} matching one or more, and {?{m,n}a|b|c|d} matching betwen m and n of them (in any order in all cases) So that {?a|b|c|d}{1,1} is the same as (a|b|c|d) but without the capture, i.e. (?:a|b|c|d), and {?{m,n}a} is the same as a{m,n}, and {?+a} is the same as a, and {?*a} is the same as a? Which can also be written respectrively as {?*[abcd]}, {?+[abcd]} and {?{m,n}[abcd]) if the items of the choice list are characters that can be compacted with the classic "character class" notation [abcd]. In all these the "{?quantifier list}" notation is always bound by the number of items in the list (independantly of the quantifier, and if individual items in the list are bound in length, the whole will be bound by the sum of their lengths. 
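No existing regexp engine has such a construct, but the presence-flag idea is easy to prototype outside regexp syntax; below is a greedy, non-backtracking sketch in Python 3 (the function and its behaviour are only an illustration of the idea under those assumptions, not an implementation of the proposed syntax):

    def match_any_order(items, text, pos=0, min_items=1, max_items=None):
        """Sketch of the proposed "{quantifier item|item|...}" idea: starting at
        pos, match each item at most once, in any order, greedily and without
        backtracking.  Returns the end position if the number of matched items
        falls within [min_items, max_items], otherwise None."""
        if max_items is None:
            max_items = len(items)
        used = [False] * len(items)        # one presence flag per item
        matched, i = 0, pos
        progress = True
        while progress and matched < max_items:
            progress = False
            for k, item in enumerate(items):
                if not used[k] and text.startswith(item, i):
                    used[k] = True
                    matched += 1
                    i += len(item)
                    progress = True
                    break
        return i if min_items <= matched <= max_items else None

    print(match_any_order(["a", "b", "c"], "bca"))   # 3: all three items, any order
    print(match_any_order(["a", "b", "c"], "bb"))    # 1: the second "b" cannot reuse an item
    print(match_any_order(["a", "b", "c"], "xyz"))   # None: nothing matched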
So even if the quantifier is higher than the number of items in the list, it will be capped: "{?{1000}a}" will only match "a", and "{?{1000}}" will never match anything (because the list is empty: the specified upper bound 1000 is capped to 0, but the specified lower bound 1000 is capped to 1, which is impossible); it is also equivalent to "{?}", where the min-max bounds are 1 by default but are capped to 1,0.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Mon Jan 29 00:37:46 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 29 Jan 2018 07:37:46 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: 
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2>
Message-ID: 

I made an error in the character class notation: "{?optionalquantifier[class]}" should be just "{optionalquantifier[class]}"...

So "{?[abc]}" contains one item, "[abc]", to choose from in any order; it is not quantified explicitly, so it matches by default 1 or more, but as there is only one item it will match just one "[abc]".
But "{[abc]}" contains three items from the class "[abc]" to choose from in any order, so it will match "a", "b", "c", "ab", "ba", "ac", "ca", "abc", "acb", "bac", "bca", "cab" or "cba".
And "{{1}[abc]}" is quantified to match one and only one item; it is equivalent to "[abc]" and matches only "a", "b", or "c".
And "{{0}[abc]}" is quantified to match zero items (the items are not relevant) and will never match anything, just like "{{0}a|b|c}" or "{{0}}".
And "{{2}[abc]}" or "{{2,2}[abc]}" is quantified to match exactly two items from the character class, and matches only "ab", "ba", "ac", "ca", "bc" or "cb"; it is equivalent to "{{2,2}a|b|c}" or "{{2}a|b|c}".

With that extension you can build the necessary regexps to match canonically equivalent strings with a finite regexp.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Mon Jan 29 01:54:29 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 29 Jan 2018 08:54:29 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: 
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2>
Message-ID: 

You may also wonder why I describe a regexp that would never match anything but would itself be handled as a successful match: it is a useful extension that allows stopping the analysis early and generalizes the concept of negation (defined in character classes with the minus operator).

For example "(b{}|[a-z])*" would match any word made of letters [a-z], except that when it encounters any "b" it attempts to match "{}" (which doesn't match anything, as it means "match 1 or more items from a list having no item to choose from"); it succeeds early and invalidates all matches in the alternatives given. In summary it is mostly the same as "[[a-z]-[b]]*" or "[ac-z]*", but if there's a "b" it returns an empty match for it, located just after the "b" found.

Note: the optional quantifiers are the classic ones used in regexps:
- "{m,n}" for m to n occurrences,
- "{n}" for exactly n occurrences (for n>0), equivalent to "{n,n}" (the default quantifier is "{1}" or "{1,1}"), except "{0}" which is equivalent to "{1,0}",
- "{m,}" for at least m occurrences,
- "{,n}" or "{0,n}" for at most n occurrences,
- "?" for 0 or 1 occurrence, equivalent to "{,1}",
- "+" for 1 or more occurrences, equivalent to "{1,}",
- "*" for 0 or more occurrences, equivalent to "{0,}",
- I don't know if we need special greedy/non-greedy quantifiers here ("?*", "?+", "??", "*?", "*+", "+?", "+*", and so on...)

However the quantifiers at the start of this (unordered) "exclusive choice list" extension, with the form "{quantifier item1 | item2 | ...}" or "{quantifier [class]}", do not just count the items: they also disallow multiple occurrences of the same chosen item, unlike the quantifiers used as suffixes (after a character, character class or subregexp between parentheses).

The number of items in the exclusive choice list is never zero, but an item may be a matchable empty string (it cannot be an "empty class", as a character class must match exactly one character chosen from a set of one-character values; the "empty class" is however representable as "{{0}}" using a quantified exclusive choice list). If an item in the exclusive choice list is an empty string and the quantifier of the choice list is not "{0}", the exclusive choice list will always match successfully.

I insist on the term "exclusive", because this is the interesting property that allows bounding the occurrences, and on the term "unordered", which avoids having to combinatorially specify all the possible orders of occurrence of items in the choice list: if the choice list has n items, there are n! possible orders with classic regexps, and an item list with only 13 items would have more than 2^32 orders to specify; in Unicode, we have up to 255 possible non-zero canonical classes, so we cannot represent them all in classic regexps within any computer (253! ~ 5.17e499).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jan 29 02:57:41 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 29 Jan 2018 08:57:41 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> Message-ID: <20180129085741.6fcf00f8@JRWUBU2> On Mon, 29 Jan 2018 07:16:04 +0100 Philippe Verdy via Unicode wrote: > 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > > In the search you have in mind, the converted regex for use with NFD > > strings is actually intelligible and simple: > > > > > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > > > > > Informal notation can simplify the regex still further. > > > > There is no upper bound to the length of a string matching that > > regex, > > Wrong, you've not read what followed immediately that commented it > already: it IS bound exactly because you cannot duplicate the same > combining class, and there's a known finite number of them for > acceptable cases: if there's any repetition, it will always be within > that bound. Are you talking about regular expressions or strings that match them? Natural language text can very easily contain adjacent combining characters of the same combining class - look no further than the full decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND MACRON. For a few combining characters, such as U+1A7F TAI THAM COMBINING CRYPTOGRAMMIC DOT, repetition is of their very essence. One can find pairs of combining circumflexes in plain text maths. Incidentally, I was talking about regular expressions, which imply *finite* state machines, albeit huge, rather then 'regexes', which are similar but may formally require unbounded memory. Richard. From unicode at unicode.org Mon Jan 29 05:26:01 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 29 Jan 2018 12:26:01 +0100 (CET) Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> Message-ID: <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> On Sun, 28 Jan 2018 21:56:25 -0800, Mark Davis replied to Doug Ewell: > > It is not a goal to get "vendors to retire these keyboard layouts and > replace them" ? that's not our role. (And I'm sure that a lot of people > like and would continue to use the Windows Intl keyboard.) Instead of ?replace? I should have written /provide an alternative to/. Discontinuing a major layout variant would be bad practice. Prior to this thread, I believed that the ratio of Windows users liking the US-International vs Mac users liking the US-Extended was like other ?Windows implementation? vs ?Apple implementation? ratios. So far we can tell that failing to be updated, the Windows US-Intl does not allow to write French in a usable manner, as the ?? is still missing, and does not allow to type German correctly neither due to the lack of single angle quotation marks (used in some French locales, too, and perhaps likely to become even more widespread). Of course these are all on the macOS US-Extended. If so many people like it, why was Windows Intl not updated, then? 
(Or has it been for Windows 10, and just not on https://docs.microsoft.com/fr-fr/globalization/keyboards/kbdusx.html while the Keyboard layouts index page has come into the benefit of a slight enhancement of user experience: https://docs.microsoft.com/en-us/globalization/windows-keyboard-layouts ) > > It's more about making it easier to have more choice available for users: > more languages, and more choice of layouts within a language that meet > people's needs. Covering more ? and ideally ALL ? languages is top priority. Marc Durdin of SIL Keyman teaming up with the CLDR enhancement project is very good news. Regards, Marcel (If you wonder why Mark Davis blacklisted me: That happened at the 2015 ?Apostrophe? thread when I was new to this and any other Mailing List.) From unicode at unicode.org Mon Jan 29 06:38:48 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Mon, 29 Jan 2018 13:38:48 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> Message-ID: <1485629394.6192.1517229528665.JavaMail.www@wwinf2230> BTW the 5 dead keys of Windows US Intl are already on Apple?s *normal* US layout, along with the letter o-with-e. US Extended adds 20 more deadkeys. On Sun, 28 Jan 2018 16:20:16 -0700, Doug Ewell wrote: [?] > Nothing in the PRI #367 blog post or background document communicated to > me that CLDR was going to try to influence vendors to retire these > keyboard layouts and replace them with those. [?] The ?replacement? would be on user side, not on vendors side. I was never thinking that Microsoft could put another layout *in the place of US Intl* instead of letting users choose. To like a particular layout does not mean to want to stick with it when anything better comes up. User?s choice is always respected. But users must also respect other people?s orthographies, as seen in the wake of the preceding Kazakh apostrophes thread. Hence we are expected to upgrade our tooling if it proves inappropriate. (Nevertheless the time I?m using an Apple instead of my Windows-7-driven netbook is less than 0.1?%.) Regards, Marcel From unicode at unicode.org Mon Jan 29 07:03:38 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Mon, 29 Jan 2018 13:03:38 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: <20180128041230.26b34022@JRWUBU2> References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> Message-ID: <97507BDF-7908-4BFC-A7F4-1CCE1B90E563@lboro.ac.uk> On 28 Jan 2018, at 04:12, Richard Wordingham via Unicode > wrote: On Sat, 27 Jan 2018 14:13:40 -0800 Shervin Afshar > wrote: On Mon, Jan 22, 2018 at 2:08 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: On Mon, 22 Jan 2018 at 16:39:57, Andre Schappo via Unicode < unicode at unicode.org> wrote: By way of example, one programming challenge I set to students a couple of weeks ago involves diacritics. Please see jsfiddle.net/coas/wda45gLp Did any of them come up with the idea of using traces instead of strings? Cor Blimey?? I am really pleased if the students have even heard of Unicode let alone heard of trace monoid?? ...and I confess, I knew nothing of trace monoid until I read the below wikipedia article but then again my ignorance is profound?? BTW. 
these internationalised computer science exercises I have written and am writing are not part of any course or module and so are optional. In providing such exercises I am hoping to spark an interest in Unicode and internationalisation. I wrote a couple more yesterday: jsfiddle.net/coas/3c7y88ot & jsfiddle.net/coas/aau8cqaw

André Schappo

Care to elaborate? Are you referring to sequence alignment methods?

No, I'm thinking of the trace monoid (see e.g. https://en.wikipedia.org/wiki/Trace_monoid). One way of thinking of strings is as concatenations of the NFD decompositions of their constituent characters. Then the canonical equivalence classes of these strings form the trace monoid of indecomposable characters. The theory of regular expressions (though you may not think that mathematical regular expressions matter) extends to trace monoids, with the disturbing exception that the Kleene star of a regular language is not necessarily regular. (The prototypical example is sequences (xy)^n where x and y are distinct and commute, i.e. xy and yx are canonically equivalent in Unicode terms. A Unicode example is the set of strings composed only of U+0F73 TIBETAN VOWEL SIGN II - there is no FSM that will recognise canonically equivalent strings).

One consequence of this view is that one has to think of U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW (ậ) being both composed of the Vietnamese vowel letter U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â) and tone mark U+0323 COMBINING DOT BELOW, and also composed, in the spirit of Thai ISO 11940 transliteration, of the transliterated Thai vowel U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW (ạ), corresponding to U+0E31 THAI CHARACTER MAI HAN-AKAT, and the tone mark U+0302 COMBINING CIRCUMFLEX ACCENT, corresponding to U+0E49 THAI CHARACTER MAI THO. (In ISO 11940 as specified, the tone mark is actually written on the immediately preceding consonant, not on the vowel.)

Richard.

André Schappo
schappo.blogspot.co.uk twitter.com/andreschappo weibo.com/andreschappo groups.google.com/forum/#!forum/computer-science-curriculum-internationalization

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org  Mon Jan 29 07:15:04 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 29 Jan 2018 14:15:04 +0100
Subject: Internationalised Computer Science Exercises
In-Reply-To: <20180129085741.6fcf00f8@JRWUBU2>
References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2>
Message-ID: 

No, since the beginning we were talking about matching strings that are canonically equivalent within regexps. So searching with a regexp containing precomposed or decomposed characters would find them independently of the encoded form (normalized or not) of the input, and independently of whether there are additional combining characters inserted between them.

The case of u with diaeresis and macron is simpler: it has two combining characters of the same combining class and they don't commute, still the regexp to match it is something like:

U [[:cc>0:]-[:cc=above:]]* <COMBINING DIAERESIS> [[:cc>0:]-[:cc=above:]]* <COMBINING MACRON> [[:cc>0:]-[:cc=above:]]*

The source is simply decomposed (it does not need to be normalized to NFD) and matched according to this transformed regexp, but it does not need the "{exclusive choice list}" notation here, because DIAERESIS and MACRON do not commute.
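(For readers who want to check the combining-class facts this exchange keeps returning to, the following short snippet does so with nothing but Python's standard unicodedata module; it is purely illustrative and independent of the regexp notations discussed above.)

    # Illustrative only: code points and combining classes come from UnicodeData.
    import unicodedata as ud

    # U+01D6 decomposes to u + COMBINING DIAERESIS + COMBINING MACRON, both marks ccc 230.
    print([f"U+{ord(c):04X}" for c in ud.normalize("NFD", "\u01D6")])
    # -> ['U+0075', 'U+0308', 'U+0304']
    print(ud.combining("\u0308"), ud.combining("\u0304"))        # -> 230 230

    # Same class: the marks do not commute, so the two orders are NOT canonically
    # equivalent (only one of them composes back to U+01D6).
    print(ud.normalize("NFC", "u\u0308\u0304") == ud.normalize("NFC", "u\u0304\u0308"))  # -> False

    # Different classes: dot below (220) and circumflex (230) do commute, so both
    # orders normalize to the same NFD sequence and are canonically equivalent.
    print(ud.normalize("NFD", "a\u0323\u0302") == ud.normalize("NFD", "a\u0302\u0323"))  # -> True

    # U+0F73 decomposes to U+0F71 (ccc 129) + U+0F72 (ccc 130); in a run of several
    # U+0F73 the parts interleave freely under canonical reordering.
    print([f"U+{ord(c):04X}" for c in ud.normalize("NFD", "\u0F73\u0F73")])
    # -> ['U+0F71', 'U+0F71', 'U+0F72', 'U+0F72']

The last result is the concrete form of the (xy)^n problem mentioned earlier: a run of n U+0F73 is canonically equivalent to every interleaving of n U+0F71 and n U+0F72, and recognising "equal numbers of two commuting symbols" is exactly what a finite-state machine cannot do.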
2018-01-29 9:57 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 29 Jan 2018 07:16:04 +0100 > Philippe Verdy via Unicode wrote: > > > 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode < > > unicode at unicode.org>: > > > > In the search you have in mind, the converted regex for use with NFD > > > strings is actually intelligible and simple: > > > > > > > > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > > > > > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * > > > > > > > > > Informal notation can simplify the regex still further. > > > > > > There is no upper bound to the length of a string matching that > > > regex, > > > > Wrong, you've not read what followed immediately that commented it > > already: it IS bound exactly because you cannot duplicate the same > > combining class, and there's a known finite number of them for > > acceptable cases: if there's any repetition, it will always be within > > that bound. > > Are you talking about regular expressions or strings that match them? > Natural language text can very easily contain adjacent combining > characters of the same combining class - look no further than the > full decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND > MACRON. For a few combining characters, such as U+1A7F TAI THAM > COMBINING CRYPTOGRAMMIC DOT, repetition is of their very essence. > One can find pairs of combining circumflexes in plain text maths. > > Incidentally, I was talking about regular expressions, which imply > *finite* state machines, albeit huge, rather then 'regexes', which are > similar but may formally require unbounded memory. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 29 12:13:21 2018 From: unicode at unicode.org (Tom Gewecke via Unicode) Date: Mon, 29 Jan 2018 11:13:21 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> Message-ID: > On Jan 29, 2018, at 4:26 AM, Marcel Schneider via Unicode wrote: > > > the Windows US-Intl > does not allow to write French in a usable manner, as the ?? is still > missing, and does not allow to type German correctly neither due to > the lack of single angle quotation marks (used in some French locales, > too, and perhaps likely to become even more widespread). Of course > these are all on the macOS US-Extended. They are also all on the MacOS "US International PC", provided since 2009 by Apple for Windows users who like US International. ? ? are on alt and alt-shift q ?? are on alt-shift 3/4 (US Extended has also been renamed ABC Extended back in 2015) From unicode at unicode.org Mon Jan 29 14:53:05 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 29 Jan 2018 20:53:05 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> Message-ID: <20180129205305.5d5d202d@JRWUBU2> On Mon, 29 Jan 2018 14:15:04 +0100 Philippe Verdy via Unicode wrote: > No since the begining we were talking about matching strings that are > canonically equivalent within regexps. 
So that searching for a regexp > containing precombined characters or decomposed characters would find > them independantly of the encoded form (normalized or not) of the > input and independantly that there are addtional combining characters > inserted between them. OK, we're taking different approaches. Given finite automata for recognising NFD strings matching regular languages A and B of traces, I know how to construct a non-deterministic finite automaton for recognising any of AB, A and B, A or B, and, if itself a regular language, A*, where the sets denoted are traces. For AB, the states I keep track of are states of A, states of B, and A ? 255 ? B, where the 2nd coordinate of the latter is the ccc of the latest element of the searched string used to propagate a state of B. If I didn't normalise the searched string, I'd have to keep a list of ccc's of characters used in propagating states of B. That gets complicated with A*, for which in theory I need the simultaneous progression of 255 FSMs (OK, only about 52 at the moment). I actually treat A* as A(A*), where no capture is implied by the parentheses. Theory says that if NFD([A]) is a regular language (where [x] is the set of strings in the canonical equivalence class x), then A is a regular language. However, constructing a finite automaton to check for matches may not be straightforward. I prefer to sacrifice the purity of finiteness by buffering enough of the searched string to convert it to NFD on the fly. As an example of the complexity, consider checking whether a string was composed of pairs of identical combining characters. > The case of u with diaeresis and macron is simpler: it has two > combining characters of the same combining class and they don't > commute, still the regexp to match it is something like: > > U [[:cc>0:]-[:cc=above:]]* [[:cc>0:]-[:cc=above:]]* > [[:cc>0:]-[:cc=above:]]* was meant to be an example of a searched string. For example, contains, under canonical equivalence, the substring . Your regular expressions would not detect this relationship. Richard. From unicode at unicode.org Mon Jan 29 17:07:11 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 29 Jan 2018 16:07:11 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new =?UTF-8?Q?character=3F=29?= Message-ID: <20180129160711.665a7a7059d7ee80bb4d670165c8327d.7964f4dccc.wbe@email03.godaddy.com> Marcel Schneider wrote: > Prior to this thread, I believed that the ratio of Windows users > liking the US-International vs Mac users liking the US-Extended was > like other ?Windows implementation? vs ?Apple implementation? ratios. For many users, it may not be a question of what they like, but rather (a) what they are aware of, (b) what comes standard with their Windows installation, and (c) in the workplace, what their IT overlords have granted them permission to use. I use a modified version of John Cowan's "Moby Latin" layout on all my machines: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html which allows me to type about 900 characters *in addition* to Basic Latin, with 100% backward compatibility with U.S. English (i.e. none of the apostrophe and quotation-mark shenanigans we are talking about). But (a) I happen to know about Moby Latin, (b) it doesn't ship with Windows, and (c) I am able to install it (and even modify it). Many users do not have all or even any of these luxuries. 
There is perhaps another factor: many Americans, who are probably the majority users of US-International though not the only ones, simply do not know or care about accents and other "foreign stuff." Even those who know a language other than English often write it in ASCII, and see it that way in marketing and other professionally created material. For example, menus in Mexican restaurants often list "albondigas" and "jalapenos." The non-phonetic spelling of English may further encourage English-only speakers to ignore the squiggles and dots that are necessary to indicate correct pronunciation of other languages. Given that, interest among potential users of US-International to find a better solution is probably very low. > If so many people like it, why was Windows Intl not updated, then? 1. I'd be surprised if there were "so many people," or much demand to update it. Microsoft might have a few other items on their backlogs. 2. I don't speak for Microsoft, but there is often fear of making changes to existing standards, even changes that fill in holes in the standard. Users who type a formerly invalid sequence and get a valid character, instead of the beep or question mark they once got, and complain about the change, might seem to be a low-priority constituency, but you'd be surprised. > To like a particular layout does not mean to want to stick with it > when anything better comes up. User?s choice is always respected. See above regarding what users might like if only they had a choice. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 29 20:08:55 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 29 Jan 2018 19:08:55 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) Message-ID: <6449615512BC4379B3028CE79217DB8B@DougEwell> > (b) it doesn't ship with Windows Of course that is not a "luxury." Knowing that third-party options are available, let alone free and easily installed ones, is the luxury. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Jan 29 23:09:10 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 06:09:10 +0100 (CET) Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) Message-ID: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> On Mon, 29 Jan 2018 16:07:11 -0700, Doug Ewell wrote: > > Marcel Schneider wrote: > > > Prior to this thread, I believed that the ratio of Windows users > > liking the US-International vs Mac users liking the US-Extended was > > like other ?Windows implementation? vs ?Apple implementation? ratios. > > For many users, it may not be a question of what they like, but rather > (a) what they are aware of, (b) what comes standard with their Windows > installation, and (c) in the workplace, what their IT overlords have > granted them permission to use. c: Hierarchical relationships may be complicated in some places but generally there should be an open door, and suggestion boxes or pinwalls may also be available. As an ?overuser? the IT manager must think and evaluate in the place of his employees/coworkers, but professional stress and the desire to unplug during week-ends could be mainly responsible of his unawareness. Though I thought that for professionalism?s sake they should deploy an appropriate layout fork, and I can?t see any point in not using MSKLC at that level. 
b: There is often much reluctance to add only an extra font, and I hesitated a long time prior to installing Firefox or any other extraneous (third-party) software, but that is actually so common ? Chrome browser doesn?t ship with Windows neither but we?re often encouraged to download it ? that failing to update one?s keyboard is due to a lack of marketing and thus, resolves to point (a). > > I use a modified version of John Cowan's "Moby Latin" layout on all my > machines: It would be interesting to know more about your modifications. > > http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html Sadly the downloads are still unavailable (as formerly discussed). But I saved in time, too (June 2015). > > which allows me to type about 900 characters *in addition* to Basic > Latin, with 100% backward compatibility with U.S. English (i.e. none of > the apostrophe and quotation-mark shenanigans we are talking about). But > (a) I happen to know about Moby Latin, (b) it doesn't ship with Windows, [Addition on Mon, 29 Jan 2018 19:08:55 -0700: > Of course that is not a "luxury." Knowing that third-party options are > available, let alone free and easily installed ones, is the luxury. ] > and (c) I am able to install it (and even modify it). Many users do not > have all or even any of these luxuries. I agree. It is a blessing to be able to fine-tune one?s keyboard. Often those knowing about correct diacritics but not about keyboarding can?t help clicking in charmaps and symbol dialogs at document length. That is utter time waste. And that?s what is wrong about letting people waste their time while knowing better. > > There is perhaps another factor: many Americans, who are probably the > majority users of US-International though not the only ones, simply do > not know or care about accents and other "foreign stuff." Even those who > know a language other than English often write it in ASCII, and see it > that way in marketing and other professionally created material. For > example, menus in Mexican restaurants often list "albondigas" and > "jalapenos." French too, though being an accented language/script, is prone to omitting other locale?s diacritics as far as they are supposed not to be on the AZERTY. (It happened that a tilde was refused for lack of support, while we do have that dead key!) The more when accents merely indicate correct intonation, as in ?alb?ndigas.? When reading ?jalapenos? we reflexively add the tilde for spelling. Compare with locales writing consonants only. Writing in ASCII is also a sort of assimilation, as we all like to name things in our own language. > > The non-phonetic spelling of English may further encourage English-only > speakers to ignore the squiggles and dots that are necessary to indicate > correct pronunciation of other languages. That can be OK for common words, but gets dangerous when it comes to proper names. There?s a slippery path from easy to sloppy. It?s still about i18n. But I wasn?t aware that by not using diacritics and having to do much effort to remember correct spelling, English-speaking people may really hate those diacritics even when occurring on foreign stuff. > > Given that, interest among potential users of US-International to find a > better solution is probably very low. I was about to make a quick update of the US-Intl [or ABC-Intl], but if so, I could eventually save that for now. > > > If so many people like it, why was Windows Intl not updated, then? > > 1. 
I'd be surprised if there were "so many people," or much demand to > update it. Microsoft might have a few other items on their backlogs. However, resulting from the KLC files you kindly provided me, as of Latin keyboard layouts shipped with Windows, from Windows 7 to Windows 8, eight layouts were updated, including United Kingdom, Turkish, German and Inuktitut, both variants for each locale. Not US. For completeness? sake, we?ll also metion that five new layouts were added: Azerbaijani (Standard) English (India) Hausa Hawaiian Latvian (Standard) Soon, thanks to enriched CLDR, many many more should be released by the means of Windows Update. > > 2. I don't speak for Microsoft, but there is often fear of making > changes to existing standards, even changes that fill in holes in the > standard. Users who type a formerly invalid sequence and get a valid > character, instead of the beep or question mark they once got, and > complain about the change, might seem to be a low-priority constituency, > but you'd be surprised. Indeed, I am! > > > To like a particular layout does not mean to want to stick with it > > when anything better comes up. User?s choice is always respected. > > See above regarding what users might like if only they had a choice. We won?t blame end-users sticking with the choices of other people they?re paid by, and who are those customers who could request changes that vendors would be ready to accept but who presumably don?t for whatever reasons. Regards, Marcel From unicode at unicode.org Mon Jan 29 23:26:54 2018 From: unicode at unicode.org (via Unicode) Date: Tue, 30 Jan 2018 13:26:54 +0800 Subject: Support for Extension F In-Reply-To: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> References: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> Message-ID: <67684ce4dc9661d1d64c5e7aa730d8a9@koremail.com> Dear All, As many of you are aware getting characters encoded is only half the battle, enabling people to use them is the other half. CJK Extenion F was added last year in Unicode 10. I have come across a number of people saying they are having problems with Ext F. I was wondering what the current support is for Ext F at OS level and in terms of fonts. Regards John Knightley From unicode at unicode.org Mon Jan 29 23:31:34 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 06:31:34 +0100 (CET) Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> Message-ID: <412077223.147.1517290294530.JavaMail.www@wwinf1d20> OnMon, 29 Jan 2018 11:13:21 -0700, Tom Gewecke wrote: > > > On Jan 29, 2018, at 4:26 AM, Marcel Schneider via Unicode wrote: > > > > > > the Windows US-Intl > > does not allow to write French in a usable manner, as the ?? is still > > missing, and does not allow to type German correctly neither due to > > the lack of single angle quotation marks (used in some French locales, > > too, and perhaps likely to become even more widespread). Of course > > these are all on the macOS US-Extended. > > They are also all on the MacOS "US International PC", provided since 2009 by Apple > for Windows users who like US International. I suppose that this layout ships with the Windows emulation that can be run on a Mac. It?s hard to find through especially when I can?t see the layout or find on the internet. Thanks anyway. 
They seem to be always first, and then, other wendors can?t copy nor invent something else people won?t like. > > ? ? are on alt and alt-shift q > > ?? are on alt-shift 3/4 Then this is ported from the Apple US layout, where these characters are in the same places. However that does not include correct spacing, as required for French. > > (US Extended has also been renamed ABC Extended back in 2015) Presumably because it is interesting for many locales worldwide accustomed to the US QWERTY layout. That tends to prove that Mac users accept changes, while Windows users refuse changes. However I fail to understand such a discrepancy. Regards, Marcel From unicode at unicode.org Tue Jan 30 01:18:49 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 30 Jan 2018 08:18:49 +0100 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <412077223.147.1517290294530.JavaMail.www@wwinf1d20> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> Message-ID: I have always wondered why Microsoft did not push itself at least the five simple additions needed since long in French for the French AZERTY LAYOUT: - [AltGr]+[?] to produce the cedilla dead key (needed only before capital C in French) : this is frequently needed, the alternative would be [AltGr]+[C] to map "?" without the dead key; spell checkers forget the frequent words: ?a or ?'. - [AltGr]+[1&] to produce the acute accent dead key (similar to [AltGr+7?`] giving the grave accent deadkey) : this is the most frequent missing letter we need all the time. - [AltGr]+[O] to produce "?" (without ShiftLock or CapsLock mode enabled), or "?" (in ShiftLock or CapsLock mode), and [AltGr]+[Shift]+[O] to produce "?" (independantly of [ShiftLock] which is disabled by [Shift], but without [CapsLock]) or "?" (independantly of [CapsLock], but without [ShiftLock]) : this is needed occasionnaly for very few common words, the most frequent omission is "?uf" or its plural "?ufs". - [AltGr]+[A] to produce "?" (without ShiftLock or CapsLock mode enabled), or "?" (in ShiftLock or CapsLock mode), and [AltGr]+[Shift]+[O] to produce " ?" (independantly of [ShiftLock] which is disabled by [Shift], but without [CapsLock]) or "?" (independantly of [CapsLock], but without [ShiftLock]) : this is rarely needed, except for a few words borrowed from Latin used in biology or some legal/judiciary terminology. - Adding Y to the list of allowed letters after the dieresis deadkey to produce "?" : the most frequent case is L'HA?E-L?S-ROSES, the official name of a French municipality when written with full capitalisation, almost all spell checkers often forget to correct capitalized names such as this one. This would allow typing French completely including on initial capitals. All other French capital letters can be typed (????? with the circumflex dead key, ???? with the dieresis dead key which already allows ?? not needed for French but for Alsatian or some names borrowed from German). But we have mappings already in the AZERTY layout for: - the tilde as a dead key on [AltGr]+[2?~], even if it is not used for French but only for "?" or "?" in names from Spanish or Breton, " ??" not needed at all, /??/ needed only for standard French IPA phonetics where we still can't type /?????/ for French phonetics - the grave accent as a dead key on [AltGr]+[7?`], needed for "??" but allowing also "???" not used at all in French. 
There's not any good rationale in the French AZERTY layout to keep it incomplete on capitals while maintaining other capital letters with diacritics composed with dead keys but not needed at all in French, except the case of "???" missing from ISO 8859-1 but present in Windows-1252. ---- Using the Windows "Charmap" accessory with the "Unicode" charset and "Latin" subset is still too difficult to locate the missing letters, as it is only sorted by code point value but still does not cover all Latin letters; the Windows "Charmap" tool is usable for French only when selecting the Windows-1252 charset (aka "Windows : Occidental"). But I don't understand why this accessory cannot simply add some rows at top of the table for the current language selected on the "Languages Bar", or why it does not simply features the complete alphabet of the current language, sorted correctly according to CLDR rules for that language (not sorted randomly by code point value) to make it really usable. If we select another subset, it should also be sortable according to language rules (or CLDR default root otherwise) and not according to code point value: this could be a simple checkbox or a pair of radio buttons (binary sort, or alphabetic sort). Finally, the Charmap tool should be updated to add missing characters that are not covered in the "Unicode" charset selection, even if they are encoded in Unicode and really mapped in fonts: the coverage of proposed "subsets" is an extremely old version of Unicode. 2018-01-30 6:31 GMT+01:00 Marcel Schneider via Unicode : > OnMon, 29 Jan 2018 11:13:21 -0700, Tom Gewecke wrote: > > > > > On Jan 29, 2018, at 4:26 AM, Marcel Schneider via Unicode wrote: > > > > > > > > > the Windows US-Intl > > > does not allow to write French in a usable manner, as the ?? is still > > > missing, and does not allow to type German correctly neither due to > > > the lack of single angle quotation marks (used in some French locales, > > > too, and perhaps likely to become even more widespread). Of course > > > these are all on the macOS US-Extended. > > > > They are also all on the MacOS "US International PC", provided since > 2009 by Apple > > for Windows users who like US International. > > I suppose that this layout ships with the Windows emulation that can be > run on a Mac. > It?s hard to find through especially when I can?t see the layout or find > on the internet. > Thanks anyway. They seem to be always first, and then, other wendors can?t > copy nor > invent something else people won?t like. > > > > > ? ? are on alt and alt-shift q > > > > ?? are on alt-shift 3/4 > > Then this is ported from the Apple US layout, where these characters are > in the same > places. However that does not include correct spacing, as required for > French. > > > > > (US Extended has also been renamed ABC Extended back in 2015) > > Presumably because it is interesting for many locales worldwide accustomed > to the > US QWERTY layout. That tends to prove that Mac users accept changes, while > Windows users refuse changes. However I fail to understand such a > discrepancy. > > Regards, > > Marcel > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Jan 30 02:06:52 2018 From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode) Date: Tue, 30 Jan 2018 17:06:52 +0900 Subject: Keyboard layouts and CLDR In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> Message-ID: <295ef0d0-c585-632c-09fb-05dc25d7a13c@it.aoyama.ac.jp> On 2018/01/30 16:18, Philippe Verdy via Unicode wrote: > - Adding Y to the list of allowed letters after the dieresis deadkey to > produce "?" : the most frequent case is L'HA?E-L?S-ROSES, the official name > of a French municipality when written with full capitalisation, almost all > spell checkers often forget to correct capitalized names such as this one. Wikipedia has this as L'Ha?-les-Roses (see https://fr.wikipedia.org/wiki/L'Ha?-les-Roses). It surely would be L'HA?-LES-ROSES, and not L'HA?E-L?S-ROSES, when capitalized. I of course know of the phenomenon that in French, sometimes the accents on upper-case letters are left out, but I haven't heard of a reverse phenomenon yet. Regards, Martin. From unicode at unicode.org Tue Jan 30 04:20:46 2018 From: unicode at unicode.org (Alastair Houghton via Unicode) Date: Tue, 30 Jan 2018 10:20:46 +0000 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <412077223.147.1517290294530.JavaMail.www@wwinf1d20> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> Message-ID: <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net> On 30 Jan 2018, at 05:31, Marcel Schneider via Unicode wrote: > > OnMon, 29 Jan 2018 11:13:21 -0700, Tom Gewecke wrote: >> >>> On Jan 29, 2018, at 4:26 AM, Marcel Schneider via Unicode wrote: >>> >>> >>> the Windows US-Intl >>> does not allow to write French in a usable manner, as the ?? is still >>> missing, and does not allow to type German correctly neither due to >>> the lack of single angle quotation marks (used in some French locales, >>> too, and perhaps likely to become even more widespread). Of course >>> these are all on the macOS US-Extended. >> >> They are also all on the MacOS "US International PC", provided since 2009 by Apple >> for Windows users who like US International. > > I suppose that this layout ships with the Windows emulation that can be run on a Mac. No. It?s included as standard with the macOS itself. Go to System Preferences, choose ?Keyboard?, then ?Input Sources?. Click the ?+? button at the bottom left, then enter ?PC? in the search field and you?ll see there are a range of ?PC? layouts. >> ? ? are on alt and alt-shift q >> >> ?? are on alt-shift 3/4 More of a nitpick than anything, but Apple keyboards have *Option*, not ?alt?. Yes, some (but not all) keyboards? Option keys have an ?alt? annotation at the top, but that was added AFAIK for the benefit of people running PC emulation (or these days, Windows under e.g. VMWare Fusion). The ?alt? annotation isn?t on the latest keyboards (go look in an Apple Store if you don?t believe me :-)). > Then this is ported from the Apple US layout, where these characters are in the same > places. However that does not include correct spacing, as required for French. Not sure what you mean about spacing. That, surely, is a matter mainly for the software you?re using, rather than for a keyboard layout? 
>> (US Extended has also been renamed ABC Extended back in 2015) > > Presumably because it is interesting for many locales worldwide accustomed to the > US QWERTY layout. That tends to prove that Mac users accept changes, while > Windows users refuse changes. However I fail to understand such a discrepancy. I don?t think it?s the users. I think, rather, that Apple is (or has been) prepared to make radical changes, even at the expense of backwards compatibility and even where it knows there will be short term pain from users complaining about them, where Microsoft is more conservative. This pattern exists across the board at the two companies; the Windows API hasn?t changed all that much since Windows NT 4/95, whereas Apple has basically thrown away all the work it did up to Mac OS 9 and is a lot more aggressive about deprecating and removing functionality even in Mac OS X/macOS than Microsoft ever was. This is exemplified, actually, by the length of time Microsoft keeps backwards compatibility layers, versus the length of time Apple does so. The WoW subsystem is (I think) still part of the 32-bit builds of Windows, so they can still run Windows 3.1 software, DOS software and so on (i.e. software back to the 1980s). Apple, on the other hand, dropped support for ?Classic? Mac apps back in 10.4 and has never supported running PowerPC classic apps on any Intel machine. Indeed, six years ago now, in Mac OS X 10.7, Apple dropped support for running PowerPC apps built for Mac OS X, which basically means that software Mac users bought to run on their older PowerPC-based Macs is now not usable on new machines. Kind regards, Alastair. -- http://alastairs-place.net From unicode at unicode.org Tue Jan 30 07:15:16 2018 From: unicode at unicode.org (Eric Muller via Unicode) Date: Tue, 30 Jan 2018 05:15:16 -0800 Subject: Keyboard layouts and CLDR In-Reply-To: <295ef0d0-c585-632c-09fb-05dc25d7a13c@it.aoyama.ac.jp> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <295ef0d0-c585-632c-09fb-05dc25d7a13c@it.aoyama.ac.jp> Message-ID: Indeed. But "Fa?-l?s-Nemours" / "FA?-L?S-NEMOURS". "l?s" in French place names means "near", typically followed by another city name or a river name. In the case of "L'Ha?-les-Roses", it's just that they have a famous rose garden, so "les". Eric. On 1/30/2018 12:06 AM, Martin J. D?rst via Unicode wrote: > On 2018/01/30 16:18, Philippe Verdy via Unicode wrote: > >> ? - Adding Y to the list of allowed letters after the dieresis >> deadkey to >> produce "?" : the most frequent case is L'HA?E-L?S-ROSES, the >> official name >> of a French municipality when written with full capitalisation, >> almost all >> spell checkers often forget to correct capitalized names such as this >> one. > > Wikipedia has this as L'Ha?-les-Roses (see > https://fr.wikipedia.org/wiki/L'Ha?-les-Roses). It surely would be > L'HA?-LES-ROSES, and not L'HA?E-L?S-ROSES, when capitalized. I of > course know of the phenomenon that in French, sometimes the accents on > upper-case letters are left out, but I haven't heard of a reverse > phenomenon yet. > > Regards,?? Martin. > From unicode at unicode.org Tue Jan 30 09:54:19 2018 From: unicode at unicode.org (Tom Gewecke via Unicode) Date: Tue, 30 Jan 2018 08:54:19 -0700 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) 
In-Reply-To: <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net>
References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell>
 <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23>
 <412077223.147.1517290294530.JavaMail.www@wwinf1d20>
 <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net>
Message-ID: 

> On Jan 30, 2018, at 3:20 AM, Alastair Houghton wrote:
>
> The “alt” annotation isn’t on the latest keyboards (go look in an Apple Store if you don’t believe me :-)).

Interesting! Apple’s documentation shows these keys mostly with “alt” and “⌥”.

https://support.apple.com/en-us/HT201794

From unicode at unicode.org  Tue Jan 30 11:55:46 2018
From: unicode at unicode.org (Marcel Schneider via Unicode)
Date: Tue, 30 Jan 2018 18:55:46 +0100 (CET)
Subject: Keyboard layouts and CLDR
In-Reply-To: 
References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell>
 <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23>
 <412077223.147.1517290294530.JavaMail.www@wwinf1d20>
 <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net>
Message-ID: <789250525.14033.1517334947091.JavaMail.www@wwinf1j02>

On Tue, 30 Jan 2018 08:54:19 -0700, Tom Gewecke wrote:
>
> > On Jan 30, 2018, at 3:20 AM, Alastair Houghton wrote:
> >
> > The “alt” annotation isn’t on the latest keyboards (go look in an Apple Store if you don’t believe me :-)).
>
> Interesting! Apple’s documentation shows these keys mostly with “alt” and “⌥”.
>
> https://support.apple.com/en-us/HT201794

While the “⌥” symbol is consistent across locales, the “alt” label is in some places replaced with “option”, and I believed that this was the macOS name, whereas “alt” was merely there for Boot Camp users. However, that is confusing, as the Windows “Alt” key does not have the “option” functionality but rather provides menu access, as reflected by its internal name “MENU” (“LMENU”, “RMENU”), while “option” corresponds to “AltGr”, since it likewise gives access to alternate graphics. But now that we need a “Numbers” modifier, neither scheme seems appropriate: Left Option should be Numbers, and Alt should become Numbers too, while its menu function could be mapped to Left Windows or so. See:

https://unicode.org/cldr/trac/ticket/10851#comment:2

Regards,

Marcel

From unicode at unicode.org  Tue Jan 30 12:34:40 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 30 Jan 2018 11:34:40 -0700
Subject: Keyboard layouts and CLDR
Message-ID: <20180130113440.665a7a7059d7ee80bb4d670165c8327d.552d8d7f7d.wbe@email03.godaddy.com>

Marcel Schneider wrote:

> That tends to prove that Mac users accept changes, while Windows users
> refuse changes.

I was going to say that was a gross over-generalization, but that didn't
adequately express how gross it was. It's just plain wrong. Pardon my
bluntness.

How about: Windows is often used in the workplace, where users may not
have the freedom or motivation to make their own changes and be
different from other users, while Macs are often used by individuals who
do. That's an over-generalization too, but not quite at the level of
"Windows users refuse changes."

Alastair Houghton replied:

> I think, rather, that Apple is (or has been) prepared to make radical
> changes, even at the expense of backwards compatibility and even where
> it knows there will be short term pain from users complaining about
> them, where Microsoft is more conservative.

That too. Good point.
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 30 12:50:49 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 30 Jan 2018 11:50:49 -0700 Subject: Keyboard layouts and CLDR Message-ID: <20180130115049.665a7a7059d7ee80bb4d670165c8327d.6caabee144.wbe@email03.godaddy.com> Marcel Schneider wrote: >> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html > > Sadly the downloads are still unavailable (as formerly discussed). But > I saved in time, too (June 2015). Sorry, try this: http://vrici.lojban.org/~cowan/MobyLatinKeyboard.zip -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Jan 30 13:09:31 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 20:09:31 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net> Message-ID: <1047695230.16055.1517339371756.JavaMail.www@wwinf1j02> On Tue, 30 Jan 2018 10:20:46 +0000, Alastair Houghton wrote: > > On 30 Jan 2018, at 05:31, Marcel Schneider via Unicode wrote: > > > > OnMon, 29 Jan 2018 11:13:21 -0700, Tom Gewecke wrote: > >> [?] > >> > >> They are also all on the MacOS "US International PC", provided since 2009 by Apple > >> for Windows users who like US International. > > > > I suppose that this layout ships with the Windows emulation that can be run on a Mac. > > No. It?s included as standard with the macOS itself. Go to System Preferences, choose ?Keyboard?, then ?Input Sources?. > Click the ?+? button at the bottom left, then enter ?PC? in the search field and you?ll see there are a range of ?PC? layouts. Indeed. It?s sort of a mix made of Apple?s common US and Windows? US-International. So it has the five Windows-style dead keys in Base and Shift, AND the five Mac-style dead keys (for the same diacritics) in Option. > > >> ? ? are on alt and alt-shift q > >> > >> ?? are on alt-shift 3/4 > > More of a nitpick than anything, but Apple keyboards have *Option*, not ?alt?. Yes, some (but not all) keyboards? Option keys have an ?alt? > annotation at the top, but that was added AFAIK for the benefit of people running PC emulation (or these days, Windows under e.g. VMWare > Fusion). The ?alt? annotation isn?t on the latest keyboards (go look in an Apple Store if you don?t believe me :-)). Then, ?alt? is obsoleted on Mac, and calling them ?Option? is correct? I?m relieved if so, as I used ?Option? when referring to macOS, or better, ?AltGr/Option? to be cross-platform? ?Option? is shorter, but ?AltGr? is already printed on most keyboards, though it isn?t a short form of an easily localizable term, while ?Option? is multi-locale (English, French, German, ?). > > > Then this is ported from the Apple US layout, where these characters are in the same > > places. However that does not include correct spacing, as required for French. > > Not sure what you mean about spacing. That, surely, is a matter mainly for the software you?re using, rather than for a keyboard layout? It may be handled by an input editing functionality as embedded in Word, like many things can be done by input editing, even ??? 
(for which Word has also a shortcut: Ctrl+&, o) because Microsoft and Bill Gates in person were eager to support the French locale and did a lot to help French efforts in keyboarding. But properly, correct spacing must be handled on keyboard level, otherwise we?ll always end up with a mass of wrong data amidst which a subset of correct documents having U+202F NARROW NO-BREAK SPACE before ??? and ?!? and ?;?, and even ??? and after ???, and currently also before ?:? as Philippe Verdy wrote to this List on Fri, 26 Jun 2015 22:16:48 +0200: http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0220.html ?Les ?diteurs de presse et de livres en France utilisent tous des fines de chasse fixe dans leurs moteurs de composition? (?Print media and book publishers in France all use fixed-width narrows in their typesetting engines?) Unicode did not support interoperable French typesetting until it encoded U+202F for Mongolian, in 1999, six years after v1.1 of the Standard. Making U+2008 PUNCTUATION SPACE a no-break space would have been well done. This was seemingly encoded for hot-metal typesetting of tables, like U+2012 FIGURE DASH and U+2007 FIGURE SPACE that is non-breakable. While U+2007 can be considered a fixed-width counterpart of U+00A0, and U+2012 could have been a longer variant of U+2212 to denote intervals *if only* it had been specified as such, U+2008 could have been the proper representation of French punctuation spacing, instead of ending up as a completely useless character, depriving the French locale of Unicode support. See the feedback items about these topics that have been posted so far: http://www.unicode.org/L2/L2018/18009-pubrev.html#Error_Reports > > >> (US Extended has also been renamed ABC Extended back in 2015) > > > > Presumably because it is interesting for many locales worldwide accustomed to the > > US QWERTY layout. That tends to prove that Mac users accept changes, while > > Windows users refuse changes. However I fail to understand such a discrepancy. > > I don?t think it?s the users. > > I think, rather, that Apple is (or has been) prepared to make radical changes, even at the expense of backwards compatibility and even where it > knows there will be short term pain from users complaining about them, where Microsoft is more conservative. This pattern exists across the > board at the two companies; the Windows API hasn?t changed all that much since Windows NT 4/95, whereas Apple has basically thrown away > all the work it did up to Mac OS 9 and is a lot more aggressive about deprecating and removing functionality even in Mac OS X/macOS than > Microsoft ever was. > > This is exemplified, actually, by the length of time Microsoft keeps backwards compatibility layers, versus the length of time Apple does so. > The WoW subsystem is (I think) still part of the 32-bit builds of Windows, so they can still run Windows 3.1 software, DOS software and so on > (i.e. software back to the 1980s). Apple, on the other hand, dropped support for ?Classic? Mac apps back in 10.4 and has never supported > running PowerPC classic apps on any Intel machine. Indeed, six years ago now, in Mac OS X 10.7, Apple dropped support for running PowerPC > apps built for Mac OS X, which basically means that software Mac users bought to run on their older PowerPC-based Macs is now not usable on > new machines. Ah, good to know. Apple?s (and some other companies) strategy is currently nicknamed ?programmed obsolescence.? 
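Coming back briefly to the punctuation spacing discussed a few paragraphs above: the substitution in question, an ordinary space or NO-BREAK SPACE before the high punctuation marks replaced by U+202F NARROW NO-BREAK SPACE, is mechanical enough to sketch. The rules below are deliberately simplified (French typography distinguishes more cases, and the treatment of the colon varies), they only touch spaces that are already present, and they are an illustration rather than a description of what any word processor actually does:

    import re

    NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

    def narrow_french_spaces(text):
        # Replace an existing SPACE or NBSP before ; ! ? and the closing guillemet,
        # and after the opening guillemet, by U+202F. No handling of URLs or code.
        text = re.sub(r"[ \u00A0]+([;!?\u00BB])", NNBSP + r"\1", text)
        text = re.sub(r"(\u00AB)[ \u00A0]+", r"\1" + NNBSP, text)
        return text

    print(narrow_french_spaces("« Voulez-vous du café ? »"))
    # same sentence with U+202F in place of the ordinary spaces around « ? »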
Bing?s top search result for that keyword is this BBC article: http://www.bbc.com/future/story/20160612-heres-the-truth-about-the-planned-obsolescence-of-tech Based on your report, I think that Apple push wealthy people to use always the best of tech, whereas Microsoft help poor people alike not to discard well-functioning software. Regards, Marcel From unicode at unicode.org Tue Jan 30 13:42:02 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 20:42:02 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: <20180130113440.665a7a7059d7ee80bb4d670165c8327d.552d8d7f7d.wbe@email03.godaddy.com> References: <20180130113440.665a7a7059d7ee80bb4d670165c8327d.552d8d7f7d.wbe@email03.godaddy.com> Message-ID: <1315610464.16867.1517341322817.JavaMail.www@wwinf1j02> On Tue, 30 Jan 2018 11:34:40 -0700, Doug Ewell via Unicode wrote: > > Marcel Schneider wrote: > > > That tends to prove that Mac users accept changes, while Windows users > > refuse changes. > > I was going to say that was a gross over-generalization, but that didn't > adequately express how gross it was. It's just plain wrong. Pardon my > bluntness. > > How about: Windows is often used in the workplace, where users may not > have the freedom or motivation to make their own changes and be > different from other users, while Macs are often used by individuals who > do. That's an over-generalization too, but not quite at the level of > "Windows users refuse changes." I?m relieved to be wrong, and that ?such a discrepancy? that ?I fail[ed] to understand? doesn?t exist. I know a company that prescribes and delivers Apple hardware to all its affiliates. Second-hand retailers offer very few Apple machines while they have plenty of PC computers, whose turnover at business customers is two years. Apple computers are not replaced every two years in workplaces. That may be a reason why Apple Inc. takes steps to get them replaced nevertheless, as Alastair?s report you quoted [and I already answered] might suggest (but stop, no more over-interpretation!). > > Alastair Houghton replied: > > > I think, rather, that Apple is (or has been) prepared to make radical > > changes, even at the expense of backwards compatibility and even where > > it knows there will be short term pain from users complaining about > > them, where Microsoft is more conservative. > > That too. Good point. Very good point. Regards, Marcel From unicode at unicode.org Tue Jan 30 14:06:06 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 21:06:06 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: <20180130115049.665a7a7059d7ee80bb4d670165c8327d.6caabee144.wbe@email03.godaddy.com> References: <20180130115049.665a7a7059d7ee80bb4d670165c8327d.6caabee144.wbe@email03.godaddy.com> Message-ID: <1815406746.17350.1517342767472.JavaMail.www@wwinf1j02> On Tue, 30 Jan 2018 11:50:49 -0700, Doug Ewell wrote: > > Marcel Schneider wrote: > > > > http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html > > > > Sadly the downloads are still unavailable (as formerly discussed). But > > I saved in time, too (June 2015). > > Sorry, try this: > > http://vrici.lojban.org/~cowan/MobyLatinKeyboard.zip Thank you! 
I?ve gone through John Cowan?s Home Page, too: http://vrici.lojban.org/~cowan/ Regards, Marcel From unicode at unicode.org Tue Jan 30 15:24:23 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Tue, 30 Jan 2018 22:24:23 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> Message-ID: <397820388.18715.1517347464472.JavaMail.www@wwinf1j02> On Tue, 30 Jan 2018 08:18:49 +0100, Philippe Verdy wrote: > > I have always wondered why Microsoft did not push itself at least the five > simple additions needed since long in French for the French AZERTY LAYOUT: Many people in F?an?? are wondering, but it is primarily a matter of honoring a country?s policies and not interfering with official work. France is expected to fix itself its keyboarding problems and publish a standard, and that?s what is actually happening. See Shawn Steele?s blog post about Locale Data in Windows 10 & CLDR: https://blogs.msdn.microsoft.com/shawnste/2015/08/29/locale-data-in-windows-10-cldr/ > - [AltGr]+[?] to produce the cedilla dead key (needed only before capital C in French) : > this is frequently needed, the alternative would be [AltGr]+[C] to map "?" without the dead key; That would be easier than Alt+0199. But Alt+something should yield consistently either uppercase or lowercase. Then, especially in the United States, not having all uppercase letters accessed with Shift+lowercase is considered counter-intuitive. And when the lowercase letter is directly accessed, going through a dead key to get its uppercase is not something I would recommend. Therefore, all uppercase that are used as initials (not ?) should be Shift+lowercase, and digits in AltGr like on a few Latin layouts shipping with Windows, plus a Programmer toggle described in: https://unicode.org/cldr/trac/ticket/10851#comment:2 > spell checkers forget the frequent words:??a or ?'. I never use spell checkers, and when they show up with red wavy underline, I quickly try to disable them (outside of Gooogle Search). That is why I have typos. [This one has occurred unintentionally.] > > - [AltGr]+[1&] to produce the acute accent dead key (similar to [AltGr+7?`] giving the grave accent deadkey) : > this is the most frequent missing letter we need all the time. Therefore, the ? should be mapped to a live key. But the acute dead key is really the missing one. Belgium?s AZERTY has it. Getting ? at least by dead key would have divided our trouble by half. > > - [AltGr]+[O] to produce "?" (without ShiftLock or CapsLock mode enabled), > or "?" (in ShiftLock or CapsLock mode), and >?[AltGr]+[Shift]+[O] to produce "?" (independantly of [ShiftLock] which is disabled by [Shift], but without [CapsLock]) > or "?" (independantly of [CapsLock], but without [ShiftLock]) : > this is needed occasionnaly for very few common words, the most frequent omission?is "?uf" or?its plural "?ufs". To repay the ?? for its exclusion from Latin-1 (due to a Frenchman), it should be granted two key positions in the Base and Shift shift states, amidst the upper row letters. > > - [AltGr]+[A] to produce "?" (without ShiftLock or CapsLock mode enabled), > or "?" (in ShiftLock or CapsLock mode), and >?[AltGr]+[Shift]+[O] to produce "?" (independantly of [ShiftLock] which is disabled by [Shift], but without [CapsLock]) > or "?" 
(independantly of [CapsLock], but without [ShiftLock]) : > this is rarely needed, except for a few words borrowed from Latin used in biology or some legal/judiciary terminology. And one spelling of _L?titia_. > > - Adding Y to the list of allowed letters after the dieresis deadkey to produce "?" : > the most frequent case is L'HA?E-L?S-ROSES, the official name of a French municipality when written with full capitalisation, > almost all spell checkers often forget to correct capitalized names such as this one. That?s really something I never understood neither. Why that deadlist was not updated. Maybe like above: If Microsoft had updated our layout with '?', we could have wondered why they didn?t add the other missing stuff while they were on it. > > This would allow typing French completely including on initial capitals. > All other French capital letters can be typed (????? with the circumflex dead key, > ???? with the dieresis dead key which already allows ?? not needed for French but for Alsatian or some names borrowed from German). > > But we have mappings already in the AZERTY layout for: >?- the tilde as a dead key on [AltGr]+[2?~], even if it is not used for French but only for "?" or "?" in names from Spanish or Breton, That didn?t prevent Breton authorities from refusing it in a first name, Denis Jacquerye reported in the wake of the Kazakh apostrophe thread: http://unicode.org/mail-arch/unicode-ml/y2018-m01/0133.html > " ??"?not needed at all, /??/ needed only for standard French IPA?phonetics where we still can't type /?????/ for French phonetics >?- the grave accent as a dead key on [AltGr]+[7?`], needed for "??" but allowing also "???" not used at all in French. > > There's not any good rationale in the French AZERTY layout to keep it incomplete on capitals > while maintaining other capital letters with diacritics composed with dead keys but not needed at all in French, > except the case of?"???" missing from ISO 8859-1 but present in Windows-1252. There is even a way of putting all into the existing dead keys, if ??circumflex accent?? (that is our directly accessed dead key) followed by any diacriticized letter did yield its uppercase, and followed by b or q, yield ? or ?? But that isn?t what one would call a properly designed keyboard layout. > ---- > Using the Windows "Charmap" accessory with the "Unicode" charset and "Latin" subset is still too difficult to locate the missing letters, > as it is only sorted by code point value but still does not cover all Latin letters; > the?Windows "Charmap" tool?is usable for French only when selecting the Windows-1252 charset (aka "Windows : Occidental"). > > But I don't understand why this accessory cannot simply add some rows at top of the table for the current language selected > on the "Languages Bar", or why it does not simply features the complete alphabet of the current language, sorted correctly > according to CLDR rules for that language (not sorted randomly by code point value) to make it really usable. If we select > another subset, it should also be sortable according to language rules (or CLDR default root otherwise) and not according to code point value: > this could be a simple checkbox or a pair of radio buttons (binary sort, or alphabetic sort).? > > Finally, the Charmap tool should be updated to add missing characters that are not covered in the "Unicode" charset selection, > even if they are encoded in Unicode and really mapped in fonts: the coverage of proposed "subsets" is an extremely old version of Unicode. 
I see that as a very valuable feature request. And this one doesn?t need to wait for any national standard to get implemented. Regards, Marcel From unicode at unicode.org Tue Jan 30 17:30:59 2018 From: unicode at unicode.org (David Starner via Unicode) Date: Tue, 30 Jan 2018 23:30:59 +0000 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <7A978859-4C9E-41B2-A291-713D7DE5E002@alastairs-place.net> Message-ID: On Tue, Jan 30, 2018 at 2:23 AM Alastair Houghton via Unicode < unicode at unicode.org> wrote: > This pattern exists across the board at the two companies; the Windows API > hasn?t changed all that much since Windows NT 4/95, whereas Apple has > basically thrown away all the work it did up to Mac OS 9 and is a lot more > aggressive about deprecating and removing functionality even in Mac OS > X/macOS than Microsoft ever was. > I'm not really clear on all the Windows details, as a long time Linux programmer, but Mac OS X (2001) was 16 years ago and Windows 95 (1995) is 22, so not much difference even taking your numbers. The .NET framework debuted in 2002, and the Universal Windows Platform debuted with Windows 8 in 2012, so Microsoft has made some pretty large changes since NT 4. They do seem to more focused on keeping backwards compatibility layers, but it's not that they've been not "prepared to make radical changes". -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Jan 30 22:39:00 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Jan 2018 05:39:00 +0100 (CET) Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) Message-ID: <623361518.54.1517373540785.JavaMail.www@wwinf1g10> On Tue, 30 Jan 2018 23:30:59 +0000, David Starner via Unicode wrote: > > On Tue, Jan 30, 2018 at 2:23 AM Alastair Houghton via Unicode wrote: > > > This pattern exists across the board at the two companies; the Windows API hasn?t changed all that much > > since Windows NT 4/95, whereas Apple has basically thrown away all the work it did up to Mac OS 9 and is > > a lot more aggressive about deprecating and removing functionality even in Mac OS X/macOS than Microsoft > > ever was. > > I'm not really clear on all the Windows details, as a long time Linux programmer, but Mac OS X (2001) was > 16 years ago and Windows 95 (1995) is 22, so not much difference even taking your numbers. The .NET framework > debuted in 2002, and the Universal Windows Platform debuted with Windows 8 in 2012, so Microsoft has made some > pretty large changes since NT 4. They do seem to more focused on keeping backwards compatibility layers, but it's > not that they've been not "prepared to make radical changes". I don?t think that Alastair?s point was about Microsoft not being innovative. They simply allow old software to be used on new machines and Windows versions, something that Apple reportedly does not on macOS. However, as of Unicode support by keyboard layouts, the current advice is that adding new functionalities to Windows 10 would keep them out of reach for users of older versions of Windows, and new keyboard layouts relying on them would be truncated for a still huge part of the users. 
That is surely part of the ?significant, perhaps insurmountable headwinds? faced by ?making significant changes to user32.dll?, Andrew Glass warned in 2015: http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0042.html Perhaps there could be a way to update older frameworks via Windows Update, making an end with ?limitation[s] of the Windows USER keyboard architecture? that Michael Kaplan pointed in response to Karl Pentzlin, January 2010: http://www.unicode.org/mail-arch/unicode-ml/y2010-m01/0030.html Several discussions, in the past years, stated that we should be able to input combining sequences using dead keys, a feature supported by macOS and Linux natively, while Windows does not come along with that kind of support, although this is recommended by TUS: ?It is straightforward to adapt such a system to emit combining character sequences or precomposed characters as needed.? (5.12, p. 222) http://www.unicode.org/versions/Unicode10.0.0/ch05.pdf#G1076 In a 2015 discussion we/I also learned that Tavultesoft Keyman, now SIL, provides all these features and is cross-platform, and had a free offer since long, now even including the Developer tooling thanks to backing by SIL. I?m advertising this software to advocate Microsoft?s presumed position: For all that goes beyond legacy support, we can rely on Keyman. As a consequence, layouts that are to be shipped with Windows, such as the new French, must stick with Windows resources (input editors are excluded by spec), whereas minority languages needing extended functionalities can promote additional software for support, as far as they are not expected to be supported out-of-the-box. Hopefully we can now expect, by contrast, that Apple will add the missing toggle, internally VK_KANA on Windows (Linux is reported to have it, too), to make it available on macOS as well. Regards, Marcel From unicode at unicode.org Wed Jan 31 11:25:22 2018 From: unicode at unicode.org (John H. Jenkins via Unicode) Date: Wed, 31 Jan 2018 10:25:22 -0700 Subject: Support for Extension F In-Reply-To: <67684ce4dc9661d1d64c5e7aa730d8a9@koremail.com> References: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> <67684ce4dc9661d1d64c5e7aa730d8a9@koremail.com> Message-ID: <8354C81D-0CFE-480D-9FC3-A22736040659@apple.com> macOS (and iOS, for that matter) fully support Extension F provided fonts are availble. I'm not aware of any work that Apple has done to its fonts for Extension F support. Indeed, I'm not aware of any publically available fonts for Extension F but would gladly install one myself if it's available. > On Jan 29, 2018, at 10:26 PM, via Unicode wrote: > > > Dear All, > > As many of you are aware getting characters encoded is only half the battle, enabling people to use them is the other half. > > CJK Extenion F was added last year in Unicode 10. I have come across a number of people saying they are having problems with Ext F. I was wondering what the current support is for Ext F at OS level and in terms of fonts. > > Regards > John Knightley From unicode at unicode.org Wed Jan 31 11:51:04 2018 From: unicode at unicode.org (Tom Gewecke via Unicode) Date: Wed, 31 Jan 2018 10:51:04 -0700 Subject: Support for Extension F In-Reply-To: <8354C81D-0CFE-480D-9FC3-A22736040659@apple.com> References: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> <67684ce4dc9661d1d64c5e7aa730d8a9@koremail.com> <8354C81D-0CFE-480D-9FC3-A22736040659@apple.com> Message-ID: > On Jan 31, 2018, at 10:25 AM, John H. 
Jenkins via Unicode wrote: > > I'm not aware of any publically available fonts for Extension F but would gladly install one myself if it's available. > There may be something here: https://chinese.stackexchange.com/questions/24210/how-to-display-cjk-extension-f From unicode at unicode.org Wed Jan 31 12:05:17 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 31 Jan 2018 19:05:17 +0100 Subject: Keyboard layouts and CLDR In-Reply-To: <397820388.18715.1517347464472.JavaMail.www@wwinf1j02> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <397820388.18715.1517347464472.JavaMail.www@wwinf1j02> Message-ID: Another idea: you can already have multiple layouts loaded for the same language : For French, nothing prohibits to have a "technical/programmer layout", favoring input of ASCII, a "bibliographic/typographical" one with improved characters (e.g. the correct curly apostrophe); the technical/programmer layout has the spell-checker disabled by default, while the other has a spell checker enabled by default: whever to activate the spell checker or not will depend on software where it is enable, but it will switch automatically it on or off according to the state defined by changing the layout for the same language. Switching from one layout to another is easy with the Language bar, this means that even if you keep the first layout unchanged to match the national standard, additional layouts can be tuned specifically. For French fortunately there are two ISO 639-2 codes "fra" and "fre" (technical and bibliographic) which allows also defining the code "FRA" or "FRE" to display in the language bar (users should be able to tune the abbreviation or icon or emoji displayed in the language bar when they switch languages or input layouts, even if there are defaults, and Windows can infer a non-conflicting visual identification by adding a digit to the ISO 639-2 or -3 code (which would be the same as the layout ordering number in the list of loaded layouts, and used in shortcuts like CTRL+ALT+F1...F9). 2018-01-30 22:24 GMT+01:00 Marcel Schneider : > On Tue, 30 Jan 2018 08:18:49 +0100, Philippe Verdy wrote: > > > > I have always wondered why Microsoft did not push itself at least the > five > > simple additions needed since long in French for the French AZERTY > LAYOUT: > > Many people in F?an?? are wondering, but it is primarily a matter of > honoring > a country?s policies and not interfering with official work. France is > expected to > fix itself its keyboarding problems and publish a standard, and that?s > what is > actually happening. See Shawn Steele?s blog post about Locale Data in > Windows 10 & CLDR: > > https://blogs.msdn.microsoft.com/shawnste/2015/08/29/ > locale-data-in-windows-10-cldr/ > > > - [AltGr]+[?] to produce the cedilla dead key (needed only before > capital C in French) : > > this is frequently needed, the alternative would be [AltGr]+[C] to map > "?" without the dead key; > > That would be easier than Alt+0199. But Alt+something should yield > consistently either uppercase > or lowercase. Then, especially in the United States, not having all > uppercase letters accessed with > Shift+lowercase is considered counter-intuitive. And when the lowercase > letter is directly accessed, > going through a dead key to get its uppercase is not something I would > recommend. Therefore, > all uppercase that are used as initials (not ?) 
should be Shift+lowercase, > and digits in AltGr like on > a few Latin layouts shipping with Windows, plus a Programmer toggle > described in: > > https://unicode.org/cldr/trac/ticket/10851#comment:2 > > > spell checkers forget the frequent words: ?a or ?'. > > I never use spell checkers, and when they show up with red wavy underline, > I quickly try to disable > them (outside of Gooogle Search). That is why I have typos. [This one has > occurred unintentionally.] > > > > > - [AltGr]+[1&] to produce the acute accent dead key (similar to > [AltGr+7?`] giving the grave accent deadkey) : > > this is the most frequent missing letter we need all the time. > > Therefore, the ? should be mapped to a live key. But the acute dead key is > really the missing one. > Belgium?s AZERTY has it. Getting ? at least by dead key would have divided > our trouble by half. > > > > > - [AltGr]+[O] to produce "?" (without ShiftLock or CapsLock mode > enabled), > > or "?" (in ShiftLock or CapsLock mode), and > > [AltGr]+[Shift]+[O] to produce "?" (independantly of [ShiftLock] which > is disabled by [Shift], but without [CapsLock]) > > or "?" (independantly of [CapsLock], but without [ShiftLock]) : > > this is needed occasionnaly for very few common words, the most frequent > omission is "?uf" or its plural "?ufs". > > To repay the ?? for its exclusion from Latin-1 (due to a Frenchman), it > should be granted > two key positions in the Base and Shift shift states, amidst the upper row > letters. > > > > > - [AltGr]+[A] to produce "?" (without ShiftLock or CapsLock mode > enabled), > > or "?" (in ShiftLock or CapsLock mode), and > > [AltGr]+[Shift]+[O] to produce "?" (independantly of [ShiftLock] which > is disabled by [Shift], but without [CapsLock]) > > or "?" (independantly of [CapsLock], but without [ShiftLock]) : > > this is rarely needed, except for a few words borrowed from Latin used > in biology or some legal/judiciary terminology. > > And one spelling of _L?titia_. > > > > > - Adding Y to the list of allowed letters after the dieresis deadkey to > produce "?" : > > the most frequent case is L'HA?E-L?S-ROSES, the official name of a > French municipality when written with full capitalisation, > > almost all spell checkers often forget to correct capitalized names such > as this one. > > That?s really something I never understood neither. Why that deadlist was > not updated. > Maybe like above: If Microsoft had updated our layout with '?', we could > have wondered > why they didn?t add the other missing stuff while they were on it. > > > > > This would allow typing French completely including on initial capitals. > > All other French capital letters can be typed (????? with the circumflex > dead key, > > ???? with the dieresis dead key which already allows ?? not needed for > French but for Alsatian or some names borrowed from German). > > > > But we have mappings already in the AZERTY layout for: > > - the tilde as a dead key on [AltGr]+[2?~], even if it is not used for > French but only for "?" or "?" in names from Spanish or Breton, > > That didn?t prevent Breton authorities from refusing it in a first name, > Denis Jacquerye reported in the wake of the Kazakh apostrophe thread: > > http://unicode.org/mail-arch/unicode-ml/y2018-m01/0133.html > > > " ??" not needed at all, /??/ needed only for standard French > IPA phonetics where we still can't type /?????/ for French phonetics > > - the grave accent as a dead key on [AltGr]+[7?`], needed for "??" but > allowing also "???" not used at all in French. 
> > > > There's not any good rationale in the French AZERTY layout to keep it > incomplete on capitals > > while maintaining other capital letters with diacritics composed with > dead keys but not needed at all in French, > > except the case of "???" missing from ISO 8859-1 but present in > Windows-1252. > > There is even a way of putting all into the existing dead keys, if > ??circumflex accent?? (that is our directly accessed > dead key) followed by any diacriticized letter did yield its uppercase, > and followed by b or q, yield ? or ?? > But that isn?t what one would call a properly designed keyboard layout. > > > > ---- > > Using the Windows "Charmap" accessory with the "Unicode" charset and > "Latin" subset is still too difficult to locate the missing letters, > > as it is only sorted by code point value but still does not cover all > Latin letters; > > the Windows "Charmap" tool is usable for French only when selecting the > Windows-1252 charset (aka "Windows : Occidental"). > > > > But I don't understand why this accessory cannot simply add some rows at > top of the table for the current language selected > > on the "Languages Bar", or why it does not simply features the complete > alphabet of the current language, sorted correctly > > according to CLDR rules for that language (not sorted randomly by code > point value) to make it really usable. If we select > > another subset, it should also be sortable according to language rules > (or CLDR default root otherwise) and not according to code point value: > > this could be a simple checkbox or a pair of radio buttons (binary sort, > or alphabetic sort). > > > > Finally, the Charmap tool should be updated to add missing characters > that are not covered in the "Unicode" charset selection, > > even if they are encoded in Unicode and really mapped in fonts: the > coverage of proposed "subsets" is an extremely old version of Unicode. > > I see that as a very valuable feature request. And this one doesn?t need > to wait for any national standard > to get implemented. > > Regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 31 12:26:51 2018 From: unicode at unicode.org (Paul Hackett via Unicode) Date: Wed, 31 Jan 2018 13:26:51 -0500 Subject: Support for Extension F In-Reply-To: References: <964123988.94.1517288950265.JavaMail.www@wwinf1d20> <67684ce4dc9661d1d64c5e7aa730d8a9@koremail.com> <8354C81D-0CFE-480D-9FC3-A22736040659@apple.com> Message-ID: > On Jan 31, 2018, at 12:51 PM, Tom Gewecke via Unicode wrote: > > There may be something here: > > https://chinese.stackexchange.com/questions/24210/how-to-display-cjk-extension-f That post claims "Hanazono hasn't been updated in a while and only supports up to Extension E" but if you visit the project page: http://fonts.jp/hanazono/ *it* claims full coverage of Ext. F ("U+2CEB0 .. U+2EBE0 Ext.F 7,473? ????") in ????B?HanaMinB.ttf?: ? CJK?????Ext.B?Ext.C?Ext.D?Ext.E?Ext.F? 
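For anyone who wants to check such a coverage claim directly, a font's character-to-glyph table can be inspected. The sketch below assumes the fontTools package is installed and that a copy of HanaMinB.ttf (the Hanazono font discussed above) has been downloaded locally; the assigned range U+2CEB0..U+2EBE0 contains exactly the 7,473 code points quoted:

    from fontTools.ttLib import TTFont

    EXT_F = range(0x2CEB0, 0x2EBE0 + 1)   # CJK Unified Ideographs Extension F (7,473 code points)

    font = TTFont("HanaMinB.ttf")         # path to a local copy of the font
    cmap = font["cmap"].getBestCmap()     # best available Unicode cmap subtable
    covered = sum(1 for cp in EXT_F if cp in cmap)
    print("Extension F code points mapped:", covered, "of", len(EXT_F))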
------------ From unicode at unicode.org Wed Jan 31 12:45:56 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 31 Jan 2018 19:45:56 +0100 Subject: Internationalised Computer Science Exercises In-Reply-To: <20180129205305.5d5d202d@JRWUBU2> References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> Message-ID: 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > On Mon, 29 Jan 2018 14:15:04 +0100 > > The case of u with diaeresis and macron is simpler: it has two > > combining characters of the same combining class and they don't > > commute, still the regexp to match it is something like: > > > > U [[:cc>0:]-[:cc=above:]]* [[:cc>0:]-[:cc=above:]]* > > [[:cc>0:]-[:cc=above:]]* > > was meant to be an example of a searched > string. For example, > contains, under canonical equivalence, the substring COMBINING DOT BELOW>. Your regular expressions would not detect this > relationship. My regular expression WILL detect this: scanning the text implies first composing it to "full equivalent decomposition form" (without even reordering it, and possibly recompose it to NFD) while reading it and bufering it in forward direction (it just requires the decomposition pairs from the UCD, including those that are "excluded" from NFC/NFD). The regexp exgine will then only process the "fully decomposed" input text to find matches, using the regexp transformed as above (which has some initial "complex" setup to "fully decompose" the initial regexp), but only once when constructing it, but not while processing the input text which can be then done straightforward with its full decomposition easily performed on the fly without any additional buffering except the very small lookahead whose length is never longer than the longest "canonical" decompositions in the UCD, i.e. at most 2 code points of look ahead). The automata is of course using the classic NFA used by regexp engine (and not the DFA which explodes combinatorially in all regexps), but which is still fully deterministic (the current "state" in the automata is not a single integer for the node number in the traversal graph, but a set of node numbers, and all regexps have a finite number of nodes in the traversal graph, this number being proportional to the length of the regexp, it does not need lot of memory, and the size of the current "state" is also fully bounded, never larger than the length of the regexp). Optimizing some contextual parts of the NFA to DFA is possible (to speedup the matching process and reduce the maximum storage size of the "current state") but only if it does not cause a growth of the total number of nodes in the traversal graph, or as long as this growth of the total number does not exceed some threshold e.g. not more than 2 or 3 times the regexp size). In practice, most regexps never exceed several hundreds of characters (including meta-characters of the regexp syntax itself), and the maximum number of active nodes in the graph traversal rarely exceeds 2 or 3, so the "current state" is not several hundreds integers, but a handful of integers, and een if you optimize the NFA partly to DFA, you can double or triple the number of nodes to significantly speedup the engine (in order to reduce the number of node numbers to store in the "current state"). 
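As an aside, the decomposition step described above is easy to see in miniature even without the automaton machinery: if the pattern and the text are normalized to a decomposed form in the same way, canonically equivalent spellings match each other. The sketch below is a drastic simplification of the approach (it shows only the normalization idea; as the reply further down points out, marks of different combining classes can also be reordered, which a naive decomposed pattern does not account for):

    import re
    import unicodedata

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    # Precomposed "é" and "e" + U+0301 are canonically equivalent; after NFD both
    # become e + U+0301, so the match no longer depends on the input form.
    text    = "caf\u00e9"        # precomposed
    pattern = nfd("cafe\u0301")  # decomposed the same way as the text
    print(bool(re.search(pattern, nfd(text))))   # True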
Some common examples of reduction of nodes in the traversal graph is to compute character classes, or the local expansion of "bounded non-empty repetitions" (like in the regexp /x{m,n}/ when m>=1 and n is small). -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 31 13:26:42 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 31 Jan 2018 20:26:42 +0100 Subject: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?) In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> Message-ID: > > Note the French "touch" keyboard layout is complete for French (provided > you select the one of the 3 new layouts with Emoji: it has the extra "key" > for selecting the input language in all 4 layouts) > > But the "full" (dockable) touch layout in French which emulates a physical > keyboard is still incomplete. > > This "full" layout is also incorrect because the top row has been > unexpectedly shifted to the right, in order to place the [Esc] key > (reducing the [Backspace] at end of row: this row should have been left > unchanged, placing the [Esc] key at end after the (reduced) Backspace key, > or just to the right of the [X] icon that closes the touch keyboard panel > (so that the [Backspace] keeps its size, or in order to place the [Del] key > at end of this top row) > > The placement of [Fn] to the bottom left corner is UI design error (and a > really bad decision taken by ISO): there should be only THREE keys to the > left of the [Space bar] (whose size is correct), using larger keys (1.5 x > 1.0 units) so that the left [Ctrl] remains in the bottom left corner. That > [Fn] key should better be to the right of the [Space bar]. > > The "Language" selector button should not be there in the layout (and > not in any one of the proposed layouts), it should be in the top bar, > beside the "layout/option" selector icon in the top-left corner opening a > popup menu. > > Removing the language selector key from the touch layout allows moving the > arrow keys (in the full layout) to the right, restoring the correct > position of the Right [Shift] key > > > The general appearance would then be as on this image at: > https://drive.google.com/file/d/12t_w7fZZ2RKJho_FW9CbVwgIS8B8WmzX/view?usp=sharing The [Fn] (virtual) key should also allow typing [PgUp], [PgDown], [Home], > [End] and [Insert] on existing cursor keys, as seen on the right part of > this image (I've cut most of the layout of the [Fn] key, where [Fn]+1 gives > [F1] for example)... > You'll note the keys resized more conventionnaly, [Fn] placed at right (after the too long Left Shift), the cedilla mapped in fact on [AltGr]+[,], and the language selector and [Esc] key moved to the title bar, [Del] added next to [Backspace] The second [Ctrl] is still there on the bottom right (before the cursor keys) but not really needed on the touch layout, and can safely be replaced by the [App/Menu] key. The top title bar should also be usable (in its current empty dark area) to place customizable characters or keystrokes, but it could also be prefilled with common characters used in the selected language... 
Also when that touch layout is displayed, pressing any key physical keyboard could reduce it to only this title bar which could remain on screen as an horizontal strip, where the prefered keystrokes are still clickable directly, instead of closing the panel completely (requiring then to click the small icon in the task bar to reopen it with the full layout only): this can apply to ALL touch layouts (not just the full layout). If the touch layout is displayed, it can be docked at the bottom of screen, or can be floatting, but when it is reduced to just its top title bar (because we are typing a key on the physical keyboard), it can combine with the reduced language bar (that you can place at top of the screen over the title bar of other applications...) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 31 16:44:05 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Wed, 31 Jan 2018 23:44:05 +0100 (CET) Subject: Keyboard layouts and CLDR In-Reply-To: References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <397820388.18715.1517347464472.JavaMail.www@wwinf1j02> Message-ID: <1021788982.31258.1517438645951.JavaMail.www@wwinf1p19> On Wed, 31 Jan 2018 19:05:17 +0100, Philippe Verdy wrote: > > Another idea: you can already have multiple layouts loaded for the same > language : For French, nothing prohibits to have a "technical/programmer > layout", favoring input of ASCII, a "bibliographic/typographical" one with > improved characters (e.g. the correct curly apostrophe); Czech, Polish and Romanian have Programmer layouts shipped with Windows 7. But more than one single and easy keypress to switch between is inefficient. And a French Programmer layout as I see it cannot be called French, and the Programmer mode is needed as an ALtGr Lock on the upper row digits for convenience. Nothing of all these requirements is met by proposing two distinct layouts. > the technical/programmer layout has the spell-checker disabled by default, > while the other has a spell checker enabled by default: whever to activate > the spell checker or not will depend on software where it is enable, but it > will switch automatically it on or off according to the state defined by > changing the layout for the same language. I don?t really see the point of spell-checking. Usually their libraries are so poor they don?t even make the equivalence between U+2019 and U+0027. For me it suffices to see wavy underlines in the Google search bar. (And fortunately I don?t do many searches a day with the Google search *bar.*) > > Switching from one layout to another is easy with the Language bar, this > means that even if you keep the first layout unchanged to match the > national standard, additional layouts can be tuned specifically. The Language bar is a good feature, but it has little to do with what I try to achieve with the Programmer toggle. It?s part of the layout like CapsLock on bicameral layouts. Imagine that you had to toggle between a lowercase layout and an uppercase layout, and you understand why switching back and forth between two layouts is unpractical, though many users must actually rely on it. Surely that impacts productivity, and therefore, all non-Latin scripts are sort of digitally disadvantaged. What we need is a real layout toggle on our keyboards. As most scripts are unicase, the CapsLock key is the best candidate. 
(The more as many Latin script users hate CapsLock.) And those locales that require typing in uppercase usually have also the ISO B00 key, where CapsLock can be mapped. (Too bad that US-QWERTY is lacking key B00.) > > For French fortunately there are two ISO 639-2 codes "fra" and "fre" > (technical and bibliographic) which allows also defining the code "FRA" or > "FRE" to display in the language bar (users should be able to tune the > abbreviation or icon or emoji displayed in the language bar when they > switch languages or input layouts, even if there are defaults, and Windows > can infer a non-conflicting visual identification by adding a digit to the > ISO 639-2 or -3 code (which would be the same as the layout ordering number > in the list of loaded layouts, and used in shortcuts like CTRL+ALT+F1...F9). Isn?t the fre/fra alternative linguistic only, like gre/ell? Otherwise, every non-ASCII language writing system should have two codes. And it isn?t as if tech writers shouldn?t use correct French orthography. Regards, Marcel From unicode at unicode.org Wed Jan 31 17:30:45 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 1 Feb 2018 00:30:45 +0100 Subject: Keyboard layouts and CLDR In-Reply-To: <1021788982.31258.1517438645951.JavaMail.www@wwinf1p19> References: <9EE03900F5F24F12855DEA697C3E2141@DougEwell> <1573635395.8306.1517225161705.JavaMail.www@wwinf1m23> <412077223.147.1517290294530.JavaMail.www@wwinf1d20> <397820388.18715.1517347464472.JavaMail.www@wwinf1j02> <1021788982.31258.1517438645951.JavaMail.www@wwinf1p19> Message-ID: The spell checker I was invoking was to allow fixing basic the typography (e.g. the ae and oe ligatures contextually, it does not have to be a full spell checker, but only concentrate on the typography, not the orthography, most transforms should be limited to one or two characters, so that it is a true "input mode", It could fix the leading capitals with accents : you type the normal accent and it gets capitalized for you; the apostrophe does not need a spell checker, the key produces directly the curly apostrophe U+2019 and not the vertical apostrophe from ASCII in the technical/programmer's mode) But multiple layouts available in one click allows setting a context for automatic transforms, or for typing more advanced character subsets. The two codes in ISO639-2 are just suggested for a visual distinction (in the language bar that displays "FRA" only for French, but does not display the layout currently used) instead of appending some digit (a layout number) to distinguish it. The language bar unfortunately does not clearly display the layout in current use, but a combination of a language+layout can have a visible code so that pressing a single key will make the change visible: the layout selector can really be used as an input mode selector which complements the existing mode keys (Shift, Ctrl, Alt, AltGr). Even the touch keyboard in Windows has several layouts builtin (including the Emoji selector, and some input modes where you can maintain a key pressed to have a choice of "related" characters: the layout is really different on screen, but I don't see why it cannot change also on the physical keyboard, and made visible correctly also on the full layout of the touch input panel, where key labels change dynamically according to the state of mode keys). 
Additionally each application has its own input mode builtin (own language and own layout), so the input mode in one app is not necessarily the same as another app on the same screen, depending on which one has the input focus. Even within the same application, you could have several input areas using different input modes, and depending on where you put the focus, the app can automatically memoize its input mode when we leave a text input field and restore it when we reenter it. The top bar of the touch panel is a precious area where we can give more visual info... 2018-01-31 23:44 GMT+01:00 Marcel Schneider : > On Wed, 31 Jan 2018 19:05:17 +0100, Philippe Verdy wrote: > > > > Another idea: you can already have multiple layouts loaded for the same > > language : For French, nothing prohibits to have a "technical/programmer > > layout", favoring input of ASCII, a "bibliographic/typographical" one > with > > improved characters (e.g. the correct curly apostrophe); > > Czech, Polish and Romanian have Programmer layouts shipped with Windows 7. > But more than one single and easy keypress to switch between is > inefficient. > And a French Programmer layout as I see it cannot be called French, and the > Programmer mode is needed as an ALtGr Lock on the upper row digits for > convenience. > Nothing of all these requirements is met by proposing two distinct layouts. > > > the technical/programmer layout has the spell-checker disabled by > default, > > while the other has a spell checker enabled by default: whever to > activate > > the spell checker or not will depend on software where it is enable, but > it > > will switch automatically it on or off according to the state defined by > > changing the layout for the same language. > > I don?t really see the point of spell-checking. Usually their libraries > are so poor > they don?t even make the equivalence between U+2019 and U+0027. For me it > suffices to see wavy underlines in the Google search bar. (And fortunately > I > don?t do many searches a day with the Google search *bar.*) > > > > > Switching from one layout to another is easy with the Language bar, this > > means that even if you keep the first layout unchanged to match the > > national standard, additional layouts can be tuned specifically. > > The Language bar is a good feature, but it has little to do with what I > try to achieve > with the Programmer toggle. It?s part of the layout like CapsLock on > bicameral layouts. > Imagine that you had to toggle between a lowercase layout and an uppercase > layout, > and you understand why switching back and forth between two layouts is > unpractical, > though many users must actually rely on it. Surely that impacts > productivity, and > therefore, all non-Latin scripts are sort of digitally disadvantaged. What > we need is a > real layout toggle on our keyboards. As most scripts are unicase, the > CapsLock key > is the best candidate. (The more as many Latin script users hate > CapsLock.) And > those locales that require typing in uppercase usually have also the ISO > B00 key, > where CapsLock can be mapped. (Too bad that US-QWERTY is lacking key B00.) 
> > > > > For French fortunately there are two ISO 639-2 codes "fra" and "fre" > > (technical and bibliographic) which allows also defining the code "FRA" > or > > "FRE" to display in the language bar (users should be able to tune the > > abbreviation or icon or emoji displayed in the language bar when they > > switch languages or input layouts, even if there are defaults, and > Windows > > can infer a non-conflicting visual identification by adding a digit to > the > > ISO 639-2 or -3 code (which would be the same as the layout ordering > number > > in the list of loaded layouts, and used in shortcuts like > CTRL+ALT+F1...F9). > > Isn?t the fre/fra alternative linguistic only, like gre/ell? > Otherwise, every non-ASCII language writing system should have two codes. > And it isn?t as if tech writers shouldn?t use correct French orthography. > > Regards, > > Marcel > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 31 17:50:51 2018 From: unicode at unicode.org (Sarasvati via Unicode) Date: Wed, 31 Jan 2018 17:50:51 -0600 Subject: CLDR Keyboard and Layout discussion Message-ID: <201801312350.w0VNopt6026613@sarasvati.unicode.org> Greetings and Happy New Year, The discussion of CLDR Keyboards and layout is getting lengthy and it should probably be moved to the CLDR-Users mail list where it is more appropriate. Especially because it is so technically detailed. Please see this page for instructions about how to subscribe: http://www.unicode.org/consortium/distlist-cldr-users.html Thank you for your attention, -- Sarasvati From unicode at unicode.org Wed Jan 31 19:38:58 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 1 Feb 2018 01:38:58 +0000 Subject: Internationalised Computer Science Exercises In-Reply-To: References: <20180122220855.7b929272@JRWUBU2> <20180128041230.26b34022@JRWUBU2> <20180128224456.2a93f2a1@JRWUBU2> <20180129085741.6fcf00f8@JRWUBU2> <20180129205305.5d5d202d@JRWUBU2> Message-ID: <20180201013858.383c7313@JRWUBU2> On Wed, 31 Jan 2018 19:45:56 +0100 Philippe Verdy via Unicode wrote: > 2018-01-29 21:53 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > > On Mon, 29 Jan 2018 14:15:04 +0100 > > was meant to be an example of a > > searched string. For example, > COMBINING DOT BELOW> contains, under canonical equivalence, the > > substring . Your regular > > expressions would not detect this relationship. > My regular expression WILL detect this: scanning the text implies > first composing it to "full equivalent decomposition form" (without > even reordering it, and possibly recompose it to NFD) while reading > it and bufering it in forward direction (it just requires the > decomposition pairs from the UCD, including those that are "excluded" > from NFC/NFD). No. To find , you constructed, on "Sun, 28 Jan 2018 20:30:44 +0100": [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] * < COMBINING CIRCUMFLEX> To be consistent, to find you would construct [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]]]] * ( [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]* | [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]]* ) (A final ')' got lost between brain and text; I have restored it.) However, decomposes to . It doesn't match your regular expression, for between COMBINING DIAERESIS and COMBINING DOT BELOW there is COMBINING MACRON, for which ccc = above! 
> The regexp engine will then only process the "fully decomposed" input
> text to find matches, using the regexp transformed as above (which
> has some initial "complex" setup to "fully decompose" the initial
> regexp, but only once when constructing it, not while processing
> the input text, which can then be done straightforwardly, with its full
> decomposition easily performed on the fly without any additional
> buffering except the very small lookahead whose length is never
> longer than the longest "canonical" decompositions in the UCD, i.e.
> at most 2 code points of lookahead).

Nitpick: U+1F84 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND
YPOGEGRAMMENI decomposes to <U+03B1, U+0313, U+0301, U+0345>. Conversion to
NFD on input only requires a small buffer for natural orthographies. I
suspect the worst in natural language will come from either narrow IPA
transcriptions or Classical Greek.

> The automaton is of course the classic NFA used by regexp engines
> (and not the DFA, which explodes combinatorially for some regexps), but
> it is still fully deterministic (the current "state" in the
> automaton is not a single integer for the node number in the traversal
> graph, but a set of node numbers, and all regexps have a finite
> number of nodes in the traversal graph, this number being
> proportional to the length of the regexp, so it does not need a lot of
> memory, and the size of the current "state" is also fully bounded,
> never larger than the length of the regexp). Optimizing some
> contextual parts of the NFA to a DFA is possible (to speed up the
> matching process and reduce the maximum storage size of the "current
> state"), but only if it does not cause a growth of the total number of
> nodes in the traversal graph, or as long as this growth does not exceed
> some threshold (e.g. not more than 2 or 3 times the regexp size).

In your claim, what is the length of the regexp for searching for
<U+0061, COMBINING DOT BELOW, COMBINING DIAERESIS> in a trace? Is it 3, or
is it about 14? If the former, I am very interested in how you do it. If
the latter, I would say you already have a form of blow-up in the way you
cater for canonical equivalence.

Even with the dirty trick of normalising the searched trace for input (I
wanted the NFA propagation to be defined by the trace - I also didn't want
to have to worry about the well-formedness of DFAs or NFAs), I found that
the number of states for a concatenation of regular languages of traces was
bounded above by the product of the numbers of states. This doesn't strike
me as inherently unreasonable, for I get the same form of bound for the
intersection of regular languages even for strings. In both cases, a lot of
the nodes for the concatenation or intersection are unreachable.

Kleene star is a problem for size. I think there is a polynomial bound for
the case when A* is a regular language. If I substitute 'concurrent star'
for Kleene star, which has the nice property that the concurrent star of a
regular trace language is itself regular, then the bound I have on the
number of states of the concurrent star is proportional to the third power
of the number of states for the original trace language. The states are
fairly simply derived from the states for recognising the regular language
A. (My size bounds are for the traces of fully decomposed Unicode character
strings under canonical equivalence. I am not sure that they hold for
arbitrary traces.)

I believe the concurrent star of a language A is (|A|)*, where |A| = {x ∈
A : {x}* is a regular language}. (The definition works for the traces of
fully decomposed Unicode character strings under canonical equivalence.)

Concurrent star is not a perfect generalisation. If ab = ba, then
X = {aa, ab, b} has the annoying property that X* is a regular trace
language, but |X|* is a proper subset of X*. For Unicode, X would be a
rather unusual regular language.

Richard.
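(A side note for readers who want to experiment with the last example: when
the alphabet is just {a, b} with the single commutation ab = ba, a trace is
determined by its letter counts, so X* and |X|* can be enumerated as
submonoids of N^2. The sketch below assumes |X| = {aa, b}, on the grounds
that the string language of (ab)* under this commutation - all words with
equally many a's and b's - is not regular; the code only exhibits the proper
containment by brute force up to a bound, and says nothing about the
regularity claims themselves.)

    # Brute-force sketch of the example X = {aa, ab, b} with ab = ba.
    # With a two-letter alphabet and full commutation, a trace is just a
    # pair of letter counts, so generated trace languages are submonoids
    # of N^2 and can be enumerated up to a total-length bound.

    def generated(generators, bound=10):
        """Count-vectors reachable from the generators, total count <= bound."""
        seen = {(0, 0)}
        frontier = {(0, 0)}
        while frontier:
            step = set()
            for (na, nb) in frontier:
                for (ga, gb) in generators:
                    t = (na + ga, nb + gb)
                    if t not in seen and t[0] + t[1] <= bound:
                        seen.add(t)
                        step.add(t)
            frontier = step
        return seen

    X = {(2, 0), (1, 1), (0, 1)}        # aa, ab, b as count-vectors
    X_bar = {(2, 0), (0, 1)}            # assumed |X| = {aa, b}

    print((1, 1) in generated(X))       # True:  the trace ab lies in X*
    print((1, 1) in generated(X_bar))   # False: but not in |X|*, so |X|*
                                        #        is a proper subset of X*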