From unicode at unicode.org Wed May 1 03:30:06 2019
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Wed, 1 May 2019 10:30:06 +0200
Subject: Aw: Re: Symbols of colors used in Portugal for transport
In-Reply-To: <0CA4EF90-CA21-476C-AC58-757E7E8B83A5@telia.com>
References: <20190429123444.665a7a7059d7ee80bb4d670165c8327d.3d307d3f9a.wbe@email03.godaddy.com> <67e8ec6d-d8c8-1456-f0e2-006f8a95e40e@kli.org> <0CA4EF90-CA21-476C-AC58-757E7E8B83A5@telia.com>
Message-ID:

An HTML attachment was scrubbed...

From unicode at unicode.org Wed May 1 05:23:51 2019
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Wed, 1 May 2019 15:53:51 +0530
Subject: Emoji boom?
Message-ID:

http://www.unicode.org/L2/L-curdoc.htm

The number of emoji-related proposals seems to be increasing compared to the number of script-related ones.

Have we reached a plateau re scripts encoding?

Somehow this seems sad to me considering the great role Unicode played in bringing Indic scripts (from my POV as an Indian) to mainstream digital devices.

--
Shriramana Sharma

From unicode at unicode.org Wed May 1 07:19:38 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 1 May 2019 05:19:38 -0700
Subject: Emoji boom?
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...

From unicode at unicode.org Wed May 1 09:59:51 2019
From: unicode at unicode.org (Phillips, Addison via Unicode)
Date: Wed, 1 May 2019 14:59:51 +0000
Subject: Emoji boom?
References:
Message-ID: <149e9824a5b2470eb61978c2544957b3@EX13D08UWB002.ant.amazon.com>

Why is this surprising? Encoding a script is many many orders of magnitude more complex than encoding emoji. This is especially true given that the scripts that remain unencoded are largely used by small populations (or, in the case of historic scripts, by *no* population at all). It is a complex, painstaking business.
In many ways emoji is actually a godsend for this effort, since it attracts attention to Unicode programs such as Adopt a Character, which funds script encoding grants, and ultimately results in Unicode being better able to serve its deeper mission of making all the world's languages digitally accessible. What's more, when implementations support emoji features, such as ZWJ sequences, variation selectors, etc., they are also building necessary mechanisms for supporting complex scripts (many of which are recently encoded or on the roadmap).

Addison Phillips
Sr. Principal SDE ? I18N (Amazon)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

> > -----Original Message-----
> > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of
> > Shriramana Sharma via Unicode
> > Sent: Wednesday, May 01, 2019 3:24 AM
> > To: UnicoDe List
> > Subject: Emoji boom?
> >
> > http://www.unicode.org/L2/L-curdoc.htm
> >
> > The number of emoji-related proposals seems to be increasing compared
> > to the number of script-related ones.
> >
> > Have we reached a plateau re scripts encoding?
> >
> > Somehow this seems sad to me considering the great role Unicode played
> > in bringing Indic scripts (from my POV as an Indian) to mainstream
> > digital devices.
> >
> > --
> > Shriramana Sharma

From unicode at unicode.org Thu May 2 10:44:56 2019
From: unicode at unicode.org (J Andrew Lipscomb via Unicode)
Date: Thu, 2 May 2019 11:44:56 -0400
Subject: Symbols of colors used in Portugal for transport
Message-ID:

Why not just use U+25E4 and U+25E2 for the triangles, and U+2215 for the diagonal?

From unicode at unicode.org Thu May 2 11:36:53 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 2 May 2019 09:36:53 -0700
Subject: Symbols of colors used in Portugal for transport
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
From unicode at unicode.org Fri May 3 03:01:33 2019
From: unicode at unicode.org (Jack Rueter via Unicode)
Date: Fri, 3 May 2019 11:01:33 +0300
Subject: asking advice of the Unicode community on new character proposal
Message-ID:

Hello!

I am looking for advice from the Unicode community.

I am working within the Finnish NB on a proposal for additional characters used to write the Komi-Permyak and Komi-Zyrian languages in Latin script in the 1930s (1932-1937 in Komi-Permyak (Latin alone) and 1932-1935 in Komi-Zyrian publications). Prior to this period a modified Cyrillic alphabet (Molodcov, for which supplementary characters are encoded in the range U+0500–050F) was used by both the Komi-Zyrians and Komi-Permyaks (in 1936 the Komi-Zyrians completely reverted to Molodcov, which some Komi-Zyrian publishers had retained throughout). By late 1938 both Komi-Permyak and Komi-Zyrian orthographies became closely aligned with the Russian Cyrillic system, with only two supplementary characters. The previously used Old Permic characters are encoded in the range U+10350–1037F.

Komi belongs to the Uralic (Finno-Ugric) family of languages, related to Finnish. It is spoken in the Republic of Komi, a member state of the Russian Federation, and the Permski Krai.

The additional Latin characters to be proposed include Latin capital and small letters C, D, L, S, T and ? with descenders. They also include a number of Cyrillic letters, capital and small Ukrainian IE (in Komi a hard affricate CHA) and Soft Sign (in Komi a high central unrounded vowel), used together with Latin letters. Could/should these (four) be encoded as Latin characters (which would clearly add to confusables) or how could the mix of scripts be best handled?

Sincerely,
Jack Rueter
Ph.D., Language Researcher at University of Helsinki
-------------- next part --------------
An HTML attachment was scrubbed...
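To make the confusables concern in the message above concrete, here is a minimal Python sketch (not part of the proposal) showing how a Latin-script word that borrows the Cyrillic letters mentioned, Ukrainian IE (U+0454) and Soft Sign (U+044C), shows up as mixed-script text. The helper name `scripts_in` and the sample strings are hypothetical, and taking the first word of each character name is only a rough stand-in for the real Script property, which the Python standard library does not expose.

```python
import unicodedata


def scripts_in(text):
    """Return a rough set of script labels for the letters in `text`.

    Assumption: the first word of the Unicode character name (LATIN,
    CYRILLIC, ...) is used as an approximation of the Script property.
    """
    labels = set()
    for ch in text:
        if ch.isalpha():
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


# Hypothetical sample: Latin "s" followed by CYRILLIC SMALL LETTER
# UKRAINIAN IE (U+0454) and CYRILLIC SMALL LETTER SOFT SIGN (U+044C).
mixed_sample = "s\u0454\u044c"
latin_sample = "komi"

print(sorted(scripts_in(mixed_sample)))
print(sorted(scripts_in(latin_sample)))
```

A check of this kind flags the borrowed letters as a script mix, which is the situation that separate Latin code points would avoid at the cost of adding new look-alike characters.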
From unicode at unicode.org Fri May 3 12:07:42 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 3 May 2019 18:07:42 +0100
Subject: asking advice of the Unicode community on new character proposal
In-Reply-To:
References:
Message-ID: <20190503180742.6fa38d2a@JRWUBU2>

On Fri, 3 May 2019 11:01:33 +0300
Jack Rueter via Unicode wrote:

> The additional Latin characters to be proposed include Latin capital
> and small letters C, D, L, S, T and ? with descenders. They also
> include a number of Cyrillic letters, capital and small Ukrainian IE
> (in Komi a hard affricate CHA) and Soft Sign (in Komi a high central
> unrounded vowel), used together with Latin letters. Could/should
> these (four) be encoded as Latin characters (which would clearly add
> to confusables) or how could the mix of scripts be best handled?

The latter pair may already be encoded as U+0184/U+0185 LATIN CAPITAL/SMALL LETTER TONE SIX, which was once intended to use the glyph of the Cyrillic soft sign.

The Ukrainian IE used in Latin script is certainly eligible, by the principle of separation of scripts. The only challenge I can see would be a claim that it was already encoded as U+0190/U+025B LATIN CAPITAL/SMALL LETTER OPEN E.

Richard.

From unicode at unicode.org Mon May 6 18:30:03 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 7 May 2019 00:30:03 +0100
Subject: Choice between Identical Tai Tham Characters
Message-ID: <20190507003003.292ed0db@JRWUBU2>

What authoritative recommendations or injunctions have been given for choosing between the encodings and for the subscript character known natively as 'hang ba'?

The choice has no implication as to glyph shape or the pronunciation of the character, and the only difference in Unicode-associated properties is that the difference is a primary difference in the DUCET default and CLDR root collations.
It is quite conceivable that a prescribed choice may be intended to distinguish homophonous homographs, e.g. ???? 'bad smell' v. 'curse', which are usually spelt differently in Northern Thai in the Thai script and are spelt differently in Thai (??? v. ???).

This subscript consonant is used in all the languages that regularly use the script.

I can think of some common sense rules such as, "A Pali writing system should use only one of U+1A37 and U+1A38", but it's not impossible that even this has been overridden.

The Khmer script has a similar issue with COENG DA and COENG TA, but between them they represent two different sounds, and TUS recommends that the encoding be chosen on the basis of the sound.

Richard.

From unicode at unicode.org Mon May 6 20:27:11 2019
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Mon, 6 May 2019 21:27:11 -0400
Subject: MIRROR emoji
Message-ID: <5678c8d4-a4e2-a3f4-99ea-c2cc9535bd9e@kli.org>

An HTML attachment was scrubbed...

From unicode at unicode.org Thu May 9 10:55:23 2019
From: unicode at unicode.org (Ed Trager via Unicode)
Date: Thu, 9 May 2019 11:55:23 -0400
Subject: What is the time frame for USE shapers to provide support for CV+C ?
Message-ID:

Hi, Andrew and Behdad,

Prompted by a conversation I had with Liang Hai yesterday, I am just curious to get some idea about the following:

(1) When can we anticipate that the USE spec will be updated to provide support for subjoined consonants below vowels (as required for TAI THAM) ?

(2) Once the USE spec is updated, how much lag time can we expect until Microsoft actually releases an implementation with said support for CV+C ?

(3a) And the related question, for Behdad and the HarfBuzz development group, is when can we expect to see CV+C support (at least for TAI THAM) in HarfBuzz ?

(3b) Would the HarfBuzz team consider providing CV+C support for TAI THAM even before the USE spec gets updated, so that we could test things out ?
* **

---------------------------------------

* PLEASE AND THANKYOU?

** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37, transcribed to Central Thai script as ???, (*to kiss*). Currently, people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("???") which violates the "phonetic ordering" but is the current workaround because USE is still broken for TAI THAM.

REFERENCE DOCUMENT: http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf
-------------- next part --------------
An HTML attachment was scrubbed...

From unicode at unicode.org Thu May 9 14:04:38 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 9 May 2019 20:04:38 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References:
Message-ID: <20190509200438.567759ff@JRWUBU2>

On Thu, 9 May 2019 11:55:23 -0400
Ed Trager via Unicode wrote:

> ** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37 ,
> transcribed to Central Thai script as ???, (*to kiss*). Currently,
> people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("???") which
> violates the "phonetic ordering" but is the current workaround
> because USE is still broken for TAI THAM.
>
> REFERENCE DOCUMENT:
> http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf

How is this a good test case? The 6th preliminary recommendation reads, "To represent a cluster, regardless of the phonetic order CCV or CVC, a consonant sign should always be encoded before the vowel sign, unless the vowel sign has inline advance and is apparently followed by the consonant sign". If this recommendation is adopted, then the spelling "U+1A27 U+1A6A U+1A60 U+1A37" will be wrong.

Now, SIGN U and SIGN UU before subscript BA, HIGH PA and LOW YA aren't always written as though they followed the subscript consonants in phonetic order. Sometimes the vowel sign is written in the bottom left of the syllable.
Presumably we'll need 3 or 4 new signs:

TAI THAM UNAMBIGUOUS UB
TAI THAM UNAMBIGUOUS UUB
TAI THAM UNAMBIGUOUS UY
TAI THAM UNAMBIGUOUS UUY (?)

I'm not sure that the fourth one can occur.

An example of the contrast is shown in the attached files luynam.png, with first orthographic syllable , and yukya.png, with the first orthographic syllable .

I wonder how we'd be supposed to encode ?????? (currently 'to crawl')? The simplest way would be to encode it as , which currently encodes the unlikely ??????. Will good fonts be expected to move the vowel left and down from the subscript LOW YA to the MEDIAL LA? Or will we need to encode it with *TAI THAM UNAMBIGUOUS UY?

Richard.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: luynam.png
Type: image/png
Size: 2132 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: yukya.png
Type: image/png
Size: 2406 bytes
Desc: not available

From unicode at unicode.org Mon May 13 19:58:07 2019
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Tue, 14 May 2019 00:58:07 +0000
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References:
Message-ID:

Here is the essence of the initial changes needed to support CV+C. Open to feedback.

* Create new SAKOT class
  SAKOT (Sk) based on UISC = Invisible_Stacker

* Reduced HALANT class
  Now only HALANT (H) based on UISC = Virama

* Updated Standard cluster mode

  [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)* (FBlw)* (FPst)* [FM]

The only required component of a standard cluster is a BASE or BASE_OTHER. A cluster may optionally begin with a REPH or CONS_WITH_STACKER.
A BASE or BASE_OTHER may be followed immediately by a VARIATION_SELECTOR and/or multiple CONS_MOD characters in the order CONS_MOD_ABOVE CONS_MOD_BELOW. Multiple sequences of a HALANT BASE or SAKOT BASE with optional VARIATION_SELECTOR or optional CONS_MOD can occur. The sequence can continue with zero or one CONS_MED for each cardinal position (Pre, Above, Below, Post); zero to many VOWEL characters in each cardinal position; zero to many VOWEL_MODs in each cardinal position; zero to many sequences of SAKOT BASE; zero to many CONS_FINALs in each of Above, Below, and Post; and lastly, an optional FINAL_MOD.

* Updated Halant-terminated cluster

  [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* < H | Sk >

This is similar to the Standard cluster but terminates in a final HALANT or SAKOT after a BASE, BASE_OTHER, or CONS_MOD. When such a HALANT or SAKOT it will form a cluster. When any character other than a BASE or BASE_OTHER follows the HALANT or SAKOT there will be a cluster break between the HALANT or SAKOT and the following character. Multiple sequences of a HALANT BASE or SAKOT BASE with optional VARIATION_SELECTOR or optional CONS_MOD can occur. A CONS_SUBJ is equivalent to the sequence HALANT BASE.

* New Sakot-terminated cluster

  [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B [VS] (CMAbv)* (CMBlw)*)* Sk

This is similar to the Standard cluster but terminates in a final SAKOT after a VOWEL or VOWEL_MOD. When such a SAKOT follows a VOWEL or VOWEL_MOD it will form a cluster. When any character other than a BASE or BASE_OTHER follows this SAKOT there will be a cluster break between the SAKOT and the following character. Multiple sequences of a SAKOT BASE with optional VARIATION_SELECTOR or optional CONS_MOD can occur. A CONS_SUBJ is equivalent to the sequence HALANT BASE.
This would allow a consonant to follow a vowel when joined with a Sakot. It would support multiple final consonants. It would not support polysyllabic chaining of CV+CV+CV etc.

Cheers,

Andrew

From: Behdad Esfahbod
Sent: 10 May 2019 11:32
To: Ed Trager
Cc: Andrew Glass; Unicode Mailing List
Subject: Re: What is the time frame for USE shapers to provide support for CV+C ?

I'm open to doing that if there's consensus on how it should be done.

On Thu, May 9, 2019 at 8:55 AM Ed Trager wrote:

Hi, Andrew and Behdad,

Prompted by a conversation I had with Liang Hai yesterday, I am just curious to get some idea about the following:

(1) When can we anticipate that the USE spec will be updated to provide support for subjoined consonants below vowels (as required for TAI THAM) ?

(2) Once the USE spec is updated, how much lag time can we expect until Microsoft actually releases an implementation with said support for CV+C ?

(3a) And the related question, for Behdad and the HarfBuzz development group, is when can we expect to see CV+C support (at least for TAI THAM) in HarfBuzz ?

(3b) Would the HarfBuzz team consider providing CV+C support for TAI THAM even before the USE spec gets updated, so that we could test things out ? * **

---------------------------------------

* PLEASE AND THANKYOU?

** A good use case is the Tai Tham word U+1A27 U+1A6A U+1A60 U+1A37, transcribed to Central Thai script as ???, (to kiss). Currently, people are writing this as U+1A27 U+1A60 U+1A37 U+1A6A ("???") which violates the "phonetic ordering" but is the current workaround because USE is still broken for TAI THAM.

REFERENCE DOCUMENT: http://www.unicode.org/L2/L2018/18332-tai-tham-ad-hoc-report.pdf

--
behdad
http://behdad.org/
-------------- next part --------------
An HTML attachment was scrubbed...
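As a rough way to experiment with the proposed "Updated Standard cluster" pattern from the message above, the following Python sketch transcribes it into a regular expression over class labels. This is an illustration only, not the USE specification or any shipping shaper. The helper names (`tok`, `opt`, `star`, `alt`, `is_standard_cluster`) are hypothetical, and the class labels assigned to the code points of Ed Trager's example (U+1A27 as B, U+1A6A as VBlw, U+1A60 as Sk, U+1A37 as B) are assumptions made for the demonstration.

```python
import re


def tok(name):
    # One class label followed by a single space, so labels cannot run together.
    return f"(?:{re.escape(name)} )"


def opt(pattern):
    return f"(?:{pattern})?"


def star(pattern):
    return f"(?:{pattern})*"


def alt(*patterns):
    return "(?:" + "|".join(patterns) + ")"


# [VS] (CMAbv)* (CMBlw)*
CONS_MODS = opt(tok("VS")) + star(tok("CMAbv")) + star(tok("CMBlw"))

# Transcription of the proposed "Updated Standard cluster" pattern.
STANDARD_CLUSTER = re.compile(
    opt(alt(tok("R"), tok("CS")))
    + alt(tok("B"), tok("GB"))
    + CONS_MODS
    + star(alt(alt(tok("H"), tok("Sk")) + tok("B"), tok("SUB")) + CONS_MODS)
    + "".join(opt(tok(m)) for m in ("MPre", "MAbv", "MBlw", "MPst"))
    + "".join(star(tok(v)) for v in ("VPre", "VAbv", "VBlw", "VPst"))
    + "".join(star(tok(v)) for v in ("VMPre", "VMAbv", "VMBlw", "VMPst"))
    + star(tok("Sk") + tok("B"))
    + "".join(star(tok(f)) for f in ("FAbv", "FBlw", "FPst"))
    + opt(tok("FM"))
)


def is_standard_cluster(classes):
    """Return True if the list of class labels forms one standard cluster."""
    text = "".join(label + " " for label in classes)
    return STANDARD_CLUSTER.fullmatch(text) is not None


# Assumed class labels for U+1A27 U+1A6A U+1A60 U+1A37 (phonetic order, CV+C).
phonetic_order = ["B", "VBlw", "Sk", "B"]
# Assumed class labels for the workaround order U+1A27 U+1A60 U+1A37 U+1A6A.
workaround_order = ["B", "Sk", "B", "VBlw"]

print(is_standard_cluster(phonetic_order))
print(is_standard_cluster(workaround_order))
print(is_standard_cluster(["B", "VBlw", "H", "B"]))
```

Under these assumptions both the phonetic-order spelling and the current workaround match as a single cluster, while a HALANT-joined consonant after a vowel does not, which reflects the split between the SAKOT and HALANT classes described above.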
From unicode at unicode.org Mon May 13 21:08:04 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 14 May 2019 03:08:04 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References:
Message-ID: <20190514030804.17e1b37b@JRWUBU2>

On Tue, 14 May 2019 00:58:07 +0000
Andrew Glass via Unicode wrote:

> Here is the essence of the initial changes needed to support CV+C.
> Open to feedback.
>
> * Create new SAKOT class
> SAKOT (Sk) based on UISC = Invisible_Stacker
> * Reduced HALANT class
> Now only HALANT (H) based on UISC = Virama
> * Updated Standard cluster mode
>
> [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB >
> [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)*
> (VAbv)* (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)*
> (FAbv)* (FBlw)* (FPst)* [FM]

This comes a lot closer to supporting Tai Tham monosyllabic clusters. Although this shouldn't affect Tai Tham, some of those medials need to be made repeatable; I believe this has already been done in HarfBuzz.

I trust you'll be reclassifying U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA and U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA into the category SUB so that we can write about bananas forever (?????????????):

/kluai/ 'banana'
/t?al??t/ 'for ever'

The issues here are that WA in a medial rôle is indistinguishable from a coda ('sakot') consonant and that MEDIAL RA can act as a consonant aspirator.

Unfortunately, we didn't define a consonant HIGH RATTHA with a canonical decomposition to . The problem is that 'HIGH RATTHA', widely seen as an alternative form of HIGH RATHA, can act as a subscript coda consonant. There are also a couple of words in the Northern Thai Dictionary of Palm-Leaf Manuscripts where MEDIAL LA acts as a coda consonant. Together, these call for (Sk B)* to be replaced by ().

This next question does not, I believe, affect HarfBuzz. Will NFC code render as well as unnormalised code?
In the first example above, normalises to , which does not match any portion of the regular expression.

Richard.

From unicode at unicode.org Tue May 14 14:49:25 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 14 May 2019 20:49:25 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <20190514030804.17e1b37b@JRWUBU2>
References: <20190514030804.17e1b37b@JRWUBU2>
Message-ID: <20190514204925.661f1ff6@JRWUBU2>

On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode wrote:

> Together,
> these call for (Sk B)* to be replaced by ().

Correction: Together, these call for (Sk B)* to be replaced by ()*.

Richard.

From unicode at unicode.org Tue May 14 20:54:56 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 15 May 2019 02:54:56 +0100
Subject: Lao Nukta
Message-ID: <20190515025456.769a083e@JRWUBU2>

I was looking through Maha Sena's textbook on Tai Tham for Pali, and I noticed that he had a Lao script Pali section that made use of a nukta that seems to me to be indistinguishable from U+0EBA LAO SIGN PALI VIRAMA. Is it therefore in order to use that character for this nukta, just as U+0E3A THAI CHARACTER PHINTHU functions as a nukta?

Now the nukta and the vowels below slightly interact, with the nukta on the left and the vowel below on the right. As U+0EBA has ccc=9 and the Lao vowels below have ccc=118, this seems to be fine. (Of course, I may have to wait to find a font that arranges them correctly.) I attach an example of the word "vi???h?ti".

Richard.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vinnuhiti.png
Type: image/png
Size: 13639 bytes
Desc: not available

From unicode at unicode.org Wed May 15 06:22:09 2019
From: unicode at unicode.org (Costello, Roger L. via Unicode)
Date: Wed, 15 May 2019 11:22:09 +0000
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
Message-ID:

Hello Unicode experts!

Which is correct:

(a) The input file contains a string. The string is encoded using UTF-8.

(b) The input file contains a string. The string is encoded with UTF-8.

(c) The input file contains a string. The string is encoded in UTF-8.

(d) Something else (what?)

/Roger

From unicode at unicode.org Wed May 15 06:51:23 2019
From: unicode at unicode.org (Aleksey Tulinov via Unicode)
Date: Wed, 15 May 2019 14:51:23 +0300
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To:
References:
Message-ID:

From The Unicode® Standard Version 12.0 – Core Specification:

"5.7 Compression ... Encoding forms defined in Section 2.5, Encoding Forms, have different storage characteristics. For example, as long as text contains only characters from the Basic Latin (ASCII) block, it occupies the same amount of space whether it is encoded with the UTF-8 or ASCII codes. Conversely, text consisting of CJK ideographs encoded with UTF-8 will require more space than equivalent text encoded with UTF-16."

Hope this helps.

On Wed, 15 May 2019 at 14:24, Costello, Roger L. via Unicode wrote:
>
> Hello Unicode experts!
>
> Which is correct:
>
> (a) The input file contains a string. The string is encoded using UTF-8.
>
> (b) The input file contains a string. The string is encoded with UTF-8.
>
> (c) The input file contains a string. The string is encoded in UTF-8.
>
> (d) Something else (what?)
>
> /Roger
>

From unicode at unicode.org Wed May 15 07:56:54 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 15 May 2019 05:56:54 -0700
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To:
References:
Message-ID: <67fda42f-fbac-2b7e-9a46-10418db2ba06@ix.netcom.com>

An HTML attachment was scrubbed...
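The storage remark in the passage from Section 5.7 quoted earlier in this thread can be checked with a few lines of Python. This is only a small sketch; the helper name `encoded_sizes` and the sample strings are hypothetical, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_text = "Unicode"        # Basic Latin only
cjk_text = "\u6f22\u5b57"     # two CJK ideographs

# Basic Latin text: ASCII and UTF-8 take the same number of bytes.
print(len(ascii_text.encode("ascii")), encoded_sizes(ascii_text))

# CJK ideographs: UTF-8 needs more bytes than UTF-16.
print(encoded_sizes(cjk_text))
```

For the seven ASCII letters, the ASCII and UTF-8 encodings both occupy 7 bytes and UTF-16 occupies 14; for the two ideographs, UTF-8 occupies 6 bytes and UTF-16 occupies 4.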
From unicode at unicode.org Wed May 15 08:21:11 2019
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Wed, 15 May 2019 13:21:11 +0000
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To:
References:
Message-ID:

> On May 15, 31 Heisei, at 12:22 pm, Costello, Roger L. via Unicode wrote:
>
> Hello Unicode experts!
>
> Which is correct:
>
> (a) The input file contains a string. The string is encoded using UTF-8.
>
> (b) The input file contains a string. The string is encoded with UTF-8.
>
> (c) The input file contains a string. The string is encoded in UTF-8.
>
> (d) Something else (what?)
>
> /Roger
>

(d) The input file contains a string which is UTF-8 encoded.

André Schappo

From unicode at unicode.org Wed May 15 08:53:39 2019
From: unicode at unicode.org (Neil Shadrach via Unicode)
Date: Wed, 15 May 2019 14:53:39 +0100
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To:
References:
Message-ID:

(e) The input file contains a UTF-8 encoded string.

On Wed, 15 May 2019 at 14:22 Andre Schappo via Unicode wrote:

>
>
> > On May 15, 31 Heisei, at 12:22 pm, Costello, Roger L. via Unicode <unicode at unicode.org> wrote:
> >
> > Hello Unicode experts!
> >
> > Which is correct:
> >
> > (a) The input file contains a string. The string is encoded using UTF-8.
> >
> > (b) The input file contains a string. The string is encoded with UTF-8.
> >
> > (c) The input file contains a string. The string is encoded in UTF-8.
> >
> > (d) Something else (what?)
> >
> > /Roger
> >
>
> (d) The input file contains a string which is UTF-8 encoded.
>
> André Schappo
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
From unicode at unicode.org Wed May 15 09:36:20 2019
From: unicode at unicode.org (Rebecca T via Unicode)
Date: Wed, 15 May 2019 10:36:20 -0400
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To:
References:
Message-ID:

I think that colloquially "the file contains a UTF-8 string" is best, but perhaps not in more formal communications.

On Wed, May 15, 2019, 7:24 AM Costello, Roger L. via Unicode <unicode at unicode.org> wrote:

> Hello Unicode experts!
>
> Which is correct:
>
> (a) The input file contains a string. The string is encoded using UTF-8.
>
> (b) The input file contains a string. The string is encoded with UTF-8.
>
> (c) The input file contains a string. The string is encoded in UTF-8.
>
> (d) Something else (what?)
>
> /Roger
>
>
-------------- next part --------------
An HTML attachment was scrubbed...

From unicode at unicode.org Wed May 15 13:16:43 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 15 May 2019 19:16:43 +0100
Subject: Correct way to express in English that a string is encoded ... using UTF-8 ... with UTF-8 ... in UTF-8?
In-Reply-To: <67fda42f-fbac-2b7e-9a46-10418db2ba06@ix.netcom.com>
References: <67fda42f-fbac-2b7e-9a46-10418db2ba06@ix.netcom.com>
Message-ID: <20190515191643.7495f019@JRWUBU2>

On Wed, 15 May 2019 05:56:54 -0700
Asmus Freytag via Unicode wrote:

> On 5/15/2019 4:22 AM, Costello, Roger L. via Unicode wrote:
> Hello Unicode experts!
>
> Which is correct:
>
> (a) The input file contains a string. The string is encoded using
> UTF-8.
>
> (b) The input file contains a string. The string is encoded with
> UTF-8.
>
> (c) The input file contains a string. The string is encoded in UTF-8.
>
> (d) Something else (what?)
>
> /Roger
>
>
> I would say I've seen all three uses about equally.
>
> If you search for each phrase, though, "in" comes up as the most
> frequent one.
>
> That would make the last one, or simply "in UTF-8" (that is, without
> the "encoded") good choices for general audiences.

Additionally, the latter is about the current form of the string; the others refer to its history, suggesting it might once have been represented in some other way.

Richard.

From unicode at unicode.org Thu May 16 12:44:19 2019
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Thu, 16 May 2019 18:44:19 +0100 (BST)
Subject: QID Emoji and their applications
In-Reply-To: <3c5309f9.6a.16ac1a2e3d3.Webtop.216@btinternet.com>
References: <16452722.5b.16ac19c753a.Webtop.216@btinternet.com> <7152eb35.5d.16ac19df409.Webtop.216@btinternet.com> <3c5309f9.6a.16ac1a2e3d3.Webtop.216@btinternet.com>
Message-ID: <27c380c3.aa.16ac1beaaf5.Webtop.216@btinternet.com>

There are two versions of a proposal document for QID emoji currently available, the original and a revised version.

https://www.unicode.org/L2/L2019/19082-qid-emoji.pdf

https://www.unicode.org/L2/L2019/19082r-qid-emoji.pdf

I sent in two comments about the original proposal and they are included in the following document.

https://www.unicode.org/L2/L2019/19124-pubrev.html

In the event, there is a response in the minutes of meeting #159 of the Unicode Technical Committee about my comments.

https://www.unicode.org/L2/L2019/19122.htm#159-A17

The response about the QID emoji proposal itself is listed in those minutes.

https://www.unicode.org/L2/L2019/19122.htm#159-A83

I tried some experimentation during early April 2019 and the three following threads may possibly be of interest to some readers. There is a font which readers are welcome to download and try if they so choose.
https://forum.affinity.serif.com/index.php?/topic/82885-can-you-find-the-white-crested-tiger-heron/

https://forum.high-logic.com/viewtopic.php?f=10&t=7941

https://forum.high-logic.com/viewtopic.php?f=3&t=7942

William Overington

Thursday 16 May 2019

From unicode at unicode.org Mon May 20 16:53:36 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 20 May 2019 22:53:36 +0100
Subject: Lao Sign Pali Virama and vowels above
Message-ID: <20190520225336.15f7ee10@JRWUBU2>

When a consonant bears both U+0EBA LAO SIGN PALI VIRAMA (acting as a nukta) and a vowel above, is there or is there intended to be any constraint on their relative order? While U+0EBA has canonical combining class 9, the vowels above have canonical combining class 0, so the order makes a difference. Typographically, these marks don't interfere, but renderers may consider that to be a problem.

The example of nukta and vowel below has now gone up on Wiktionary at https://en.wiktionary.org/wiki/??????? . The rendering worry arises with the other form of the instrumental plural masculine.

MS Edge is currently giving me dotted circles for the sequences and . I trust this is just a temporary aberration.

Richard.

From unicode at unicode.org Mon May 20 17:21:25 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 20 May 2019 23:21:25 +0100
Subject: Lao Sign Pali Virama and vowels above
In-Reply-To: <20190520225336.15f7ee10@JRWUBU2>
References: <20190520225336.15f7ee10@JRWUBU2>
Message-ID: <20190520232125.528764a9@JRWUBU2>

On Mon, 20 May 2019 22:53:36 +0100
Richard Wordingham via Unicode wrote:

> MS Edge is currently giving me dotted circles for the sequences
> and UU>. I trust this is just a temporary aberration.

Also with the sequence , as in the nominative singular ???????????? of ????????????, which transliterates as sandiṭṭhika. This last example displays perfectly well on HarfBuzz renderers.

Richard.
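The point about combining classes raised in this thread can be illustrated with Python's `unicodedata` module. The sketch below is an illustration under stated assumptions: the base consonant (U+0E81 LAO LETTER KO) and the two vowel signs are arbitrary choices for the demonstration rather than the words discussed above, the variable and helper names are hypothetical, and the Unicode data bundled with the interpreter must be version 12.0 or later for U+0EBA to be known.

```python
import unicodedata

KO = "\u0e81"           # LAO LETTER KO (arbitrary base consonant)
SIGN_I = "\u0eb4"       # LAO VOWEL SIGN I, a vowel above (ccc=0)
SIGN_U = "\u0eb8"       # LAO VOWEL SIGN U, a vowel below (ccc=118)
PALI_VIRAMA = "\u0eba"  # LAO SIGN PALI VIRAMA (ccc=9, Unicode 12.0+)


def ccc(ch):
    """Canonical combining class of a single character."""
    return unicodedata.combining(ch)


def nfc(text):
    return unicodedata.normalize("NFC", text)


# Vowel below and virama both have non-zero classes, so normalisation
# sorts them and the two input orders become identical.
below_orders_equivalent = (
    nfc(KO + SIGN_U + PALI_VIRAMA) == nfc(KO + PALI_VIRAMA + SIGN_U)
)

# A vowel above has class 0, which blocks reordering, so the two input
# orders stay distinct after normalisation.
above_orders_equivalent = (
    nfc(KO + SIGN_I + PALI_VIRAMA) == nfc(KO + PALI_VIRAMA + SIGN_I)
)

print(unicodedata.unidata_version)
print(ccc(SIGN_I), ccc(SIGN_U), ccc(PALI_VIRAMA))
print(below_orders_equivalent, above_orders_equivalent)
```

With Unicode 12.0 or later data, the two orders with a vowel below normalise to the same string, while the two orders with a vowel above remain different strings, so a renderer or a convention has to settle which of the latter is expected.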
From unicode at unicode.org Mon May 20 19:36:33 2019
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Tue, 21 May 2019 00:36:33 +0000
Subject: Lao Sign Pali Virama and vowels above
In-Reply-To: <20190520232125.528764a9@JRWUBU2>
References: <20190520225336.15f7ee10@JRWUBU2> <20190520232125.528764a9@JRWUBU2>
Message-ID:

Hi Richard,

This is because the sequences include U+0EBA which was added in Unicode 12.0. Edge has not updated for Unicode 12 at this time.

Cheers,

Andrew

-----Original Message-----
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Monday, May 20, 2019 3:21 PM
To: unicode at unicode.org
Subject: Re: Lao Sign Pali Virama and vowels above

On Mon, 20 May 2019 22:53:36 +0100
Richard Wordingham via Unicode wrote:

> MS Edge is currently giving me dotted circles for the sequences
> and UU>. I trust this is just a temporary aberration.

Also with the sequence , as in the nominative singular ???????????? of ????????????, which transliterates as sandi??hika. This last example displays perfectly well on HarfBuzz renderers.

Richard.

From unicode at unicode.org Tue May 21 01:55:57 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 21 May 2019 07:55:57 +0100
Subject: Lao Sign Pali Virama and vowels above
In-Reply-To:
References: <20190520225336.15f7ee10@JRWUBU2> <20190520232125.528764a9@JRWUBU2>
Message-ID: <20190521075557.684ab43f@JRWUBU2>

On Tue, 21 May 2019 00:36:33 +0000
Andrew Glass via Unicode wrote:

> This is because the sequences include U+0EBA which was added in
> Unicode 12.0. Edge has not updated for Unicode 12 at this time.

That suspicion was why I was hoping it was a temporary aberration. When it is so updated, will it support these sequences? Yes, no or undecided? The Lao section of Microsoft Typography has not yet been updated. I've raised a formal issue at https://github.com/MicrosoftDocs/typography-issues/issues/238 .

Richard.
From unicode at unicode.org Wed May 22 02:49:34 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 22 May 2019 08:49:34 +0100
Subject: What is the best way to work around the current USE CV+C limitation in Tai Tham?
In-Reply-To:
References:
Message-ID: <20190522084934.1144f8c8@JRWUBU2>

On Wed, 22 May 2019 00:14:57 -0400
Ed Trager wrote:

> I'm hoping one or both of you can provide me some guidance on this,
> thank you! Unfortunately, my OpenType skills are not at the "ninja"
> level required to get around all of the limitations in USE ...

If blind copying of Lamphun or Da Lekh, which I allow and encourage in this respect, is not possible, then one can reduce the skill level required in one of two ways. One is to indiscriminately eliminate dotted circles that follow marks; that would simplify the conditions in those fonts. (I know that eliminating dotted circles present in the original string is wrong - it's collateral damage in opposing oppression.)

Unfortunately, it's not as simple as that. If the USE is still misclassifying the InSC medial consonants as USE-medial consonants, then they can still leave one with the need to do Indic reordering in the font, e.g. with the reflexes of /ria/, to be encoded , as the dotted circles prevent SIGN E reordering to the start of the cluster.

The second way is to attack individual cases. For example, one can probably get away with special substitutions to repair the HarfBuzz and Windows corruptions of ????. I don't know if CoreText has yet another corruption.

Richard.

From unicode at unicode.org Thu May 23 11:51:32 2019
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Thu, 23 May 2019 17:51:32 +0100 (BST)
Subject: QID Emoji and their applications
Message-ID: <36510a8.2131.16ae59adbf0.Webtop.227@btinternet.com>

There has been a development in that the following document was published yesterday.
https://www.unicode.org/L2/L2019/19203-wd-uts51-17-draft.pdf

I refer to Annex C.2 of that document. In that section the use of U+1F194 SQUARED ID is suggested as the base character for QID emoji.

I have thought of a mnemonic to help remember the code number - namely 1 F then the number of letters in each word of the phrase "a memorable code".

I have now produced a maquette font that uses that base character rather than the Private Use Area character that I used before.

Here is the substitution sequence that is within the new font.

sub u1F194 uE0051 uE0032 uE0031 uE0038 uE0035 uE0034 uE0033 uE007F -> glyph218543;

In order to try the experiment one needs to install the font. So here is the sequence that one needs to enter in order to cause the display of the (stylized) glyph that represents the white crested tiger heron in these experiments.

u1F194 uE0051 uE0032 uE0031 uE0038 uE0035 uE0034 uE0033 uE007F

That is, the SQUARED ID character, then the tag characters Q 2 1 8 5 4 3, then the CANCEL TAG character.

Please note that for the glyph substitution to work the OpenType liga feature needs to be on in whichever OpenType-aware application you use for the experiment.

The font fontQ218543maquette3 is in the file fontQ218543maquette3.otf and is included in each of the following threads.

https://forum.affinity.serif.com/index.php?/topic/82885-can-you-find-the-white-crested-tiger-heron/

https://forum.high-logic.com/viewtopic.php?f=10&t=7941

You are welcome to download, install and use the font.

----

I noticed in the document published yesterday the following, on page 45 of the PDF document.

quote

A subset of QIDs are associated with entities that would be valid for emoji. For example, risk management (Q189447) and this (Q3109046) would not be valid. Of those that are valid, Wikidata may not have associated images for the referenced entity, and such images would rarely - if ever - be appropriate for use as images for emoji.
end quote

I have it in mind to suggest that there should not be that restriction and that all QID items should be valid for emoji and thus for interchange and interoperability in a plain text environment.

Some may never be used, yet I am thinking that to state that some "would not be valid" would be a decision that could restrict progress and the implementation and beneficial application of new ideas in the future.

As it happens, when we were discussing the possibility of abstract emoji some time ago in this mailing list, I produced glyphs for "this" and for "that" as a gentleman had indirectly suggested the possibility.

They are about 60% of the way down the following web page.

http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm

I accept that "this" as in "this and that" is not the same as "this" as used in some computer languages, yet maybe, just maybe, a glyph for "this" used in that context could be like my design for a glyph for "this" with a large round dot, say in green, added in the lower right corner, so as to indicate a dot as used in listing the name of an object in some computer programming languages.

Restricting which QID items could be emoji also restricts the possibility of using the QID page data for text to speech. For example, risk management (Q189447) already has text in three languages.

Encoding abstract items as QID items, and thus as QID emoji, could help communication through the language barrier, possibly very helpfully in emergency situations.

I am thinking about a glyph for risk management. I am wondering if a red jagged shape enclosed within a yellow rounded shape might work. Shapes something like those in the following article.

https://en.wikipedia.org/wiki/Bouba/kiki_effect

William Overington

Thursday 23 May 2019
-------------- next part --------------
A non-text attachment was scrubbed...
Name: emoji_this.png
Type: image/png
Size: 3086 bytes
Desc: not available
URL:

From unicode at unicode.org Thu May 30 03:07:28 2019
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Thu, 30 May 2019 08:07:28 +0000
Subject: unicode tweet
Message-ID: <63B30FD9-1CF6-45F0-98A9-22C190CA519A@lboro.ac.uk>

This tweet made me laugh twitter.com/padolsey/status/1133835770773626881

André Schappo
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Thu May 30 07:56:46 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 30 May 2019 05:56:46 -0700
Subject: unicode tweet
In-Reply-To: <63B30FD9-1CF6-45F0-98A9-22C190CA519A@lboro.ac.uk>
References: <63B30FD9-1CF6-45F0-98A9-22C190CA519A@lboro.ac.uk>
Message-ID:

An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nojcneffinipmdck.png
Type: image/png
Size: 9828 bytes
Desc: not available
URL:

From unicode at unicode.org Thu May 30 11:49:13 2019
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Thu, 30 May 2019 09:49:13 -0700
Subject: Format A
Message-ID: <20190530094913.665a7a7059d7ee80bb4d670165c8327d.6684c000a8.wbe@email03.godaddy.com>

Apologies if this is a repeat of a (much) earlier inquiry.

The mapping tables that are available as part of the Unicode Standard (http://www.unicode.org/Public/MAPPINGS/) are generally provided in a text format called "Format A." Each line in the file defines a mapping between a character in a legacy encoding and the Unicode equivalent, with fields separated by tabs or sequences of spaces, like this:

0xA0    0x00A0  #NO-BREAK SPACE
0xA1    0x00A1  #INVERTED EXCLAMATION MARK
0xA2    0x00A2  #CENT SIGN

The format supports DBCS as well:

0x8140  0x4E02  #CJK UNIFIED IDEOGRAPH
0x8141  0x4E04  #CJK UNIFIED IDEOGRAPH
0x8142  0x4E05  #CJK UNIFIED IDEOGRAPH

My questions are:

1. Is there a specification for this format anywhere, and if so, where?

2. Is there a "Format B" or similar? (I don't mean UCM, CharMapML, RFC 1345 format, etc., but something truly similar to and/or derivative of Format A.)

Please reply on-list only if you think the list at large would benefit from your reply. I'm hoping some of the Unicode elders might have some insight here.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Fri May 31 05:18:10 2019
From: unicode at unicode.org (bristol_poo via Unicode)
Date: Fri, 31 May 2019 10:18:10 +0000
Subject: Proposal to extend the U+1F4A9 Symbol
Message-ID:

Greetings,

I hope I don't intrude too much on this list with a proposal.

U+1F4A9, aka the 'pile of poo' emoji, has gained somewhat of a legendary status in modern society [1].

With the somewhat recent addition of skin tones in the Emoji Modifier Sequences, I think there is some small room to add more depth to the emoji by modulating it via the Bristol Scale [2].

This would produce 7 variants of the U+1F4A9 emoji, including the existing one (which I believe is about Type 4 on the scale).

Why? I think this would really benefit the medical profession, with a large uptick in e-doctor/text-only conversations towards the medical profession.

Cheers
/BP

[1] We even have plush toys dedicated to this emoji https://www.amazon.co.uk/Emoji-Shape-Pillow-Cushion-Stuffed/dp/B00VL55Q8O
[2] https://en.wikipedia.org/wiki/Bristol_stool_scale
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Fri May 31 09:12:34 2019
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Fri, 31 May 2019 15:12:34 +0100
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To:
References:
Message-ID: <7B2BF8D8-360F-4B08-9752-BE93EB83E4DF@evertype.com>

No, thank you.

> On 31 May 2019, at 11:18, bristol_poo via Unicode wrote:
>
> Greetings,
>
> I hope I don't intrude too much on this list with a proposal.
>
> U+1F4A9, aka the 'pile of poo' emoji, has gained somewhat of a legendary status in modern society [1].
>
> With the somewhat recent addition of skin tones in the Emoji Modifier Sequences, I think there is some small room to add more depth to the emoji by modulating it via the Bristol Scale [2].
>
> This would produce 7 variants of the U+1F4A9 emoji, including the existing one (which I believe is about Type 4 on the scale).
>
> Why? I think this would really benefit the medical profession, with a large uptick in e-doctor/text-only conversations towards the medical profession.
>
> Cheers
> /BP
>
> [1] We even have plush toys dedicated to this emoji https://www.amazon.co.uk/Emoji-Shape-Pillow-Cushion-Stuffed/dp/B00VL55Q8O
> [2] https://en.wikipedia.org/wiki/Bristol_stool_scale

From unicode at unicode.org Fri May 31 12:09:25 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Fri, 31 May 2019 10:09:25 -0700
Subject: Proposal to extend the U+1F4A9 Symbol
In-Reply-To: <7B2BF8D8-360F-4B08-9752-BE93EB83E4DF@evertype.com>
References: <7B2BF8D8-360F-4B08-9752-BE93EB83E4DF@evertype.com>
Message-ID:

An HTML attachment was scrubbed...
URL: