From unicode at unicode.org Wed Aug 7 15:28:53 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 7 Aug 2019 21:28:53 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <20190514030804.17e1b37b@JRWUBU2>
References: <20190514030804.17e1b37b@JRWUBU2>
Message-ID: <20190807212853.3e7bb4b4@JRWUBU2>

On Tue, 14 May 2019 03:08:04 +0100
Richard Wordingham via Unicode wrote:

> On Tue, 14 May 2019 00:58:07 +0000
> Andrew Glass via Unicode wrote:
>
> > Here is the essence of the initial changes needed to support CV+C.
> > Open to feedback.
> >
> > * Create new SAKOT class
> >   SAKOT (Sk) based on UISC = Invisible_Stacker
> > * Reduced HALANT class
> >   Now only HALANT (H) based on UISC = Virama
> > * Updated Standard cluster mode
> >
> > [< R | CS >] < B | GB > [VS] (CMAbv)* (CMBlw)* (< < H | Sk > B | SUB >
> > [VS] (CMAbv)* (CMBlw)*)* [MPre] [MAbv] [MBlw] [MPst] (VPre)* (VAbv)*
> > (VBlw)* (VPst)* (VMPre)* (VMAbv)* (VMBlw)* (VMPst)* (Sk B)* (FAbv)*
> > (FBlw)* (FPst)* [FM]
>
> This next question does not, I believe, affect HarfBuzz. Will NFC
> code render as well as unnormalised code? In the first example above,
> normalises to , which
> does not match any portion of the regular expression.

Could someone answer this question, please? The USE documentation
("CGJ handling will need to be updated if USE is modified to support
normalization") still implies that the USE does not respect canonical
equivalence.

Richard.

From unicode at unicode.org Wed Aug 7 15:39:15 2019
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Wed, 7 Aug 2019 20:39:15 +0000
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <20190807212853.3e7bb4b4@JRWUBU2>
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2>
Message-ID:

That's correct, the Microsoft implementation of USE spec does not
normalize as part of the shaping process.

Why? Because the ccc system for non-Latin scripts is not a good
mechanism for handling complex requirements for these writing systems
and the effects of ccc-based normalization can disrupt authors' intent.
Unfortunately, because we cannot fix ccc values, shaping engines at
Microsoft have ignored them. Therefore, recommendation for passing text
to USE is to not normalize.

By the way, at the current time, I do not have a final consensus from
Tai Tham experts and community on the changes required to support Tai
Tham in USE. Therefore, I've not been able to make the changes proposed
in this thread.

Cheers,

Andrew
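A minimal Python sketch of the concern raised above. NFC sorts adjacent
combining marks by canonical combining class (ccc), so a SAKOT typed
after a tone mark is moved in front of it and no longer immediately
precedes the base it subscripts, which is what the "Sk B" terms in the
proposed cluster expression expect. The sequence used here is a
hypothetical illustration, not the example from the original message
(whose code points did not survive archiving), and the result depends
on the ccc values shipped with Python's unicodedata module.

```python
import unicodedata

def describe(text):
    """List each code point with its canonical combining class (ccc)."""
    return " ".join(
        f"U+{ord(c):04X}/ccc={unicodedata.combining(c)}" for c in text
    )

# Hypothetical input, chosen only for illustration:
# base letter, tone mark, SAKOT, base letter.
typed = "\u1A20\u1A76\u1A60\u1A20"
nfc = unicodedata.normalize("NFC", typed)

print("typed:", describe(typed))
print("NFC:  ", describe(nfc))
print("order changed by NFC:", typed != nfc)
```

Printing the ccc values next to the code points makes it easy to see
which marks NFC is free to move past each other.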
From unicode at unicode.org Wed Aug 7 16:19:26 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 7 Aug 2019 14:19:26 -0700
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2>
Message-ID: <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Wed Aug 7 19:08:04 2019
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Thu, 8 Aug 2019 00:08:04 +0000
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID:

Shaping domain names is a new requirement. It would be good to
understand the specific cases that are falling in the gap here.

From: Unicode On Behalf Of Asmus Freytag via Unicode
Sent: 07 August 2019 14:19
To: unicode at unicode.org
Subject: Re: What is the time frame for USE shapers to provide support for CV+C ?

What about text that must exist normalized for other purposes?

Domain names must be normalized to NFC, for example. Will such strings
display correctly if passed to USE?

A./

From unicode at unicode.org Wed Aug 7 19:16:38 2019
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Wed, 7 Aug 2019 17:16:38 -0700
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID:

On 8/7/2019 5:08 PM, Andrew Glass wrote:
>
> Shaping domain names is a new requirement. It would be good to
> understand the specific cases that are falling in the gap here.
>

Domain names are simply strings, but the protocol enforces
normalization to NFC. In some situations, it might be possible for a
browser, for example, to have access to the user-provided string, but I
can see any number of situations where the actual string (as stored in
the DNS) would need to be displayed.

For the scenario, it does not matter whether it's NFC or NFD; what
matters is that some particular un-normalized state would be lost, and
therefore it would be bad if the result is that the string can no
longer be rendered correctly, in particular as the strings in question
would be identifiers, where accurate recognition is prime.

A./

From unicode at unicode.org Wed Aug 7 19:33:47 2019
From: unicode at unicode.org (Andrew Glass via Unicode)
Date: Thu, 8 Aug 2019 00:33:47 +0000
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID:

I agree and understand that accurate representation is important in
this case. It would be good to understand how widespread the issue is
in order to begin to justify the work to retrofit shaping with
normalization. The number of problematic strings may be small but the
risk of regression in this case might be quite large.

Cheers,

Andrew

From unicode at unicode.org Wed Aug 7 23:54:08 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 7 Aug 2019 21:54:08 -0700
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID: <3ae70899-6f36-b080-5ea6-a6903a8a9e44@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Thu Aug 8 03:06:47 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Aug 2019 09:06:47 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID: <20190808090647.7db9c5b2@JRWUBU2>

On Wed, 7 Aug 2019 14:19:26 -0700
Asmus Freytag via Unicode wrote:

> What about text that must exist normalized for other purposes?
>
> Domain names must be normalized to NFC, for example. Will such
> strings display correctly if passed to USE?

One solution, of course, is to minimise the use of Microsoft products.
(The trick is to apply the normalisation algorithm using a permutation
of the positive ccc values.) The latest version of HarfBuzz renders
subscripted final consonants; it's slowly recovering its pre-USE
rendering capabilities.

> On 8/7/2019 1:39 PM, Andrew Glass via Unicode wrote:
> That's correct, the Microsoft implementation of USE spec does not
> normalize as part of the shaping process. Why?
> Because the ccc system
> for non-Latin scripts is not a good mechanism for handling complex
> requirements for these writing systems and the effects of ccc-based
> normalization can disrupt authors' intent. Unfortunately, because we
> cannot fix ccc values, shaping engines at Microsoft have ignored
> them. Therefore, recommendation for passing text to USE is to not
> normalize.

HarfBuzz solved the problem of by choosing a suitable normalisation; it
uses the same technique for Hebrew, where the normalisation classes are
also unfriendly to renderers.

> By the way, at the current time, I do not have a final consensus from
> Tai Tham experts and community on the changes required to support Tai
> Tham in USE. Therefore, I've not been able to make the changes
> proposed in this thread.

Grammatical denazification is one solution. Another one is to delegate
matters to the font. Give us a script type that will implement a GSUB
feature by default, and font writers can take it from there.

At present I have a conundrum on how to render the accusative singular
of the cruciform form of the word for enlightenment without using
chained syllables, _bodhiṃ_. The obvious visual encoding is . This
combination is very unusual, perhaps unique to this word. (Pali 'o'
is ). However, a very common combination, because the UTC refused Tai
Tham the character SIGN AM, is SIGN AA, MAI KANG, so for the USE, SIGN
AA and MAI KANG have to be in the same character class.
(Alternatively, we split the syllable before SIGN AA.) MAI KANG has
InSc=bindu, while SIGN AA is a right matra. Unfortunately, there is a
strong temptation for many to write what would have been 'SIGN AM' as
MAI KANG, SIGN AA, which is to be rendered quite differently from
'SIGN AM' outside Northern Thailand, e.g. in NE Thailand. (Northern
Thailand has both styles; it is quite diverse.)

If I understand the principles of USE, allowing both '... MAI KANG,
SIGN AA...' and '... SIGN AA, MAI KANG ....', which immediately after a
consonant have the same rendering in some fonts and very confusable
renderings in many others, is considered highly undesirable.

For Microsoft applications, another solution is for fonts to delete
dotted circles between Tai Tham characters. (I try to be more
selective, but this results in a complicated set of lookups to ensure
that deletion only occurs when the renderer has inserted inappropriate
dotted circles.) This is not compliant with Unicode, but neither is
deliberately treating canonically equivalent forms differently.

Richard.

From unicode at unicode.org Thu Aug 8 09:23:00 2019
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 8 Aug 2019 07:23:00 -0700
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To: <20190808090647.7db9c5b2@JRWUBU2>
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com> <20190808090647.7db9c5b2@JRWUBU2>
Message-ID: <55aebb70-8b4c-c404-0632-dbd54ebfccb0@ix.netcom.com>

An HTML attachment was scrubbed...

From unicode at unicode.org Thu Aug 8 12:38:13 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 8 Aug 2019 18:38:13 +0100
Subject: What is the time frame for USE shapers to provide support for CV+C ?
In-Reply-To:
References: <20190514030804.17e1b37b@JRWUBU2> <20190807212853.3e7bb4b4@JRWUBU2> <30d0ac2f-54d6-ebba-636e-18d7e42e9fde@ix.netcom.com>
Message-ID: <20190808183813.3413a08f@JRWUBU2>

On Thu, 8 Aug 2019 00:33:47 +0000
Andrew Glass via Unicode wrote:

> I agree and understand that accurate representation is important in
> this case. It would be good to understand how widespread the issue is
> in order to begin to justify the work to retrofit shaping with
> normalization. The number of problematic strings may be small but the
> risk of regression in this case might be quite large.
Well, you could always reverse engineer HarfBuzz! Just a reminder
though. You would be using a permutation of the canonical combining
classes - for Tai Tham, U+1A60 should be treated as ccc=254, not ccc=0,
and for Tibetan you would need to ensure that the vowels below
(ccc=132) came before the vowels above (ccc=130).

Richard.

From unicode at unicode.org Sat Aug 10 02:26:34 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 10 Aug 2019 08:26:34 +0100
Subject: Fonts and Canonical Equivalence
Message-ID: <20190810082634.7e163948@JRWUBU2>

I've spun this question off from the issue of what the USE is to do
when confronted with the NFC canonical equivalent of a string it will
accept when this equivalent does not match its regular expressions when
they are applied to strings of characters rather than canonical
equivalence classes of strings.

What sort of guidance is there on the streams of characters to be
supported by a font with respect to canonical equivalence?

For example, one might think it would suffice for a font to support NFD
strings only, but sometimes it seems that the only canonical equivalent
that needs be supported is not the Unicode-defined canonical form, but
a renderer-defined canonical form.

For example, when a Tai Tham renderer supports subscripted final
consonants, should the font support both the sequences and , or just
the one chosen by the rendering engine? The HarfBuzz SEA engine would
present the font with the former; font designers had seen rendering
failures when Tai Tham text belatedly started being canonically
normalised.

There are similar issues with Tibetan; some fonts do not work properly
if a vowel below (ccc=132) is separated from the base of the consonant
stack by a vowel above (ccc=130).

TUS sees a rendering engine plus a font file (or a set of them) as a
single entity, so I don't think it gives much guidance here. It seems
tolerant of the loss of precision in placement when a Latin character
is rendered as base plus diacritic rather than as a precomposed glyph.
One can also pedantically argue that a font is a data file rather than
a 'process'. (Additionally, a lot of us get confused by the mens rea
aspect of Unicode compliance.)

Richard.

From unicode at unicode.org Sat Aug 10 05:22:01 2019
From: unicode at unicode.org (Andrew West via Unicode)
Date: Sat, 10 Aug 2019 11:22:01 +0100
Subject: Fonts and Canonical Equivalence
In-Reply-To: <20190810082634.7e163948@JRWUBU2>
References: <20190810082634.7e163948@JRWUBU2>
Message-ID:

On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode wrote:
>
> There are similar issues with Tibetan; some fonts do not work properly
> if a vowel below (ccc=132) is separated from the base of the
> consonant stack by a vowel above (ccc=130).

It's not that the fonts don't work, it's that some of the rendering
engines do not apply the OpenType features in the font that support
both sequences of vowels (vowel-above followed by vowel-below, and
vowel-below followed by vowel-above). Just retested on Windows 10 with
a Tibetan font that supports both sequences of vowels, and both
sequences display correctly under HarfBuzz (as expected), but only
vowel-below followed by vowel-above displays correctly when using
built-in Windows rendering.

It is very frustrating that Windows cannot correctly support the
display of Tibetan in normalized form, yet HarfBuzz does not have any
problems. Personally, I think USE is a failed experiment, and I wish
Microsoft would simply adopt HarfBuzz as the default rendering engine.
Andrew

From unicode at unicode.org Sat Aug 10 09:44:04 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 10 Aug 2019 15:44:04 +0100
Subject: Fonts and Canonical Equivalence
In-Reply-To:
References: <20190810082634.7e163948@JRWUBU2>
Message-ID: <20190810154404.0c8ebd69@JRWUBU2>

On Sat, 10 Aug 2019 11:22:01 +0100
Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 08:29, Richard Wordingham via Unicode
> wrote:
> >
> > There are similar issues with Tibetan; some fonts do not work
> > properly if a vowel below (ccc=132) is separated from the base of
> > the consonant stack by a vowel above (ccc=130).
>
> It's not that the fonts don't work, it's that some of the rendering
> engines do not apply the OpenType features in the font that support
> both sequences of vowels (vowel-above followed by vowel-below, and
> vowel-below followed by vowel-above).

My observation was based on a Tibetan font that failed when pre-USE
HarfBuzz added or changed the normalisation for Tibetan.

> Just retested on Windows 10 with
> a Tibetan font that supports both sequences of vowels, and both
> sequences display correctly under HarfBuzz (as expected), but only
> vowel-below followed by vowel-above displays correctly when using
> built-in Windows rendering.

Does vowel above before vowel below yield a dotted circle? According to
the documentation - and the USE may have been improved in undocumented
ways - the blwf feature will not apply across a Tibetan sequence of
vowel above (VBlw) followed by vowel below (VAbv or CMBlw), but the
blws feature will, even if a dotted circle has been added at the
boundary.

> It is very frustrating that Windows cannot correctly support the
> display of Tibetan in normalized form, yet HarfBuzz does not have any
> problems. Personally, I think USE is a failed experiment, and I wish
> Microsoft would simply adopt HarfBuzz as the default rendering engine.

From what I've seen from discussions on HarfBuzz, the USE seems to work
well for non-Indic scripts and Devanagari clones - possibly even for
Bengali clones. It's also a definition that HarfBuzz can fall back on.
The problem is that it doesn't address the quirks of scripts, and its
anti-spoofing measures are draconian and overdone. There may well be an
issue of funding for the USE - for all I know, it may in part be
charity work. If Microsoft gave up on rendering engines, who would
write the rendering specifications for HarfBuzz?

I was wondering how the USE might be modified to handle canonical
equivalence. The simplest way may be to permute the canonical combining
classes, normalise (NFD) according to these classes, and process the
rearranged string. That's roughly what HarfBuzz does. Another technique
would be to derive regular expressions that would match any string
canonically equivalent to a string matching the original regular
expressions and use them instead. (It may be simpler to derive a
regular expression that finds matches from amongst normalised strings -
that's what my canonical equivalence respecting regular expression
does.)

Using a different canonical equivalent to the present one could 'break'
fonts whose sets of properly handled strings were not closed under
canonical equivalence - which is why I asked the original question.

Richard.
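The first approach described above (permute the combining classes, then
normalise according to them) can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the permutation
values and the names CLASS_PERMUTATION, CHAR_OVERRIDES, modified_ccc
and reorder_for_shaping are hypothetical choices made for this sketch,
not HarfBuzz's or the USE's actual tables or API.

```python
import unicodedata

# Illustrative values only (assumptions for this sketch, not a shaper's
# real table).
CLASS_PERMUTATION = {130: 132, 132: 131}  # Tibetan: below vowels sort first
CHAR_OVERRIDES = {0x1A60: 254}            # Tai Tham SAKOT sorts after other marks

def modified_ccc(ch):
    """Class used for reordering: per-character override, else permuted ccc."""
    if ord(ch) in CHAR_OVERRIDES:
        return CHAR_OVERRIDES[ord(ch)]
    ccc = unicodedata.combining(ch)
    return CLASS_PERMUTATION.get(ccc, ccc)

def reorder_for_shaping(text):
    """Decompose (NFD), then stably sort each run of marks by modified class."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if modified_ccc(ch) == 0:
            out.extend(sorted(run, key=modified_ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=modified_ccc))
    return "".join(out)

# NFC form of a Tibetan stack with vowel above (U+0F72) before vowel
# below (U+0F74).
nfc_text = "\u0F56\u0F45\u0F72\u0F74\u0F42"
shaped = reorder_for_shaping(nfc_text)
print(" ".join(f"{ord(c):04X}" for c in shaped))
```

For this input the vowel below is moved ahead of the vowel above, and
normalising the result back to NFC returns the original string, so the
shaper would see a renderer-friendly order without leaving the
canonical equivalence class.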
From unicode at unicode.org Sat Aug 10 10:37:48 2019
From: unicode at unicode.org (Andrew West via Unicode)
Date: Sat, 10 Aug 2019 16:37:48 +0100
Subject: Fonts and Canonical Equivalence
In-Reply-To: <20190810154404.0c8ebd69@JRWUBU2>
References: <20190810082634.7e163948@JRWUBU2> <20190810154404.0c8ebd69@JRWUBU2>
Message-ID:

On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode wrote:
>
> > Just retested on Windows 10 with
> > a Tibetan font that supports both sequences of vowels, and both
> > sequences display correctly under HarfBuzz (as expected), but only
> > vowel-below followed by vowel-above displays correctly when using
> > built-in Windows rendering.
>
> Does vowel above before vowel below yield a dotted circle?

Yes. Attached are screenshots for two real world examples, one which is
logically spelled as i + u, and one as u + i:

1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu]
"twenty"

2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
བཅུ་གཅིག [bcu gcig] "eleven"

Andrew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Harfbuzz.png
Type: image/png
Size: 18845 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Notepad.png
Type: image/png
Size: 14915 bytes
Desc: not available

From unicode at unicode.org Sat Aug 10 12:50:00 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 10 Aug 2019 18:50:00 +0100
Subject: Fonts and Canonical Equivalence
In-Reply-To:
References: <20190810082634.7e163948@JRWUBU2> <20190810154404.0c8ebd69@JRWUBU2>
Message-ID: <20190810185000.4a7fd4aa@JRWUBU2>

On Sat, 10 Aug 2019 16:37:48 +0100
Andrew West via Unicode wrote:

> On Sat, 10 Aug 2019 at 15:46, Richard Wordingham via Unicode
> wrote:

> > Does vowel above before vowel below yield a dotted circle?
>
> Yes. Attached are screenshots for two real world examples, one which
> is logically spelled as i + u, and one as u + i:
>
> 1. ཉིུ <0F49 0F72 0F74> [nyiu] as a contraction for ཉི་ཤུ [nyi shu]
> "twenty"
>
> 2. བཅིུག <0F56 0F45 0F74 0F72 0F42> [bcuig] as a contraction for
> བཅུ་གཅིག [bcu gcig] "eleven"

Thanks for the clarification. I must have done something wrong when I
tried to break Tibetan rendering by an above-below sequence - unless MS
Edge denormalises Tibetan text so that it will render.

However, we may be able to redress the balance between the renderers by
inserting CGJ between the vowels to preserve the order when the strings
are copied:

nyiu ཉི͏ུ 0F49 0F72 034F 0F74
bcuig བཅུ͏ིག 0F56 0F45 0F74 034F 0F72 0F42

On my machine they display without dotted circles in Claws-Mail and
LibreOffice, but I may be using too old a version of HarfBuzz. However,
the ligaturing is missing in _nyiu_ with CGJ. LibreOffice at least is
using Tibetan Machine Uni.

However, in a snapshot of HarfBuzz I pulled in the past few days, both
were rendered with dotted circles. This issue is apparently being
worked on - (https://github.com/harfbuzz/harfbuzz/issues/483).

The forms without CGJ render fine in the two applications.

Richard.

From unicode at unicode.org Sat Aug 10 23:07:05 2019
From: unicode at unicode.org (Robert Wheelock via Unicode)
Date: Sun, 11 Aug 2019 00:07:05 -0400
Subject: PUA (BMP) planned characters HTML tables
Message-ID:

Hello!

I remember a website that has tables for certain PUA precomposed
accented characters that aren't yet in Unicode (things like:
Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital
H-underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...).
Where was it at?! I still want to get the information. Thank You!

Robert Lloyd Wheelock
URL: From unicode at unicode.org Sun Aug 11 00:27:48 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 11 Aug 2019 06:27:48 +0100 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: References: Message-ID: <20190811062748.3a1af5ce@JRWUBU2> On Sun, 11 Aug 2019 00:07:05 -0400 Robert Wheelock via Unicode wrote: > I remember that a website that has tables for certain PUA precomposed > accented characters that aren?t yet in Unicode (thing like: > Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital > H-underbar, acute accented Cyrillic vowels, Cyrillic > ER/er-caron, ...). Where was it at?! I still want to get the > information. Thank You! You may mean https://www.eki.ee/letter. Once there, you'll want to make a query by Unicode range, e.g. e000-f8ff. It doesn't seem to refer to the relevant agreement. You could start hunting for agreements at https://scripts.sil.org/cms/scripts/page.php?item_id=VendorUseOfPUA Most of the characters you mention are scheduled to be assigned their own codepoint on the Greek kalends. They are precluded by policy because they would need to be composition exclusions to avoid making text in NFC cease to be in NFC. I first thought of the SIL PUA at https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=PUA_home , but they knew better than to include most of them. Richard. From unicode at unicode.org Sun Aug 11 03:57:30 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 11 Aug 2019 08:57:30 +0000 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: References: Message-ID: <8a320a0a-48a6-7f8b-a970-1a4cab25ac82@gmail.com> On 2019-08-11 4:07 AM, Robert Wheelock via Unicode wrote: > Hello! > I remember that a website that has tables for certain PUA precomposed > accented characters that aren?t yet in Unicode (thing like: Marshallese > M/m-cedilla, H/h-acute, capital T-dieresis, capital H-underbar, acute > accented Cyrillic vowels, Cyrillic ER/er-caron, ...). 
Where was it at?! I > still want to get the information. Thank You! > > It sounds familiar but I can't place it. I tried the SIL pages first, as did Richard Wordingham apparently. https://blogfonts.com/dehuti.font This font has material in the PUA including: Marshallese glyphs with cedillas: L (E382 & E394), M (E3A6 & E3BB), N (E3CE & E3DE), O (E429 & E465) These appear to be PUA characters which the font developer has mapped in addition to the SIL PUA mappings. From unicode at unicode.org Sun Aug 11 12:26:02 2019 From: unicode at unicode.org (via Unicode) Date: Sun, 11 Aug 2019 11:26:02 -0600 Subject: PUA (BMP) planned characters HTML tables Message-ID: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> Robert Wheelock wrote: > I remember that a website that has tables for certain PUA precomposed > accented characters that aren't yet in Unicode (thing like: > Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H- > underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sun Aug 11 20:21:42 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 12 Aug 2019 01:21:42 +0000 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> Message-ID: <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote: > If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. 
This is one of the longest-standing principles Unicode has. Good point. There was a time when populating the PUA with precomposed glyphs was necessary for printing or display, but that time has passed. Hopefully anyone seeking charts is transcoding older data into proper Unicode. This can be illustrated with the Marshallese combos mentioned earlier. PUA: ???????? Standard: ĻļM̧m̧ŅņO̧o̧ Well, that didn't work out as well as expected. But the standard Unicode is supported (more or less) by some of the core fonts installed here. Nothing installed here displays anything useful for the PUA characters. A decent OpenType font designed with Marshallese in mind should work just fine with the combiners. The fact is that the standard characters will survive and can be universally exchanged. And there's plenty of web page charts showing the standard characters. From unicode at unicode.org Mon Aug 12 02:26:07 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 12 Aug 2019 08:26:07 +0100 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> Message-ID: <20190812082607.0eaf408d@JRWUBU2> On Mon, 12 Aug 2019 01:21:42 +0000 James Kass via Unicode wrote: > There was a time when populating the PUA with precomposed glyphs was > necessary for printing or display, but that time has passed. There is still the issue that in pure X one can't put sequences of characters on a key; if the application doesn't invoke an input method one is stuck. Useful 20-year old proprietary code may be totally unable to use modern font capabilities. Don't forget the Cobol Y10k joke. On Ubuntu at least, there was a period when Emacs couldn't access X-based input methods from an English locale. 
The work-around: use a Japanese locale plus the vanilla lack of internationalisation in the interface, or Emacs's very convenient alternative keyboard capability for text input as opposed to commands. The bug turned out to be in the definition of the locales, i.e. in privileged data beyond the purview of Emacs. As to the need for the PUA, writing fonts to cope with Tai Tham rendering engines is not easy, and it's no surprise that the PUA is used online for a newspaper that uses the Tai Tham script. The USE is too user-hostile for it to have helped if it had been available earlier. (It just ignored the regular expression published in 2007. It's in L2/07-007R in the UTC document register, ISO/IEC JTC1/SC2/WG2/N3207R in ISO land.) Indeed, perhaps I should be researching the PUA encoding for Tai Tham. (My Tai Tham font Da Lekh started as proof of principle, for there is already an unpleasant amount of glyph sequence changing, some style-dependent. I couldn't see how to get rendering engine support even when it might be added. I was pleasantly surprised at how far from impossible Tai Tham layout was until the USE came along and made everything harder. I now have to work out which glyph instances have already been Indicly rearranged when I repair the clustering.) Oh, and I seem to need some PUA codepoints for vowels that get stranded when line-breaks occur between the columns of an akshara. The proposals show this phenomenon in old(?) Pali text. Or is there any chance of getting them encoded? Richard. 
From unicode at unicode.org Mon Aug 12 03:30:35 2019 From: unicode at unicode.org (Andrew West via Unicode) Date: Mon, 12 Aug 2019 09:30:35 +0100 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> Message-ID: On Mon, 12 Aug 2019 at 02:27, James Kass via Unicode wrote: > > On 2019-08-11 5:26 PM, [ Doug Ewell ] via Unicode wrote: > > If you are thinking of these as potential future additions to the standard, keep in mind that accented letters that can already be represented by a combination of letter + accent will not ever be encoded. This is one of the longest-standing principles Unicode has. People seem to be ignoring the fact that Marshallese and Latvian both use L and N with cedilla, but with completely different glyph shapes: > In January 2013, the Unicode Technical Committee discussed issues for the representation of > Marshallese orthography. In particular, Marshallese uses the Latin script and requires the letters l, > m, n, and o with cedilla. Latvian orthography uses the Latin script and requires the letters g, k, l, n, > and r with comma below. For Marshallese, it is unacceptable to display cedillas as commas below. > Conversely, for Latvian, it is unacceptable to display commas below as cedillas. However, as fonts have been following Latvian practice for these letters (cedilla is displayed as a comma below) since before Unicode, Marshallese users cannot get their desired outcome using standard Unicode combining diacritical marks unless they apply a font specially designed for Marshallese -- which you can never guarantee if you are writing an email or posting on twitter, etc. 
This issue was discussed at WG2 in 2013 (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), when there was a recommendation to encode precomposed letters L and N with cedilla *with no decomposition*, but that solution does not seem to have been taken up by the UTC. Andrew From unicode at unicode.org Mon Aug 12 13:44:34 2019 From: unicode at unicode.org (via Unicode) Date: Mon, 12 Aug 2019 20:44:34 +0200 Subject: Unihan kJapaneseKun Error Report Message-ID: I have uploaded to the dedicated GitHub repository a draft version of a document called "Unihan kJapaneseKun Error Report", available in both Markdown and HTML format. It is a list of corrections for issues found in the kJapaneseKun field of the Unihan_Readings.txt data file, which I intend to submit to the UTC around next month. There are some issues that I flagged with question marks where I'm not sure whether the correction is appropriate, and I may also have missed some issues, since there are *11291* Unihan characters having a kJapaneseKun tag property. If you wish to help, or provide criticisms, comments or suggestions, please feel free to chime in, either by opening an issue on the GitHub repository, or replying to this mailing list. TIA. Best regards, --Michel MARIANI -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 12 13:48:53 2019 From: unicode at unicode.org (via Unicode) Date: Mon, 12 Aug 2019 20:48:53 +0200 Subject: Unihan kJapaneseKun Error Report [Follow-up: Fixed wrong URL] Message-ID: <95FECB3F-7398-419C-BD69-80F3BF451067@ouvaton.org> Follow-up: [Fixed wrong URL] I have uploaded to the dedicated GitHub repository a draft version of a document called "Unihan kJapaneseKun Error Report", available in both Markdown and HTML format. It is a list of corrections for issues found in the kJapaneseKun field of the Unihan_Readings.txt data file, which I intend to submit to the UTC around next month. There are some issues that I flagged with question marks where I'm not sure whether the correction is appropriate, and I may also have missed some issues, since there are *11291* Unihan characters having a kJapaneseKun tag property. If you wish to help, or provide criticisms, comments or suggestions, please feel free to chime in, either by opening an issue on the GitHub repository, or replying to this mailing list. TIA. Best regards, --Michel MARIANI -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Aug 14 04:05:02 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Aug 2019 09:05:02 +0000 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> Message-ID: <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> On 2019-08-12 8:30 AM, Andrew West wrote: > This issue was discussed at WG2 in 2013 > (https://www.unicode.org/L2/L2013/13128-latvian-marshal-adhoc.pdf), > when there was a recommendation to encode precomposed letters L and N > with cedilla *with no decomposition*, but that solution does not seem > to have been taken up by the UTC. Group One dots their lowercase "i" letters with little flowers and Group Two dots theirs with little hearts. 
Group Two considers flowers unacceptable and Group One rejects hearts. Because of legacy character sets there's a precomposed character encoded called "LATIN LOWER CASE I WITH HEART", but it was misnamed and is normally drawn with a flower instead. Group Two tries to encode "LATIN LOWER CASE I" plus "COMBINING HEART" to get the thing to display properly. But because there's a decomposition involved, the font engine substitutes the glyph mapped to "LATIN LOWER CASE I WITH HEART" in the display for the string "LATIN LOWER CASE I" plus "COMBINING HEART". This thwarts Group Two because they still get the flower. The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's only in there because of legacy. Its presence guarantees round-tripping with legacy data but it isn't needed for modern data or display. Urge Groups One and Two to encode their data with the desired combiner and educate font engine developers about the deprecation. As the rendering engines get updated, the system substitution of the wrongly named precomposed glyph will go away. This presumes that the premise of user communities feeling strongly about the unacceptable aspect of the variants is valid. Since it has been reported and nothing seems to be happening, perhaps the casual users aren't terribly concerned. It's also possible that the various user communities have already set up their systems to handle things acceptably by installing appropriate fonts. 
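[Editorial illustration] The substitution described in the message above rests on canonical equivalence: a base letter plus a combining mark is equivalent to the precomposed character whenever one exists. A minimal sketch of that for the Marshallese letters, using Python's standard `unicodedata` module; the helper name `nfc_report` is hypothetical and the output depends on the Unicode data bundled with the interpreter.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA


def nfc_report(base: str) -> tuple[str, int]:
    """Return the NFC form of base + COMBINING CEDILLA and its length in code points."""
    composed = unicodedata.normalize("NFC", base + CEDILLA)
    return composed, len(composed)


# The Marshallese letters discussed in the thread: L, M, N and O with cedilla.
for base in "LlMmNnOo":
    composed, length = nfc_report(base)
    names = " + ".join(unicodedata.name(ch) for ch in composed)
    print(f"{base} + U+0327 -> {length} code point(s): {names}")
```

L and N with cedilla have precomposed characters, so NFC folds the sequence into them; M and O do not, so those sequences stay as two code points. That is why only L and N are affected by the Latvian-style glyphs of the precomposed characters.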
From unicode at unicode.org Wed Aug 14 14:50:46 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 14 Aug 2019 20:50:46 +0100 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> Message-ID: <20190814205046.42f5407c@JRWUBU2> On Wed, 14 Aug 2019 09:05:02 +0000 James Kass via Unicode wrote: > The solution is to deprecate "LATIN LOWER CASE I WITH HEART". It's > only in there because of legacy. Its presence guarantees > round-tripping with legacy data but it isn't needed for modern data > or display. Urge Groups One and Two to encode their data with the > desired combiner and educate font engine developers about the > deprecation. As the rendering engines get updated, the system > substitution of the wrongly named precomposed glyph will go away. I think you'd also have to change the reference glyph of LATIN LOWER CASE I WITH HEART to show a heart. That's valid because the UCD trumps the code charts, and no Unicode-compliant process may deliberately render differently from LATIN LOWER CASE I WITH HEART. Richard. From unicode at unicode.org Wed Aug 14 18:32:37 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 14 Aug 2019 23:32:37 +0000 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <20190814205046.42f5407c@JRWUBU2> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> <20190814205046.42f5407c@JRWUBU2> Message-ID: On 2019-08-14 7:50 PM, Richard Wordingham via Unicode wrote: > I think you'd also have to change the reference glyph of LATIN LOWER > CASE I WITH HEART to show a heart. 
That's valid because the UCD trumps > the code charts, and no Unicode-compliant process may deliberately > render differently from LATIN LOWER CASE I WITH > HEART. U+0149 has a compatibility decomposition. It has been deprecated and is not rendered identically on my system. 'n ŉ ( ʼn ) If a character gets deprecated, can its decomposition type be changed from canonical to compatibility? From unicode at unicode.org Wed Aug 14 18:42:21 2019 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 14 Aug 2019 16:42:21 -0700 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> <20190814205046.42f5407c@JRWUBU2> Message-ID: On 8/14/2019 4:32 PM, James Kass via Unicode wrote: > If a character gets deprecated, can its decomposition type be changed > from canonical to compatibility? Simple answer: No. --Ken From unicode at unicode.org Wed Aug 14 19:25:59 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 14 Aug 2019 17:25:59 -0700 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> Message-ID: <114e713a-953e-1ed5-53fe-5734a537a994@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Aug 14 21:49:34 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Thu, 15 Aug 2019 02:49:34 +0000 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <114e713a-953e-1ed5-53fe-5734a537a994@ix.netcom.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> <114e713a-953e-1ed5-53fe-5734a537a994@ix.netcom.com> Message-ID: <5a732269-fa8c-d8a4-602b-5123cfd59886@gmail.com> On 2019-08-15 12:25 AM, Asmus Freytag via Unicode wrote: > Empirically, it has been observed that some distinctions that are claimed by > users, standards developers or implementers were de-facto not honored by type > developers (and users selecting fonts) as long as the native text doesn't > contain minimal pairs. Quickly checked a couple of older on-line PDFs and both used the comma below unabashedly. Quoting from this page (which appears to be more modern than the PDFs), http://www.trussel2.com/MOD/peloktxt.htm "Ij keememej ??k w?t ke ikar uwe ipp?n Jema kab ruo ???aan ilo juon booj jidikdik eo ro?oul ruo ne aitokan im jiljino ne depakpakin. Ilo iien in eor jiljilimjuon ak rualit?k a? ii??Ij jab kanooj ememej. Wa in ???kaj kar ..." It seems that users are happy to employ a dot below in lieu of either a comma or cedilla. This newer web page is from a book published in 1978. There's a scan of the original book cover. Although the book title is all caps hand printing it appears that commas were used. The Marshallese orthography which uses commas/cedillas is fairly recent, replacing an older scheme devised by missionaries. Perhaps the actual users have already resolved this dilemma by simply using dots below. 
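[Editorial illustration] The canonical versus compatibility distinction raised earlier in this thread for U+0149 can be inspected directly in the character database. A minimal sketch using Python's standard `unicodedata` module; the helper name `decomposition_kind` is hypothetical, and the results reflect whatever Unicode version ships with the interpreter.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character's UCD decomposition mapping as
    'canonical', 'compatibility', or 'none'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a formatting tag such as "<compat>".
    return "compatibility" if mapping.startswith("<") else "canonical"


# U+0149 (deprecated n preceded by apostrophe), U+013B (L with cedilla),
# U+2163 (Roman numeral four).
for ch in ("\u0149", "\u013b", "\u2163"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"{decomposition_kind(ch)} [{unicodedata.decomposition(ch)}]")

# NFC/NFD only apply canonical mappings, so U+0149 survives NFC unchanged;
# only the compatibility forms (NFKC/NFKD) replace it.
print(unicodedata.normalize("NFC", "\u0149") == "\u0149")
print(unicodedata.normalize("NFKC", "\u0149") == "\u02bcn")
```

Because canonical equivalents must be treated as the same text while compatibility equivalents need not be, a renderer is free to draw U+0149 differently from its two-character decomposition, but not U+013B differently from L plus combining cedilla.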
From unicode at unicode.org Thu Aug 15 05:39:57 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 15 Aug 2019 11:39:57 +0100 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> <20190814205046.42f5407c@JRWUBU2> Message-ID: <20190815113957.49387831@JRWUBU2> On Wed, 14 Aug 2019 23:32:37 +0000 James Kass via Unicode wrote: > U+0149 has a compatibility decomposition.? It has been deprecated and > is not rendered identically on my system. > 'n ? > ( ?n ) Compatibility decompositions are quite a mix, but are generally expected to render differently. If they were expected to render the same, they would normally be canonical decompositions. U+0149 and its decomposition naturally render very differently with a monospaced font. The same goes for the Roman numerals that the Far East gave us. Richard. From unicode at unicode.org Thu Aug 15 14:13:51 2019 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Thu, 15 Aug 2019 12:13:51 -0700 Subject: PUA (BMP) planned characters HTML tables In-Reply-To: <5a732269-fa8c-d8a4-602b-5123cfd59886@gmail.com> References: <000001d55069$d9dd12f0$8d9738d0$@ewellic.org> <8615775e-74ed-1a2f-2b6d-caec82788b03@gmail.com> <8a141497-3b52-9671-ec4d-e0e5cd88c0ca@gmail.com> <114e713a-953e-1ed5-53fe-5734a537a994@ix.netcom.com> <5a732269-fa8c-d8a4-602b-5123cfd59886@gmail.com> Message-ID: <9a06f6c3-0c90-bd41-27b4-2279779e38ba@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 19 21:27:13 2019 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 20 Aug 2019 04:27:13 +0200 Subject: Re: Acute/apostrophe diacritic in Võro for palatalized consonants In-Reply-To: References: Message-ID: I must add that the current version of Wikipedia in Võro seems to have given up encoding this combining mark entirely (no acute, no apostrophe), probably because of the lack of a proper encoding in Unicode and the difficulty of harmonizing its orthography. It may be a good argument for adding the missing combining palatal accent and restoring the correct expected typography. I'm also curious about other existing styles (notably blackletter, aka "Gothic", or ISO 15924 "Latf", in historic texts: was that diacritic ever handwritten, or typeset in printed books, and how?) On Tue, 20 Aug 2019 at 04:17, Philippe Verdy wrote: > > I'm curious about this statement in English Wikipedia about Võro: > >> Palatalization of consonants is marked with an acute accent (´) or apostrophe ('). In proper typography and in handwriting, the palatalisation mark does not extend above the cap height (except uppercase letters Ń, Ŕ, Ś, V́ etc.), and it is written above the letter if the letter has no ascender (ǵ, ḿ, ń, ṕ, ŕ, ś, v́ etc.) but written to the right of it otherwise (b’, d’, f’, h’, k’, l’, t’). In computing, it is not usually possible to enter these character combinations or to make them look esthetically pleasing with most common fonts, so the apostrophe is generally placed after the letter in all cases. This convention is followed in this article as well. > > > The problem is the encoding of this acute/apostrophe which changes depending on lettercase or even depending on letterform for specific styles (i.e. when there are ascenders or not for lowercase letters). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Aug 20 17:29:27 2019 From: unicode at unicode.org (Vinodh Rajan via Unicode) Date: Wed, 21 Aug 2019 00:29:27 +0200 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar Message-ID: Hi, I was wondering how to encode the Sanskrit medial sequences -vy - and -ry- in Myanmar as they occur in Sanskrit syllables such as dvya and trya regularly. I tried to encode the above as ??? (Cons U+103D U+103B) and ??? (Cons U+103C U+103B) but all rendering engines insert a dotted circle before the medial YA. Interestingly, reversing the order of the medials seem to work: ??? ??? Am I supposed to invert the order of the medials to render them appropriately? Is this the correct way to represent them in Unicode? Cheers, Vinodh Rajan -- http://www.virtualvinodh.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 20 17:43:43 2019 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Tue, 20 Aug 2019 22:43:43 +0000 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: References: Message-ID: Hi Vinodh, The order of medials in Myanmar clusters is constrained by UTN #11. So yes, you do need to follow the preferred order for Myanmar even if the sequence does not match phonetic order. Is it the case the representation is ??? ??? matches the visual output you were expecting? Cheers, Andrew From: Unicode On Behalf Of Vinodh Rajan via Unicode Sent: 20 August 2019 15:29 To: Unicode Mailing List Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar Hi, I was wondering how to encode the Sanskrit medial sequences -vy - and -ry- in Myanmar as they occur in Sanskrit syllables such as dvya and trya regularly. I tried to encode the above as ??? (Cons U+103D U+103B) and ??? (Cons U+103C U+103B) but all rendering engines insert a dotted circle before the medial YA. Interestingly, reversing the order of the medials seem to work: ??? 
??? Am I supposed to invert the order of the medials to render them appropriately? Is this the correct way to represent them in Unicode? Cheers, Vinodh Rajan -- http://www.virtualvinodh.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 20 18:33:00 2019 From: unicode at unicode.org (Vinodh Rajan via Unicode) Date: Wed, 21 Aug 2019 01:33:00 +0200 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: References: Message-ID: Hi Andrew, Thanks for your quick reply. I did have a quick look at the Sanskrit section but the examples given didn't cover these sequences. I now see that the order of diacritics is present in the beginning. //Is it the case the representation is ??? ??? matches the visual output you were expecting?// Yup. The exactly match what I was expecting. Cheers, Vinodh Rajan On Wed, 21 Aug 2019, 00:43 Andrew Glass, wrote: > Hi Vinodh, > > > > The order of medials in Myanmar clusters is constrained by UTN #11 > . So yes, you do need to follow the > preferred order for Myanmar even if the sequence does not match phonetic > order. > > > > > > > Cheers, > > > > Andrew > > > > *From:* Unicode *On Behalf Of *Vinodh Rajan > via Unicode > *Sent:* 20 August 2019 15:29 > *To:* Unicode Mailing List > *Subject:* Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar > > > > Hi, > > > > I was wondering how to encode the Sanskrit medial sequences -vy - and -ry- > in Myanmar as they occur in Sanskrit syllables such as dvya and trya > regularly. > > > > I tried to encode the above as ??? (Cons U+103D U+103B) and ??? (Cons > U+103C U+103B) but all rendering engines insert a dotted circle before the > medial YA. > > > > Interestingly, reversing the order of the medials seem to work: ??? ??? > > > > Am I supposed to invert the order of the medials to render them > appropriately? Is this the correct way to represent them in Unicode? 
> > > > Cheers, > > > > Vinodh Rajan > > > > > > > -- > > http://www.virtualvinodh.com > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 20 21:08:37 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 21 Aug 2019 03:08:37 +0100 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: References: Message-ID: <20190821030837.07b2bdbe@JRWUBU2> On Tue, 20 Aug 2019 22:43:43 +0000 Andrew Glass via Unicode wrote: > The order of medials in Myanmar clusters is constrained by UTN > #11. So yes, you do need to > follow the preferred order for Myanmar even if the sequence does not > match phonetic order. Are we are allowed to write Llangollen as the definition of the Unicode Collation Algorithm implies we should, with an invisible CGJ between the 'n' and the 'g', so that it will collate correctly in Welsh? That CGJ is necessary so that it will collate *after* Llanberis. (The problem is that the letter 'ng' comes before the letter 'n'.) Welsh is a European language, so I believe it has the right to strive to have its words collated correctly. But perhaps I'm wrong. Richard. From unicode at unicode.org Tue Aug 20 21:40:09 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 21 Aug 2019 02:40:09 +0000 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: <20190821030837.07b2bdbe@JRWUBU2> References: <20190821030837.07b2bdbe@JRWUBU2> Message-ID: <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote: > Are we are allowed to write Llangollen as the definition of the > Unicode Collation Algorithm implies we should, with an invisible CGJ > between the 'n' and the 'g', so that it will collate correctly in > Welsh? That CGJ is necessary so that it will collate*after* > Llanberis. (The problem is that the letter 'ng' comes before the letter > 'n'.) 
So that it won't collate correctly in anything other than Welsh? Isn't it better to use an application which enables Welsh collation? Here's how BabelPad handles Welsh: http://www.babelstone.co.uk/Software/BabelPad_Sort_Lines.html From unicode at unicode.org Tue Aug 20 21:47:28 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 21 Aug 2019 02:47:28 +0000 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> References: <20190821030837.07b2bdbe@JRWUBU2> <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> Message-ID: <5cb40765-0646-8bc9-e2aa-2f0da0e4d67f@gmail.com> On 2019-08-21 2:40 AM, James Kass wrote: > Are we allowed to write Llangollen as the definition of the > Unicode Collation Algorithm implies we should, with an invisible CGJ > between the 'n' and the 'g', so that it will collate correctly in > Welsh? That CGJ is necessary so that it will collate *after* > Llanberis. (The problem is that the letter 'ng' comes before the letter > 'n'.) (This is off-list). If 'ng' comes before 'n', shouldn't Llangollen collate *before* Llanberis in a Welsh listing? From unicode at unicode.org Tue Aug 20 23:03:19 2019 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 21 Aug 2019 04:03:19 +0000 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: <5cb40765-0646-8bc9-e2aa-2f0da0e4d67f@gmail.com> References: <20190821030837.07b2bdbe@JRWUBU2> <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> <5cb40765-0646-8bc9-e2aa-2f0da0e4d67f@gmail.com> Message-ID: <2a051603-05ab-36a1-8b1f-6ad2086fb26f@gmail.com> Well, it was intended to be off list. It seems that this has been mentioned before, for example; http://www.unicode.org/mail-arch/unicode-ml/y2011-m07/0029.html Maybe it's time for a new thread/subject title? 
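[Editorial illustration] The Welsh collation question in the messages above can be made concrete with a toy sort key. This is only a sketch under stated assumptions, not the Unicode Collation Algorithm: it uses a single level, a hand-written Welsh alphabet table, greedy digraph matching, and a hypothetical helper `welsh_key`. The one behaviour borrowed from the UCA is that CGJ (U+034F) prevents a digraph (contraction) from matching and then carries no weight itself.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Welsh alphabet order; 'ng' sorts between 'g' and 'h', well before 'n'.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h", "i",
    "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s", "t", "th",
    "u", "w", "y",
]
RANK = {letter: index for index, letter in enumerate(WELSH_ALPHABET)}
DIGRAPHS = {letter for letter in WELSH_ALPHABET if len(letter) == 2}


def welsh_key(word: str) -> list[int]:
    """Toy one-level sort key: one rank per Welsh letter, digraphs matched
    greedily, CGJ blocking a digraph and otherwise ignored."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1  # zero weight: contributes nothing to the key
            continue
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            key.append(RANK[pair])
            i += 2
        else:
            # Letters outside the table sort after everything else.
            key.append(RANK.get(text[i], len(WELSH_ALPHABET)))
            i += 1
    return key


plain = "Llangollen"              # tokenised as ll, a, ng, o, ll, e, n
joined = "Llan" + CGJ + "gollen"  # tokenised as ll, a, n, g, o, ll, e, n

print(sorted([plain, "Llanberis"], key=welsh_key))
print([ascii(w) for w in sorted([joined, "Llanberis"], key=welsh_key)])
```

Without the CGJ the name is read as containing the letter 'ng' and sorts before Llanberis; with it, the name is read as 'n' followed by 'g' and sorts after. A real implementation would rely on a CLDR Welsh tailoring rather than a table like this one.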
From unicode at unicode.org Wed Aug 21 02:29:21 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 21 Aug 2019 08:29:21 +0100 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> References: <20190821030837.07b2bdbe@JRWUBU2> <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> Message-ID: <20190821082921.31977692@JRWUBU2> On Wed, 21 Aug 2019 02:40:09 +0000 James Kass via Unicode wrote: > On 2019-08-21 2:08 AM, Richard Wordingham via Unicode wrote: > > Are we are allowed to write Llangollen as the definition of the > > Unicode Collation Algorithm implies we should, with an invisible CGJ > > between the 'n' and the 'g', so that it will collate correctly in > > Welsh? That CGJ is necessary so that it will collate*after* > > Llanberis. (The problem is that the letter 'ng' comes before the > > letter 'n'.) > So that it won't collate correctly in anything other than Welsh? CGJ has zero weight in most, if not all standard UCA or CLDR-like collations. Richard. From unicode at unicode.org Wed Aug 21 02:52:35 2019 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 21 Aug 2019 08:52:35 +0100 Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar In-Reply-To: <5cb40765-0646-8bc9-e2aa-2f0da0e4d67f@gmail.com> References: <20190821030837.07b2bdbe@JRWUBU2> <8c6e578c-0974-7dc4-8c89-89aaa5b6d486@gmail.com> <5cb40765-0646-8bc9-e2aa-2f0da0e4d67f@gmail.com> Message-ID: <20190821085235.4ffe5f3a@JRWUBU2> On Wed, 21 Aug 2019 02:47:28 +0000 James Kass via Unicode wrote: > > Are we are allowed to write Llangollen as the definition of the > > Unicode Collation Algorithm implies we should, with an invisible CGJ > > between the 'n' and the 'g', so that it will collate correctly in > > Welsh?? That CGJ is necessary so that it will collate*after* > > Llanberis. (The problem is that the letter 'ng' comes before the > > letter 'n'.) 
> If 'ng' comes before 'n', shouldn't Llangollen > collate *before* Llanberis in a Welsh listing? I'm not quite sure of the question. There are two possible answers: (a) I used the English spelling because, so far as I am aware, the US keyboard lacks CGJ. (I'm using a US keyboard layout so as to get the keycaps engraved with Thai.) (b) No, because 'Llangollen' doesn't contain the letter 'ng'. It's spelt 'll', 'a', 'n', 'g', 'o', 'll', 'e', 'n' (8 letters) not 'll', 'a', 'ng', 'o', 'll', 'e', 'n' (7 letters). There are a very few look-alikes, where one is spelt with 'ng' and the other with 'n', 'g'. Richard. From unicode at unicode.org Wed Aug 21 12:16:51 2019 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Wed, 21 Aug 2019 10:16:51 -0700 Subject: PUA (BMP) planned characters HTML tables Message-ID: <20190821101651.665a7a7059d7ee80bb4d670165c8327d.4cd9be3c0a.wbe@email03.godaddy.com> On August 11, I replied to Robert Wheelock: >> I remember that a website that has tables for certain PUA precomposed >> accented characters that aren?t yet in Unicode (thing like: >> Marshallese M/m-cedilla, H/h-acute, capital T-dieresis, capital H- >> underbar, acute accented Cyrillic vowels, Cyrillic ER/er-caron, ...). > > If you are thinking of these as potential future additions to the > standard, keep in mind that accented letters that can already be > represented by a combination of letter + accent will not ever be > encoded. This is one of the longest-standing principles Unicode has. I missed the possible significance of the Latvian comma below vs. Marshallese cedilla, which captured most of the ensuing discussion and morphed into a discussion about different user communities and group identity. I'd like to restate, since I think the point may have been lost, that for the OTHER characters Robert mentioned: > H/h-acute, capital T-dieresis, capital H-underbar, acute accented > Cyrillic vowels, Cyrillic ER/er-caron, ... 
there does not appear to be any conflicting usage between different
user communities, and no particular difficulty in rendering or
otherwise processing these as combining sequences, using up-to-date
fonts and rendering engines.

I suppose Philippe's example of Võro might factor into whether
different groups prefer different appearances for h́, but otherwise
these user-perceived characters seem to be non-controversial.

So to reiterate, these characters appear vanishingly unlikely to be
atomically encoded, "yet" or ever, for good reason.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From unicode at unicode.org Wed Aug 21 14:59:33 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 21 Aug 2019 20:59:33 +0100
Subject: Rendering Sanskrit Medial Sequences -vy- and -ry- in Myanmar
In-Reply-To:
References:
Message-ID: <20190821205933.2687c274@JRWUBU2>

On Tue, 20 Aug 2019 22:43:43 +0000
Andrew Glass via Unicode wrote:

> The order of medials in Myanmar clusters is constrained by UTN
> #11. So yes, you do need to
> follow the preferred order for Myanmar even if the sequence does not
> match phonetic order.

If the spelling matters, there may be a partial solution. It depends
on being able to have . This combination is intended for use in Pali
and Sanskrit when the shape of Burmese is unacceptable - see
http://www.unicode.org/L2/L2006/06077-n3043r-myanmar.pdf for the
justification for adding U+103D as an *alternative* to <1039, 101D>.

Now, UTN#11, which is explicitly not endorsed by the UTC, does not
allow the sequence and the Padauk font does not support it. I can't
find anything in the Myanmar rendering description
https://docs.microsoft.com/en-gb/typography/script-development/myanmar
that indicates that that renderer might reject the combination.
Consequently, you may be able to find an ordinary font that will render
the sequence to give you an acceptable rendering of 'dvya' that will be
analysed as having the right spelling.
I can't see any way of helping you to get a renderable unambiguous
Myanmar script spelling for 'trya' - unless you're prepared to supply
alternative renderers. (I presume Zawgyi fonts are not an acceptable
alternative.)

Richard.

From unicode at unicode.org Thu Aug 22 08:45:30 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Thu, 22 Aug 2019 16:45:30 +0300
Subject: The native name of Tai Viet script and language(s)
Message-ID: <83ef1dco05.fsf@gnu.org>

Could someone "in the know" please help me make the Tai Viet script
documentation in Emacs accurate? The current short description we have
is in the file lisp/language/tai-viet.el in the Emacs source tree. You
can see it here:

http://git.savannah.gnu.org/cgit/emacs.git/tree/lisp/language/tai-viet.el

My concern is with the text under "sample-text" (line 40) and in the
documentation string following that (starting on line 48), which
states the name of the script and the language expressed with Tai Viet
characters. However, that text is from long ago, before Unicode had a
Tai Viet block, so it still uses at least one PUA character, which I
think is incorrect.

In addition, I didn't find any place where I could copy/paste the
current accurate name of the script and at least one of the languages
that use that script.

Could someone please help me set this text straight? Bonus points for
also telling how to say "hello" (or any similar greeting) in one of
the Tai Viet languages, so that we could add that to the etc/HELLO
file. (I think the sample-text attempts to include such a greeting,
but again, I'm not sure it is correct.)

Thanks in advance.
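
[Editorial sketch, not part of the original message.] One way to check
whether a sample string still relies on private-use code points, as the
message above suspects, is to classify each character by its Unicode
general category and block. The following is a minimal Python sketch
under stated assumptions: the function name classify and the two-character
sample are hypothetical, the Tai Viet block range U+AA80..U+AADF is taken
from the Unicode block allocation, and U+F000 is used only because it is
the start of the legacy range quoted later in this thread. It does not
read the actual contents of tai-viet.el.

```python
import unicodedata

# Tai Viet block (U+AA80..U+AADF), added in Unicode 5.2.
TAI_VIET_BLOCK = range(0xAA80, 0xAAE0)


def classify(text):
    """Sort the code points of `text` into private-use, Tai Viet and other."""
    report = {"private_use": [], "tai_viet": [], "other": []}
    for ch in text:
        cp = ord(ch)
        label = f"U+{cp:04X}"
        if unicodedata.category(ch) == "Co":
            # General category Co marks private-use code points.
            report["private_use"].append(label)
        elif cp in TAI_VIET_BLOCK:
            report["tai_viet"].append(label)
        else:
            report["other"].append(label)
    return report


# Hypothetical sample: one legacy PUA code point followed by one
# character from the Tai Viet block.
sample = "\uf000\uaa80"
print(classify(sample))
```

Any entry under "private_use" would point at text that predates the
encoded block and needs re-keying with characters from U+AA80..U+AADF.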
From unicode at unicode.org Fri Aug 23 09:55:52 2019
From: unicode at unicode.org (Ellen Mastros via Unicode)
Date: Fri, 23 Aug 2019 07:55:52 -0700
Subject: Want to be individual member
In-Reply-To:
References:
Message-ID:

You may join by following instructions on this page:
https://www.unicode.org/consortium/joinform.html

Please let me know if you have any other questions.

Ellen Mastros
Office Manager
Unicode Inc.
PO Box 391476
Mountain View, CA 94039
(408)401-8915

You can support us whenever you shop at Amazon.com

On Fri, Aug 23, 2019 at 2:00 AM Kamfisht Universe Engineering
<kamfisht at gmail.com> wrote:

> Dear sir
>
> With due respect, I would like to be individual member of Unicode. Please
> cooperate me.
>
> Best wishes
>
> Muhammad Ariful Haque (Shohagh)
> MSc in Disaster Management, University of Dhaka
> BSc (Engg) in Computer Science & Engineering, Darul Ihsan University
> Officer, Janata Bank Limited (Card Management Department, Head Office)
> DAIBB from Institute of Bankers Bangladesh
> CEO(non-profit), Kamfisht Universe Engineering
> Partner, Global Water Partnership
> Partner, Global Soil Partnership of FAO, UN
> Partner, GSIN of NASA
> Ex-ICT Consultant, Government Teachers Training College, Feni
> Adopter Member, Bluetooth Special Interest Group
> Member, International Desalination Association
> Life Member, Bangladesh Computer Society
> Ex-Member, International Phonetic Association
> Ex-Student, BKIAC, Bangladesh University of Engineering and Technology
> (BUET)
> Briefly trained on Fire management, Fire Service and Civil Defense
> Interned in Network Simulation in NPED of BAEC
> Cell: +8801710822509, +8801819462549, +8801575086759
>
> On Tuesday, September 20, 2016, Kamfisht Universe Engineering <
> kamfisht at gmail.com> wrote:
>
>> Dear Sir
>>
>> With due respect, I want to share my thinking with you. I think one of
>> the significant disabilities/ disaster of indigenous, tribal, and
>> backward people is linguistics.
>> UN Official Linguistics people have
>> more scope to develop themselves rather than that of non-UN Official
>> Linguistics peoples. To save the endangered languages (intangible
>> heritage of our world) as well as providing the global standard
>> education, we might develop such global standard digital phonetic
>> system so that every pronunciation of human could be perfectly write
>> down on their own alphabets with adding global phonetic alphabets.
>> This system will provide shortest path to learn more foreign languages
>> for everybody of the world and enhance scope to acquire latest global
>> knowledge. Again if any one can write down his/her native language
>> with foreign script, then learning foreign language will be easier.
>> For example, writing Japanese language using Bangla alphabets and
>> vice-versa will let learning foreign language easier but it needed a
>> computerized global standard format. So I think Digital transliteration
>> services for all languages will enhance earlier achievement of UN declared
>> Sustainable Development Goal.
>>
>>
>> Best wishes
>>
>>
>> Muhammad Ariful Haque (Shohagh)
>> MSc (Student) in Disaster Management, University of Dhaka
>> BSc (Engg) in Computer Science & Engineering, Darul Ihsan University
>> AEO, Janata Bank Limited (Card Management Department, Head Office)
>> CEO(non-profit), Kamfisht Universe Engineering
>> Founder, YBIT -a concern of Kamfisht Universe Engineering
>> Partner, Global Water Partnership
>> Partner, Global Soil Partnership of FAO, UN
>> Partner, GSIN of NASA
>> Voluntary ICT Consultant, Government Teachers Training College, Feni
>> Adopter Member, Bluetooth Special Interest Group
>> Member, Young Leader Program Committee of International Desalination
>> Association
>> Member, International Phonetic Association
>> Student, BKIAC, Bangladesh University of Engineering and Technology
>> Expectant: PhD in Sustainable Development
>> Address: Jha-44/15, Khilgaon Taltola, Dhaka-1219, Bangladesh
>> Cell: +88 01710822509, +88 01819462549
>>
>

From unicode at unicode.org Sun Aug 25 05:26:53 2019
From: unicode at unicode.org (Vinodh Rajan via Unicode)
Date: Sun, 25 Aug 2019 12:26:53 +0200
Subject: Tai Laing JHA and SA
Message-ID:

Hi,

I was looking at UTN 11 and it seems both JHA and SA in Tai Laing are
supposed to be represented by U+A9EC. Am I reading it correctly? Or is
it a mistake in the consonant table? L2/11-130R (page 7) also seems to
show the same character for JHA and SA.

I find it a bit strange that an alphabet which innovated for the
missing Pali consonants would end up using the same character for two
distinct Pali consonants.

V

--
http://www.virtualvinodh.com
From unicode at unicode.org Mon Aug 26 23:56:35 2019
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Tue, 27 Aug 2019 04:56:35 +0000
Subject: The native name of Tai Viet script and language(s)
In-Reply-To: <83ef1dco05.fsf@gnu.org>
References: <83ef1dco05.fsf@gnu.org>
Message-ID:

"As the proposal for TaiViet script to the Unicode is still on
the progress, we use the Private Use Area for TaiViet
characters (U+F000..U+F07E)."

Er... The script has been in Unicode for about 10 years, since Unicode
5.2. The block description in 16.8 of Unicode 12 provides useful info:

https://www.unicode.org/versions/Unicode12.0.0/ch16.pdf

What may be helpful to understand is that "Tai" refers, at one level,
to an entire language family that encompasses languages spoken from
southern Thailand in the south to central China in the north, and from
Vietnam in the east to eastern India in the west. "Tai" can also be
used at a different level as the name for individual languages in that
family (with either an un-aspirated /t/ as in /tai/, or an aspirated
/tʰ/ as in /tʰai/ - and in China /t/ is usually written with "d"),
though usually a distinguishing qualifier is added to the name, as in
"Tai Dam" or "Dehong Dai". Thai, aka Siamese, is a particular exception.

So, Tai Viet is used for writing various Tai languages in Vietnam and
Laos, and reportedly also in Central Thailand. These are all distinct
languages. IIRC, the script name "Tai Viet" was coined because of
predominant use in Vietnam, not because that's what any user community
historically would call the script.

The script _is_ related to Thai script, but I'm not sure I would say it
has "the same origin as that of Thai language/script used in Thailand",
as that is too simplistic a view of the historic connections: it
suggests that Thai script and Tai Viet developed directly from the same
precursor, which isn't really accurate.

And the mentions of language reflect misunderstanding.
"TaiViet refers to the Tai language used by Tai people in Vietnam..."
No, it does not refer to a language at all. And "_the_ Tai language...
in Vietnam" is misunderstanding the language situation: of over 100
languages spoken in Vietnam, there are 32 languages from the Tai-Kadai
language family, and 12 from the Southwestern Tai branch, which is the
branch that includes Thai (Siamese). To say "the language [has] the
same origin as that of Thai..." isn't correct in that there isn't _one_
language involved. It would be accurate to say that the languages
written with the Tai Viet script are closely related to Thai (in the
same sense that French, Spanish, Italian, etc. are closely related to
one another). For more on the Southwestern Tai languages, see
https://www.ethnologue.com/subgroups/southwestern.

Hope that's of some help.

Peter

-----Original Message-----
From: Unicode On Behalf Of Eli Zaretskii via Unicode
Sent: Thursday, August 22, 2019 6:46 AM
To: unicode at unicode.org
Subject: The native name of Tai Viet script and language(s)

Could someone "in the know" please help me make the Tai Viet script
documentation in Emacs accurate? The current short description we have
is in the file lisp/language/tai-viet.el in the Emacs source tree. You
can see it here:

http://git.savannah.gnu.org/cgit/emacs.git/tree/lisp/language/tai-viet.el

My concern is with the text under "sample-text" (line 40) and in the
documentation string following that (starting on line 48), which
states the name of the script and the language expressed with Tai Viet
characters. However, that text is from long ago, before Unicode had a
Tai Viet block, so it still uses at least one PUA character, which I
think is incorrect.
In addition, I didn't find any place where I could copy/paste the
current accurate name of the script and at least one of the languages
that use that script.

Could someone please help me set this text straight? Bonus points for
also telling how to say "hello" (or any similar greeting) in one of
the Tai Viet languages, so that we could add that to the etc/HELLO
file. (I think the sample-text attempts to include such a greeting,
but again, I'm not sure it is correct.)

Thanks in advance.

From unicode at unicode.org Tue Aug 27 01:33:31 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 27 Aug 2019 09:33:31 +0300
Subject: The native name of Tai Viet script and language(s)
In-Reply-To: (message from Peter Constable on Tue, 27 Aug 2019 04:56:35 +0000)
References: <83ef1dco05.fsf@gnu.org>
Message-ID: <83v9ujdsn8.fsf@gnu.org>

> From: Peter Constable
> Date: Tue, 27 Aug 2019 04:56:35 +0000
>
> "As the proposal for TaiViet script to the Unicode is still on
> the progress, we use the Private Use Area for TaiViet
> characters (U+F000..U+F07E)."
>
> Er... The script has been in Unicode for about 10 years, since Unicode 5.2.

Yes, it's an old and outdated text (Emacs has been around since 1985,
and has supported multilingual text editing since 1997). Easy to fix,
and I will fix it, but my main difficulty is with the text that uses
the script itself, which is why I asked here. I couldn't find
copy/paste-able text for that anywhere on the Internet.

Thanks.
From unicode at unicode.org Tue Aug 27 02:33:21 2019
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Tue, 27 Aug 2019 08:33:21 +0100
Subject: The native name of Tai Viet script and language(s)
In-Reply-To:
References: <83ef1dco05.fsf@gnu.org>
Message-ID: <20190827083321.0db83736@JRWUBU2>

On Tue, 27 Aug 2019 04:56:35 +0000
Peter Constable via Unicode wrote:

> The script _is_ related to Thai script, but I'm not sure I would say
> it has "the same origin as that of Thai language/script used in
> Thailand", as that is too simplistic a view of the historic
> connections: it suggests that Thai script and Tai Viet developed
> directly from the same precursor, which isn't really accurate.

Can you elaborate on that? There seems to be a chasm when we reach
back beyond the Sukhothai script, which embodies a failed reform.
(There seems to be evidence that the writing system is not a 19th
century fake - motive and opportunity had seemed available.)
Incidentally, is there a consensus view on whether the Sukhothai script
is mostly encoded, and if so, in which Unicode script(s)?

What is true is that both Thai and Tai Viet use consonants to record
the difference between two sets of three tones (though later mergers
and splits can result in 3 + 3 = 5 or 3 + 3 = 7 = 6 = 5); this seems to
be a register difference as in Cham and still in a few Khmer dialects,
going back to an ancient voicing difference.

@Eli: Ideally, you need to check that default font and language are
consistent. There are some regional differences which make it
necessary to calibrate the writing system, and one word may not suffice.

Richard.
From unicode at unicode.org Tue Aug 27 03:28:42 2019
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Tue, 27 Aug 2019 11:28:42 +0300
Subject: The native name of Tai Viet script and language(s)
In-Reply-To: <20190827083321.0db83736@JRWUBU2> (message from Richard Wordingham via Unicode on Tue, 27 Aug 2019 08:33:21 +0100)
References: <83ef1dco05.fsf@gnu.org> <20190827083321.0db83736@JRWUBU2>
Message-ID: <83k1azdnb9.fsf@gnu.org>

> Date: Tue, 27 Aug 2019 08:33:21 +0100
> From: Richard Wordingham via Unicode
>
> @Eli: Ideally, you need to check that default font and language are
> consistent. There are some regional differences which make it
> necessary to calibrate the writing system, and one word may not suffice.

Emacs doesn't yet have a useful notion of "the default language". It
is a problematic notion in a multilingual editor anyway. We do have a
mechanism in place to prefer some fonts over others given "the current
language", though; see fontset.el.

Specifically, for this context, I'd be glad to have the right text in
the Tai Viet script in _any_ of the languages that use this script, as
those greetings in the HELLO file are just a kind of show-off.