From unicode at unicode.org Wed Jan 1 05:17:11 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Wed, 1 Jan 2020 11:17:11 +0000
Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara
In-Reply-To: <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
Message-ID: <20200101111711.07152260@JRWUBU2>

On Wed, 1 Jan 2020 01:19:02 +0000
James Kass via Unicode wrote:

> A workaround until some kind of satisfactory adjustment is made might
> be to simply use COLON for VISARGA. Or...
>
>  VISARGA  U+02F8 MODIFIER LETTER RAISED COLON
> ANUSVARA  U+02D9 DOT ABOVE
>
> ...as long as the font(s) included both those characters.
>
> ?? ???
>
> ??? -- anusvara last
> ???? -- "
>
> ??: -- colon last
> ???: -- "
>
> ??? -- raised colon modifier last
> ???? -- "
>
> ??? -- spacing dot above last
> ???? -- "
>

That's exactly the sort of mess that jack-booted renderers are trying
to minimise. Their principle is that there should be only one encoding
per shape, though to be fair:

1) some renderers accept canonical equivalents.
2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
   (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
3) Superseded chillu encodings are still supported.

Richard.

From unicode at unicode.org Wed Jan 1 05:44:02 2020
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Wed, 1 Jan 2020 12:44:02 +0100
Subject: emojis for mouse buttons?
In-Reply-To:
References: <7ec43d0f.10d1.16f5d7099d1.Webtop.45@btinternet.com>
Message-ID: <20200101124337.01780947@spixxi>

Because the middle button of many mice is a scroll button, I think we
need five different characters:

LEFT MOUSE BUTTON CLICK (mouse with left button black)
MIDDLE MOUSE BUTTON CLICK (mouse with middle button black)
RIGHT MOUSE BUTTON CLICK (mouse with right button black)
MOUSE SCROLL UP (mouse with middle button black and white triangle
pointing up inside)
MOUSE SCROLL DOWN (mouse with middle button black and white triangle
pointing down inside)

These characters are pretty useful in software manuals, training
materials and user interfaces.

Happy New Year,

Marius

On Tue, 31 Dec 2019 23:04:39 +0100
Philippe Verdy via Unicode wrote:

> Playing with the filling of the middle cell to mean a double click
> is a bad idea; it would be better to add one or two rounded borders
> separated from the button (or simply display two icons in sequence
> for a double click).
>
> Note that the glyphs do not necessarily have to show a mouse; it
> could as well be a square with its lower third part split into two or
> three squares, like a touchpad (see the notification icons displayed
> by Synaptics touchpad drivers). The same rounded borders could also
> mean the number of clicks. As well, if a mouse is represented, it may
> or may not have a wire.
>
> Emoji-styles could use more realistic 3D-like rendering with extra
> shadows...
>
> On Tue, 31 Dec 2019 at 22:16, wjgo_10009 at btinternet.com via Unicode
> <unicode at unicode.org> wrote:
>
> > How about the following.
> >
> > A filled upper cell to mean click,
> >
> > a filled upper cell and a filled middle cell to mean double click,
> >
> Note that clicking and maintaining the button is just like the
> convention of using "+" after a key modifier before the actual key
> (both keys may be styled separately to decorate their glyphs into a
> keycap, but such styling should not be applied in the distinctive
> glyph; there may also be emoji sequences to combine an anonymous
> keycap base emoji with the following characters, using joiner
> controls, but this is more difficult for keys whose labels are texts
> made of multiple letters like "End" or words like "Print Screen",
> after a possible Unicode symbol for keys like Page Up, Home, End,
> NumLock; styling the text offers a better option and accessibility
> even if symbols are used and a whole translatable string is
> surrounded by decorating styles to create a visual keycap).

From unicode at unicode.org Wed Jan 1 09:08:42 2020
From: unicode at unicode.org (John W Kennedy via Unicode)
Date: Wed, 1 Jan 2020 10:08:42 -0500
Subject: emojis for mouse buttons?
In-Reply-To: <20200101124337.01780947@spixxi>
References: <20200101124337.01780947@spixxi>
Message-ID: <0FC15872-F75B-42D7-BB0D-4CF5121E8931@gmail.com>

As I have already said, this will not do. Mice do not have "left" and
"right" buttons; they have "primary" buttons, which may be on the left
or right, and "secondary" buttons, which may be on the right or left.
If this goes through, users with left-handed mouse setups will curse
you forever.

--
John W. Kennedy
"Compact is becoming contract,
Man only earns and pays."
 -- Charles Williams. "Bors to Elayne: On the King's Coins"

From unicode at unicode.org Wed Jan 1 09:24:36 2020
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 1 Jan 2020 16:24:36 +0100
Subject: emojis for mouse buttons?
In-Reply-To: <0FC15872-F75B-42D7-BB0D-4CF5121E8931@gmail.com>
References: <20200101124337.01780947@spixxi>
 <0FC15872-F75B-42D7-BB0D-4CF5121E8931@gmail.com>
Message-ID:

This is a user setting; the OS and software will automatically adapt
to it to display the proper label or icon, and they will be able to
document it accordingly. Primary/secondary/tertiary buttons are not
used, even in the OS itself (the mouse drivers remap the internal
events when the mouse is configured for left-handed use). If needed
(when they want to document the difference between right-handed and
left-handed use), they will change the label, icon or character;
there's no reason not to use the left vs. right indication for the
mouse buttons. I think it's definitely better to force applications to
change the character accordingly: usually "left click" is just named
"click" (not "primary click"), but "right click" is used everywhere
(and may be contextually changed to "left click" where appropriate to
document the left-handed behavior).

Also, I do not advocate a glyph limited to a mouse; the character
would be encoded just as well if it showed a square touchpad. And the
wired vs. wireless distinction is not relevant here, as we just want
to be able to conveniently document key mappings used by applications
and present them the same way as other keys on a keyboard (even if the
keyboard is virtual on a touch screen). Those who want a real mouse,
a real wired or wireless distinction, or a touchpad do not need a
distinction of clicked buttons, and they already have characters
encoded for them, including emojis, but these are NOT usable to
document the key mappings that are so frequently needed in apps (e.g.
menus showing shortcuts) and their documentation.

From unicode at unicode.org Wed Jan 1 11:36:50 2020
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Wed, 1 Jan 2020 18:36:50 +0100
Subject: emojis for mouse buttons?
In-Reply-To: <0FC15872-F75B-42D7-BB0D-4CF5121E8931@gmail.com>
References: <20200101124337.01780947@spixxi>
 <0FC15872-F75B-42D7-BB0D-4CF5121E8931@gmail.com>
Message-ID: <20200101183623.1fbf9eaf@spixxi>

Unicode characters are named after their appearance, not their
semantics. For example, the diaeresis and the umlaut share the code
point U+0308. A printed booklet cannot be aware of whether the user is
right- or left-handed. This is the same issue as with U+2BEA and
U+2BEB, which are designed for ltr and rtl writing.

On Wed, 1 Jan 2020 10:08:42 -0500
John W Kennedy via Unicode wrote:

> As I have already said, this will not do. Mice do not have "left"
> and "right" buttons; they have "primary" buttons, which may be on the
> left or right, and "secondary"
> buttons, which may be on the right or left. If this goes through,
> users with left-handed mouse setups will curse you forever.
>

From unicode at unicode.org Wed Jan 1 14:11:04 2020
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 1 Jan 2020 20:11:04 +0000
Subject: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
In-Reply-To: <20200101111711.07152260@JRWUBU2>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
Message-ID: <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>

On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:

> That's exactly the sort of mess that jack-booted renderers are trying
> to minimise. Their principle is that there should be only one encoding
> per shape, though to be fair:
>
> 1) some renderers accept canonical equivalents.
> 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ), collating
>    (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> 3) Superseded chillu encodings are still supported.

There was never any need for atomic chillu form characters. The
principle of only one encoding per shape is best achieved when every
shape gets an atomic encoding. Glyph-based encoding is incompatible
with Unicode character encoding principles.

It's too bad that ISCII didn't accommodate the needs of Vedic Sanskrit,
but here we are.
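
Point 1 of the list quoted above ("some renderers accept canonical
equivalents") can be illustrated with a minimal sketch using Python's
standard unicodedata module. The sample characters and the helper name
canonically_equivalent are illustrative choices, not taken from the
thread.

```python
import unicodedata

def canonically_equivalent(s: str, t: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

# Devanagari QA as one code point versus KA followed by NUKTA.
qa_atomic = "\u0958"
qa_sequence = "\u0915\u093c"

# Latin a-with-diaeresis versus a followed by U+0308.
a_precomposed = "\u00e4"
a_sequence = "a\u0308"

# The code point sequences differ, but each pair is canonically equivalent.
print(qa_atomic == qa_sequence, canonically_equivalent(qa_atomic, qa_sequence))
print(a_precomposed == a_sequence, canonically_equivalent(a_precomposed, a_sequence))
```

A renderer that normalizes its input before shaping treats each pair
as the same text; one that does not may shape the two spellings
differently.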
From unicode at unicode.org Wed Jan 1 17:09:49 2020
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 1 Jan 2020 23:09:49 +0000
Subject: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
In-Reply-To: <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
Message-ID:

On 2020-01-01 8:11 PM, James Kass wrote:

> It's too bad that ISCII didn't accommodate the needs of Vedic Sanskrit,
> but here we are.

Sorry, that might be wrong to say. It's possible that it's Unicode's
adaptation of ISCII that hinders Vedic Sanskrit.

From unicode at unicode.org Wed Jan 1 19:04:27 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 2 Jan 2020 01:04:27 +0000
Subject: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
In-Reply-To:
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
Message-ID: <20200102010427.41766290@JRWUBU2>

On Wed, 1 Jan 2020 23:09:49 +0000
James Kass via Unicode wrote:

> On 2020-01-01 8:11 PM, James Kass wrote:
> > It's too bad that ISCII didn't accommodate the needs of Vedic
> > Sanskrit, but here we are.
>
> Sorry, that might be wrong to say. It's possible that it's Unicode's
> adaptation of ISCII that hinders Vedic Sanskrit.

Have you found a definition of the ISCII handling of Vedic characters?

The problem lies in Unicode's failure to standardise the encoding of
Devanagari text. But for the consistent failure to include a
standardisation of text in a script in TUS, one might wonder if the
original idea was to duck the issue by resorting to canonical
equivalence.

I've been looking at Microsoft's specification of Devanagari character
order. In
https://docs.microsoft.com/en-us/typography/script-development/devanagari,
the consonant syllable ends

[N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]

where
   N is nukta
   A is anudatta (U+0952)
   H is halant/virama
   M is matra
   SM is syllable modifier signs
   VD is vedic

"Syllable modifier signs" and "vedic" are not defined. It appears that
SM includes U+0903 DEVANAGARI SIGN VISARGA.

I note that even ??? is given a dotted circle by HarfBuzz. Now, this
might not be an entirely fair test; I suspect anudatta is assigned this
position because originally the Sindhi implosives were encoded as
consonant plus nukta and anudatta, though rendering still fails with
HarfBuzz when nukta is inserted (????).

Richard.

From unicode at unicode.org Wed Jan 1 19:05:26 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 2 Jan 2020 01:05:26 +0000
Subject: One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)
In-Reply-To: <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
Message-ID: <20200102010526.12aaa7f6@JRWUBU2>

On Wed, 1 Jan 2020 20:11:04 +0000
James Kass via Unicode wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
>
> > That's exactly the sort of mess that jack-booted renderers are
> > trying to minimise. Their principle is that there should be only
> > one encoding per shape, though to be fair:
> >
> > 1) some renderers accept canonical equivalents.
> > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
> >    collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
> > 3) Superseded chillu encodings are still supported.
>
> There was never any need for atomic chillu form characters. The
> principle of only one encoding per shape is best achieved when every
> shape gets an atomic encoding.

I should have written per-word shape. I should also have added that
most renderers attempt to handle Mongolian, despite its encoding Middle
Mongolian phonetics rather than characters. Also, they don't attempt to
sort the Arabic script per-language subsets out, which leads to a bad
mess at Wiktionary when Unicode characters differ only in a few forms.

> Glyph-based encoding is incompatible
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so
complicated that it is unsurprising that its definition is partly
missing from Unicode 1.0. The official history hides behind
incompatibility with the Thai national standard, but phonetic order was
simply too complicated for Thai. Additionally, Thais don't agree on
where preposed vowels go relative to Pali consonant clusters - they
don't agree that all of them should appear in the middle of the
cluster. (I suppose the positioning rule could have been made a
stylistic feature of fonts.)

An analogue is Lao collation. While syllable boundaries can
overwhelmingly be discerned in modern Lao, Lao collations are too
complicated to be accepted for ICU if they are to support anything but
single syllables. CLDR collation (interpreted as a specification with
the normal use of specification language for the form of definitions)
can just cope, whereas the UCA can't, but the tables are huge.

Richard.
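
The contrast between visual and phonetic order described above shows
up directly in the stored code points. What follows is a minimal
sketch using Python's standard unicodedata module; the two sample
syllables and the helper name names are illustrative choices, not
taken from the thread.

```python
import unicodedata

def names(text: str) -> list[str]:
    """Return the Unicode character names of a string, in storage order."""
    return [unicodedata.name(ch) for ch in text]

# Thai: the preposed vowel SARA E is stored BEFORE the consonant KO KAI,
# matching what is seen on the page (visual order).
thai_ke = "\u0e40\u0e01"

# Devanagari: VOWEL SIGN I is drawn to the left of KA but is stored
# AFTER it (phonetic order), so the renderer must reorder the glyphs.
devanagari_ki = "\u0915\u093f"

print(names(thai_ke))
print(names(devanagari_ki))
```

In the Thai case the renderer has nothing to reorder, while in the
Devanagari case the shaping engine has to move the vowel sign to the
left of the consonant.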

From unicode at unicode.org Thu Jan 2 01:52:55 2020
From: unicode at unicode.org (James Kass via Unicode)
Date: Thu, 2 Jan 2020 07:52:55 +0000
Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara
In-Reply-To: <20200102010427.41766290@JRWUBU2>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
 <20200102010427.41766290@JRWUBU2>
Message-ID: <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com>

On 2020-01-02 1:04 AM, Richard Wordingham wrote in a thread deriving
from this one,

> Have you found a definition of the ISCII handling of Vedic characters?

No. It would be helpful. ISCII apparently wasn't really used much. It
would also be helpful to know the encoding order in any legacy ISCII
data using the Vedic characters with respect to VISARGA/ANUSVARA.
Although such legacy data seems unlikely, I'd expect VISARGA/ANUSVARA
to be entered/stored post-syllable.

> I've been looking at Microsoft's specification of Devanagari character
> order. In
> https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> the consonant syllable ends
>
> [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
>
> where
>    N is nukta
>    A is anudatta (U+0952)
>    H is halant/virama
>    M is matra
>    SM is syllable modifier signs
>    VD is vedic
>
> "Syllable modifier signs" and "vedic" are not defined. It appears that
> SM includes U+0903 DEVANAGARI SIGN VISARGA.

What action should Microsoft take to satisfy the needs of the user
community?

1. No action, maintain status quo.
2. Swap SM and VD in the specs ordering.
3. Make new category PS (post-syllable) and move VISARGA/ANUSVARA there.
4. ?

What kind of impact would there be on existing data if Microsoft
revised the ordering?

Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE
DOTTED CIRCLE so that users can suppress unwanted and unexpected dotted
circles by adding superfluous characters to the text stream?

> I note that even ??? is
> given a dotted circle by HarfBuzz.

Same on Win 7. And (???) breaks the mark positioning as expected.

From unicode at unicode.org Thu Jan 2 11:36:46 2020
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Thu, 2 Jan 2020 10:36:46 -0700
Subject: Call for feedback on UTS #18: Unicode Regular Expressions
In-Reply-To: <5DD72E68.6060805@unicode.org>
References: <5DD72E68.6060805@unicode.org>
Message-ID: <123c34a6-bf7e-4a88-eb6a-d58b44221241@khwilliamson.com>

One thing I noticed in reviewing this is the removal of text about
loose matching of the name property. But I didn't see an explanation
for this removal. Please point me to the explanation, or tell me what
it is.

Specifically these lines were removed:

As with other property values, names should use a loose match,
disregarding case, spaces and hyphen (the underbar character "_" cannot
occur in Unicode character names). An implementation may also choose to
allow namespaces, where some prefix like "LATIN LETTER" is set globally
and used if there is no match otherwise.

There are, however, three instances that require special-casing with
loose matching, where an extra test shall be made for the presence or
absence of a hyphen.

   U+0F68 TIBETAN LETTER A and
     U+0F60 TIBETAN LETTER -A
   U+0FB8 TIBETAN SUBJOINED LETTER A and
     U+0FB0 TIBETAN SUBJOINED LETTER -A
   U+116C HANGUL JUNGSEONG OE and
     U+1180 HANGUL JUNGSEONG O-E

From unicode at unicode.org Thu Jan 2 13:22:00 2020
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Thu, 2 Jan 2020 20:22:00 +0100
Subject: Call for feedback on UTS #18: Unicode Regular Expressions
In-Reply-To: <123c34a6-bf7e-4a88-eb6a-d58b44221241@khwilliamson.com>
References: <5DD72E68.6060805@unicode.org>
 <123c34a6-bf7e-4a88-eb6a-d58b44221241@khwilliamson.com>
Message-ID:

The line just above that is:

Name matching rules follow Matching Rules from [UAX44#UAX44-LM2].

The deletion was based on feedback that the deleted text was a recap of
the above line, but a recap that didn't have precisely the same
description. It's best to point to the exact description, and have that
be in one place.

Mark

On Thu, Jan 2, 2020 at 6:40 PM Karl Williamson via Unicode
<unicode at unicode.org> wrote:

> One thing I noticed in reviewing this is the removal of text about loose
> matching of the name property. But I didn't see an explanation for this
> removal. Please point me to the explanation, or tell me what it is.
>
> Specifically these lines were removed:
>
> As with other property values, names should use a loose match,
> disregarding case, spaces and hyphen (the underbar character "_" cannot
> occur in Unicode character names). An implementation may also choose to
> allow namespaces, where some prefix like "LATIN LETTER" is set globally
> and used if there is no match otherwise.
>
> There are, however, three instances that require special-casing with
> loose matching, where an extra test shall be made for the presence or
> absence of a hyphen.
>
>    U+0F68 TIBETAN LETTER A and
>      U+0F60 TIBETAN LETTER -A
>    U+0FB8 TIBETAN SUBJOINED LETTER A and
>      U+0FB0 TIBETAN SUBJOINED LETTER -A
>    U+116C HANGUL JUNGSEONG OE and
>      U+1180 HANGUL JUNGSEONG O-E
>

From unicode at unicode.org Thu Jan 2 14:20:34 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 2 Jan 2020 20:20:34 +0000
Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara
In-Reply-To: <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
 <20200102010427.41766290@JRWUBU2>
 <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com>
Message-ID: <20200102202034.6a9eec98@JRWUBU2>

On Thu, 2 Jan 2020 07:52:55 +0000
James Kass via Unicode wrote:

> > I've been looking at Microsoft's specification of Devanagari
> > character order. In
> > https://docs.microsoft.com/en-us/typography/script-development/devanagari,
> > the consonant syllable ends
> >
> > [N]+[A] + [< H+[] | {M}+[N]+[H]>]+[SM]+[(VD)]
> >
> > where
> >    N is nukta
> >    A is anudatta (U+0952)
> >    H is halant/virama
> >    M is matra
> >    SM is syllable modifier signs
> >    VD is vedic
> >
> > "Syllable modifier signs" and "vedic" are not defined. It appears
> > that SM includes U+0903 DEVANAGARI SIGN VISARGA.
>
> What action should Microsoft take to satisfy the needs of the user
> community?
> 1. No action, maintain status quo.
> 2. Swap SM and VD in the specs ordering.
> 3. Make new category PS (post-syllable) and move VISARGA/ANUSVARA
>    there.
> 4. ?

There's a project whose basis I can't find to convert Indian Indic
rendering at least to use the USE.
Now, according to the specification of the USE, visarga, anusvara and
cantillation marks are all classified as vowel modifiers, and are so
ordered relative to one another in the Indian Indic order: left, top,
bottom, right. So, the problem should already be solved for Grantha,
and, if the plans come to fruition, will work with a font whose
Devanagari script tag is 'dev3'. However, I may have overlooked a set
of overrides to the USE categorisations.

> What kind of impact would there be on existing data if Microsoft
> revised the ordering?

A good question that *I* can't answer.

> Or should Unicode encode a new character like ZERO-WIDTH INVISIBLE
> DOTTED CIRCLE so that users can suppress unwanted and unexpected
> dotted circles by adding superfluous characters to the text stream?

It would be useful to be able to suppress inappropriate dotted circles
without disrespecting the character identity of U+25CC. (Doable in
HarfBuzz, but not in OpenType.) There's actually been a suggestion
that dotted circles should be applied after global substitutions have
been applied, so as to prevent the overcoming of renderer faults.

On Sat, 21 Dec 2019 11:57:53 +0530
Shriramana Sharma via Unicode wrote:

> This is all the more so since in some Vedic contexts (Sama Gana) the
> visarga is far separated from the syllable by other syllables like
> digits (themselves carrying combining marks) or spacing anusvara, as
> seen in examples from my Grantha proposal L2/09-372 p 40.

I presume you are referring to the middle picture. I'm having
difficulty reading it. Could you please tell us its transcription and
encoding?

A minimal change would be to extend the range of base characters to
include digits - I'm surprised matras don't frequently get added to
them.

Richard.
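
The [SM]+[(VD)] tail of the syllable pattern quoted earlier in this
thread implies that a syllable modifier such as visarga has to come
before any Vedic mark. A minimal sketch of that ordering check follows.
The SM and VD sets are assumptions chosen purely for illustration (the
thread notes that the specification does not define them), and
tail_order_ok is a hypothetical helper, not part of any renderer.

```python
# ASSUMPTION: a tiny stand-in for "syllable modifier signs".
SM = {
    "\u0901",  # DEVANAGARI SIGN CANDRABINDU
    "\u0902",  # DEVANAGARI SIGN ANUSVARA
    "\u0903",  # DEVANAGARI SIGN VISARGA
}

# ASSUMPTION: a tiny stand-in for the "vedic" class.
VD = {
    "\u0951",  # DEVANAGARI STRESS SIGN UDATTA
    "\u1cda",  # VEDIC TONE DOUBLE SVARITA
}

def tail_order_ok(cluster: str) -> bool:
    """Return True if no SM character appears after a VD character."""
    seen_vd = False
    for ch in cluster:
        if ch in VD:
            seen_vd = True
        elif ch in SM and seen_vd:
            return False
    return True

ka = "\u0915"  # DEVANAGARI LETTER KA
print(tail_order_ok(ka + "\u0903" + "\u0951"))  # visarga, then tone mark
print(tail_order_ok(ka + "\u0951" + "\u0903"))  # tone mark, then visarga
```

Under these assumed classes the first spelling fits the quoted pattern
and the second does not, which is the kind of sequence reported in this
thread as receiving a dotted circle.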

From unicode at unicode.org Thu Jan 2 20:02:03 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 3 Jan 2020 02:02:03 +0000
Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara
In-Reply-To: <2178EBE7-CB67-4F5F-A15C-157B21A8F9FC@lindenbergsoftware.com>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
 <20200102010427.41766290@JRWUBU2>
 <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com>
 <20200102202034.6a9eec98@JRWUBU2>
 <2178EBE7-CB67-4F5F-A15C-157B21A8F9FC@lindenbergsoftware.com>
Message-ID: <20200103020203.775f1882@JRWUBU2>

On Thu, 2 Jan 2020 15:07:04 -0800
Norbert Lindenberg wrote:

>> On Jan 2, 2020, at 12:20, Richard Wordingham via Unicode
>> wrote:

>> So, the problem should already be solved for Grantha, and,
>> if the plans come to fruition, will work with a font whose
>> Devanagari script tag is 'dev3'. However, I may have overlooked a
>> set of overrides to the USE categorisations.

> You can create Indic 3 fonts that get processed by the USE today, and
> use them with HarfBuzz (Chrome, Firefox, Android, ...) and with
> CoreText (Apple platforms). I don't know if anybody has already
> created such fonts.
> https://lindenbergsoftware.com/en/notes/brahmic-script-support-in-opentype/

Is there a script tag registry, or is it now a free-for-all as with
font names? (I suppose it is implicitly constrained by what the
individual renderers recognise.)

The nearest to a registry I can find is at
https://docs.microsoft.com/en-us/typography/opentype/spec/ttoreg, but
that appears to be limited to what Microsoft supports - "The tag
registry defines the OpenType Layout tags that Microsoft supports".
None of the Indic 3 script tags are there.

Richard.

From unicode at unicode.org Sat Jan 4 06:50:09 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 4 Jan 2020 12:50:09 +0000
Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara
In-Reply-To: <20200102202034.6a9eec98@JRWUBU2>
References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com>
 <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com>
 <20200101111711.07152260@JRWUBU2>
 <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com>
 <20200102010427.41766290@JRWUBU2>
 <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com>
 <20200102202034.6a9eec98@JRWUBU2>
Message-ID: <20200104125009.36b23e34@JRWUBU2>

On Thu, 2 Jan 2020 20:20:34 +0000
Richard Wordingham via Unicode wrote:

> There's a project whose basis I can't find to convert Indian Indic
> rendering at least to use the USE. Now, according to the
> specification of the USE, visarga, anusvara and cantillation marks
> are all classified as vowel modifiers, and are so ordered relative to
> one another in the Indian Indic order: left, top, bottom, right. So,
> the problem should already be solved for Grantha, and, if the plans
> come to fruition, will work with a font whose Devanagari script tag
> is 'dev3'. However, I may have overlooked a set of overrides to the
> USE categorisations.

I've now knocked up a partial* representation* of a Devanagari dev3 and
a Grantha font (which I'm dubbing 'Mock Indic 3'). The supported orders
of COMBINING DIGIT ONE and VISARGA, as in Firefox on Linux, are:

dev2: ???

dev3: ???

Grantha: (1) ??????
         (2) ??????

The second Grantha spelling is enabled by a HarfBuzz-only change to
the USE categorisations. It treats Grantha visarga and spacing
anusvara as though inpc=Top rather than inpc=Right. As I am using
Ubuntu 16.04, this override isn't supported in applications that use
the system HarfBuzz library, such as my email client.

We are now establishing incompatible Devanagari font-specific
encodings fully compliant with TUS!

Richard.
* Partial = much is not handled * Representation = glyphs are wrong, merely showing arrangement. (I've actually re-used a Tai Tham font.) From unicode at unicode.org Sat Jan 4 16:15:59 2020 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 4 Jan 2020 22:15:59 +0000 Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara In-Reply-To: <20200104125009.36b23e34@JRWUBU2> References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com> <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com> <20200101111711.07152260@JRWUBU2> <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com> <20200102010427.41766290@JRWUBU2> <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com> <20200102202034.6a9eec98@JRWUBU2> <20200104125009.36b23e34@JRWUBU2> Message-ID: <12baf44e-6141-5c7d-6b03-7afa4ea074a4@gmail.com> On 2020-01-04 12:50 PM, Richard Wordingham via Unicode wrote: > dev2: ??? DIGIT ONE> > > dev3: ??? > Grantha: (1) ?????? ONE, U+11303 VISARGA> > (2) ?????? > The second Grantha spelling is enabled by a Harfbuzz-only change to > the USE categorisations. It treats Grantha visarga and spacing > anusvara as though inpc=Top rather than inpc=Right. As I am using > Ubuntu 16.04, this override isn't supported in applications that use the > system HarfBuzz library, such as my email client. > > We are now establishing incompatible Devanagari font-specific > encodings fully compliant with TUS! This seems to be a very bad approach. And apparently it isn't limited to the Devanagari script. For the Grantha examples above, Grantha (1) displays much better here. It seems daft to put a spacing character between a base character and any mark which is supposed to combine with the base character. -------------- next part -------------- A non-text attachment was scrubbed...
Name: 20200104_Grantha.PNG Type: image/png Size: 36348 bytes Desc: not available URL: From unicode at unicode.org Sat Jan 4 21:28:04 2020 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 5 Jan 2020 03:28:04 +0000 Subject: Long standing problem with Vedic tone markers and post-base visarga/anusvara In-Reply-To: <12baf44e-6141-5c7d-6b03-7afa4ea074a4@gmail.com> References: <356457cb-57c2-0a0c-a6d7-148dfb2b52d1@gmail.com> <412c4d8e-0120-76ef-531a-db04e8055168@gmail.com> <20200101111711.07152260@JRWUBU2> <5ef3d138-858c-f1e9-609e-f137b610eb02@gmail.com> <20200102010427.41766290@JRWUBU2> <1a3881b3-a173-769e-42fe-1922695d242c@gmail.com> <20200102202034.6a9eec98@JRWUBU2> <20200104125009.36b23e34@JRWUBU2> <12baf44e-6141-5c7d-6b03-7afa4ea074a4@gmail.com> Message-ID: <20200105032804.03765bf2@JRWUBU2> On Sat, 4 Jan 2020 22:15:59 +0000 James Kass via Unicode wrote: > For the Grantha examples above, Grantha (1) displays much better > here. It seems daft to put a spacing character between a base > character and any mark which is supposed to combine with the base > character. Although it's not related to this issue, that happens in the USE scheme. It puts vowels before vowel modifiers, which has this problem if any of the vowel modifiers precede a vowel in visual order, as happens in Thai and closely related writing systems. Richard. From unicode at unicode.org Mon Jan 6 02:36:51 2020 From: unicode at unicode.org (=?utf-8?B?TWFydGluIEouIETDvHJzdA==?= via Unicode) Date: Mon, 6 Jan 2020 08:36:51 +0000 Subject: Call for Papers: G21C Grapholinguistics in the 21st century, Paris June 2020 Message-ID: <84a82014-8cab-78fc-cf3f-ed5f4f253654@it.aoyama.ac.jp> Happy New Year to everybody on this list! 
Except for the Internationalization and Unicode Conference (see https://www.unicodeconference.org/; submission deadline March 6, 2020), this list very rarely sees calls for papers, but this one should definitely be of interest at least to a subset of people on this list (mostly those with academic/theoretic inclinations). The submission deadline is very close (January 13), but I have heard there may be an extension. #####CfP2 message##### [Apologies if you receive multiple copies of this message] ************************************************ Second CALL FOR PAPERS G21C Grapholinguistics in the 21st century - From graphemes to knowledge June 17-18-19, 2020 Paris, France Contact: Yannis Haralambous grafematik2020 at sciencesconf.org or grafematik2020 at easychair.org ************************************************ G21C (Grapholinguistics in the 21st Century) is a biennial conference bringing together disciplines concerned with grapholinguistics and more generally the study of writing systems and their representation in written communication. The conference aims to reflect on the current state of research in the area, and on the role that writing and writing systems play in neighboring disciplines like computer science and information technology, communication, typography, psychology, and pedagogy. In particular it aims to study the effect of the growing importance of Unicode with regard to the future of reading and writing in human societies. Reflecting the richness of perspectives on writing systems, G21C is actively interdisciplinary, and welcomes proposals from researchers from the fields of computer science and information technology, linguistics, communication, pedagogy, psychology, history, and the social sciences. G21C aims to create a space for the discussion of the range of approaches to writing systems, and specifically to bridge approaches in linguistics, informatics, and other fields.
It will provide a forum for explorations in terminology, methodology, and theoretical approaches relating to the delineation of an emerging interdisciplinary area of research that intersects with intense activity in practical implementations of writing systems. The first edition of G21C was held in Brest, France, on June 14-15, 2018. All presentations have been recorded and can be watched on http://conferences.telecom-bretagne.eu/grafematik/ *********************** Keynote speakers *********************** Jessica Coon, Associate Professor, Department of Linguistics, McGill University, Montréal, Canada: "The Linguistics of Arrival: What an alien writing system can teach us about human language" Martin Neef, Professor, Institut für Germanistik, TU Braunschweig, Braunschweig, Germany: "What is it that ends with a full stop?" *********************** Main topics of interest *********************** We welcome original proposals from all disciplines concerned with the study of written language, writing systems, and their implementation in information systems. Examples of topics include, but are not limited to: Epistemology of grapholinguistics: history, onomastics, topics, interaction with other disciplines Foundations of grapholinguistics, graphemics and graphetics History and typology of writing systems, comparative graphemics/graphetics Semiotics of writing and of writing systems Computational/formal graphemics/graphetics Grapholinguistic theory of Unicode encoding Orthographic reforms, theory and practice Graphemics/graphetics and multiliteracy Writing and Art / Writing in Art Sinographemics Typographemics, typographetics Texting, latinization, new forms of written language ASCII art, emoticons and other pictorial uses of graphemes The future of writing, of writing systems and styles Graphemics/graphetics and font technologies Graphemics/graphetics in steganography and computer security (phishing, typosquatting, etc.)
Graphemics/graphetics in art, media and communication / Aesthetics of writing in the digital era Graphemics/graphetics in experimental psychology and cognitive sciences Teaching graphemics/graphetics, the five Ws and one H Grapholinguistic applications in natural language processing and text mining Grapholinguistic applications in optical character recognition and information technologies ************************ Program committee ************************ Gabriel Altmann, formerly at Ruhr-Universität Bochum, Germany Jannis Androutsopoulos, Universität Hamburg, Germany Vlad Atanasiu, Université de Fribourg, Switzerland Kristian Berg, Universität Oldenburg, Germany Peter Bilak, Typothèque, The Hague, The Netherlands Florian Coulmas, Universität Duisburg, Germany Jacques David, Université de Cergy-Pontoise, France Mark Davis, Unicode Consortium & Google Inc., Switzerland Joseph Dichy, Université Lumière Lyon 2, France Christa Dürscheid, Universität Zürich, Switzerland Martin Dürst, Aoyama Gakuin University & W3C, Sagamihara, Japan Caroline Fontaine, IMT Atlantique & CNRS Lab-STICC, Brest, France Claude Gruaz, formerly at CNRS, Rouen, France Yannis Haralambous, IMT Atlantique & CNRS Lab-STICC, Brest, France Keisuke Honda, Imperial College London and University of Oxford, United Kingdom Shu-Kai Hsieh, National Taiwan University, Taipei, Taiwan Dejan Ivković, York University, Toronto, Canada Jean-Pierre Jaffré, formerly at Université Paris 5, France Terry Joyce, Tama University, Fujisawa, Kanagawa, Japan George Kiraz, Rutgers University, Piscataway, New Jersey, USA Marc W. Küster, Office de traduction de l'Union européenne, Luxembourg Gerry Leonidas, University of Reading, United Kingdom Kamal Mansour, Monotype Imaging, Los Altos, California, USA Klimis Mastoridis, University of Nicosia, Cyprus Dimitrios Meletis, Karl-Franzens-Universität Graz, Austria Tomi S.
Melka, formerly at Parkland College, Champaign, Illinois, USA James Myers, National Chung Cheng University, Taiwan Panchanan Mohanty, University of Hyderabad, India Lisa Moore, Unicode Consortium, USA Shigeki Moro, Hanazono University, Kyoto, Japan J.R. Osborn, Georgetown University, Washington DC, USA Jean-Christophe Pellat, Université de Strasbourg, France Miquel Peyró, Universidad de Sevilla, Spain Claude Puech, Université de la Sorbonne nouvelle, Paris, France François Rastier, formerly at CNRS, Paris, France Cornelia Schindelin, Universität Mainz, Germany Virach Sornlertlamvanich, SIIT, Thammasat University, Pathum Thani, Thailand Jürgen Spitzmüller, Universität Wien, Vienna, Austria Susanne Wehde, MRC Managing Research GmbH, München, Germany Kenneth Whistler, Unicode Consortium, Berkeley, California, USA ******************* *Important dates* ******************* * Submission deadline: January 13, 2020 * Notification of acceptance: March 30, 2020 * Conference: June 17-19, 2020 We invite you to submit original contributions in the form of extended abstracts (not exceeding 1,000 words), written in English and anonymized. All submissions will be peer-reviewed on the basis of relevance, originality, importance and clarity. To submit please use the EasyChair site: https://easychair.org/conferences/?conf=grafematik2020 For more information on the conference please visit https://grafematik2020.sciencesconf.org and follow https://twitter.com/grafematik2020 The Proceedings of the 2018 Edition of the conference have been published by Fluxus Editions: From unicode at unicode.org Mon Jan 6 05:26:34 2020 From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode) Date: Mon, 6 Jan 2020 12:26:34 +0100 Subject: Aw: Re: Re: NBSP supposed to stretch, right? In-Reply-To: References: <8b6ca6cc-ea1f-c860-7b3a-2c37638df11f@ix.netcom.com> Message-ID: An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Jan 6 11:39:58 2020 From: unicode at unicode.org (Johannes Bergerhausen via Unicode) Date: Mon, 6 Jan 2020 18:39:58 +0100 Subject: Videos on YouTube In-Reply-To: <76437c43.e1.16f46ec27d4.Webtop.56@btinternet.com> References: <76437c43.e1.16f46ec27d4.Webtop.56@btinternet.com> Message-ID: <56471C45-BDAA-4CCF-ACA2-2D616C74D422@bergerhausen.com> Dear William, dear list, the conference Alphabetica 2019 was held in The Hague at West, the museum of contemporary art: http://www.westdenhaag.nl/exhibitions/19_11_Alphabetica West is located in the former US embassy, a brutalist building by Marcel Breuer (Bauhaus): www.westdenhaag.nl The Missing Scripts Project is a joint effort of Institut Designlabor Gutenberg (IDG), Hochschule Mainz, where I am teaching, Atelier National de Recherche Typographique (ANRT) Nancy and Script Encoding Initiative (SEI) Berkeley. Here you can find videos of all talks: www.westdenhaag.nl/exhibitions/19_11_Alphabetica/more2 First step of the Missing Scripts Project is this website: www.worldswritingsystems.org All the best, Johannes > Am 27.12.2019 um 11:34 schrieb wjgo_10009 at btinternet.com via Unicode : > > I searched on YouTube for > > Gutenberg Mainz > > and filtered for > > This week > > and I found 12 videos uploaded 3 days ago about a symposium called Alphabetica 2019. > > Apparently held in Amsterdam. > > It seems that the videos were listed for that search as the notes include > > "Presented in collaboration with the Institut Designlabor Gutenberg (Hochshule Mainz)," ? [and several other organizations] > > so both the words Gutenberg and Mainz were matched to the search. > > So, a serendipitous discovery. > > There is an interesting section in one video about Bliss and a new interesting development relating to the (possible) encoding of Bliss characters into Unicode. 
> > https://www.youtube.com/watch?v=mwj2KilAXmo > > Here are links to two videos of continuous walks through Mainz: each of them includes the Statue of Gutenberg and the outside of the Gutenberg Museum yet are otherwise almost non-overlapping in their routes. > > https://www.youtube.com/watch?v=scjLxGh17rA > > https://www.youtube.com/watch?v=izqBUQkfByw > > William Overington > > Friday 27 December 2019 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Jan 8 11:28:18 2020 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Wed, 8 Jan 2020 17:28:18 +0000 (GMT) Subject: Free emoji (from Re: Videos on YouTube) In-Reply-To: <56471C45-BDAA-4CCF-ACA2-2D616C74D422@bergerhausen.com> References: <76437c43.e1.16f46ec27d4.Webtop.56@btinternet.com> <56471C45-BDAA-4CCF-ACA2-2D616C74D422@bergerhausen.com> Message-ID: <656e817f.9d0.16f86332e80.Webtop.218@btinternet.com> Johannes Bergerhausen wrote: > West is located in the former US embassy, a brutalist building by > Marcel Breuer (Bauhaus): www.westdenhaag.nl On that web page is a link to the following web page. http://www.westdenhaag.nl/exhibitions/20_02_Alphabetum_6 The title of the exhibition is FREE EMOJI There is some interesting text on the page. How would such emoji be encoded? I am wondering how this relates, if at all, to QID emoji. The concept of QID emoji has been put forward, and it is far-reaching in its implications for the future. However, it does mean that someone wanting a new emoji would need to go through the QID database process. If, in the United Kingdom, someone writes a poem, or a novel, or indeed anything, and publishes it, whether in hardcopy or on the web, then no permission is needed to do so, though the content is subject to legal constraints. There are certain requirements relating to Legal Deposit. https://www.bl.uk/legal-deposit So what if such freedom were to apply to introducing a new emoji?
For example, if I produce an ebook and I want to include a reference code, I have the option, in the Serif PagePlus X7 desktop publishing software that I use, of using a UUID (Universally Unique Identifier), or an ISBN (International Standard Book Number), or something custom. I have not produced any ebooks other than, as learning exercises, a few tests that I have not published. I looked at UUID and it seems to me that a randomly generated UUID code is not unique at an absolute level. ISBN needs registration with payment being involved. Yet there is always custom. So if there were to be free emoji as mentioned in the text for that exhibition, how could they be encoded for interoperability? Does the exhibition address that issue? Maybe publish a PDF and send it for legal deposit with a code of some sort and then that is regarded as a precedent? Or what? What would a custom code be like? Maybe the author's initials followed by a serial number, then interchange being by using a tag sequence after a (new, not yet encoded?) base character of the tag character version each of those characters that are in the custom code? Lots of potential problems there too. What are the options? If someone on this list is visiting the exhibition, a write up posted in this mailing list would be welcome please, at least, by me, and maybe by some other participants too. William Overington Wednesday 8 January 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Jan 10 12:30:38 2020 From: unicode at unicode.org (dinar qurbanov via Unicode) Date: Fri, 10 Jan 2020 21:30:38 +0300 Subject: how to make custom combining diacritical marks for arabic letters? In-Reply-To: References: <20180517201255.5da51fa5@JRWUBU2> Message-ID: hello. you can browse to replies that are not quoted below from https://unicode.org/mail-arch/unicode-ml/y2018-m05/0039.html . 
where can i write some bug reports or feature requests in order to get custom diacritic marks automatically positioned at right place above and below arabic letters, and also without having to put beginning / middle / end forms of arabic letters manually, but using just "simple" arabic letter unicode codes. and, where should i submit bug reports for what, what is responsible for what. seems users of unicode should be able to use private use area like this, to develop their own arabic and other diacritics, not only latin / greek / cyrillic... though i am even not tried to make latin/cyrillic/greek custom diacritics yet... i used custom latin and cyrillic scripts, but i need not to develop custom diacritics, because there are plenty of ready diacritics to use with them. 2018-05-19 13:22 GMT+03:00, dinar qurbanov : > this is a test i made that time: http://tmf.org.ru/arabic.html . look > at second line. my custom mark is located too left on the most left > "B", and is located too right on the middle (that is of middle form of > B) and on the most righ "B" (that is of starter form of B). it should > be located right above the below dot. > > - this was the problem that i could not solve. > > also there are problems that i could solve by using 1) rtl override > mark; 2) and using start, middle, end, separate B characters instead > of using simple arabic B, that would be easier. (you can see in the > example that that characters are used). (using different forms of > letter can also be achieved by using php or javascript, etc). > > > > > 2018-05-17 22:12 GMT+03:00 Richard Wordingham via Unicode > : >> On Thu, 17 May 2018 09:49:55 +0300 >> dinar qurbanov via Unicode wrote: >> >>> how to make custom combining diacritical marks for arabic letters? >>> should only font drivers and programs support it, or should also >>> unicode support it, for example, have special area for them? 
>>> >>> as i know, private use area can be used to make combining diacritical >>> marks for latin script without problems. >>> >>> but when i tried, several years ago, to make that for arabic script, >>> with fontforge, i had to use right to left override mark, and manually >>> insert beginning, middle, ending forms of arabic letters, and even >>> then, my custom marks were not located very properly above letters. >> >> I'm offering suggestions, but I don't know that they will work. >> >> The one thing that may help you is that these marks cannot appear in >> plain text. There are a number of things you need to do: >> >> 1) Persuade the renderer to treat your character as being a run in a >> single script. You might be able to do this by: >> >> a) Not having any lookups for the Arabic script. >> >> b) Using RLM to persuade the renderer that you have a right-to-left run. >> >> It is just possible that this may fail with OpenType fonts but work >> with Graphite or AAT fonts. If it works, you will then have to >> implement all the Arabic shaping yourself. >> >> 2) If OpenType fonts will treat the data as a single script run, you >> will need to ensure that there is an OpenType substitution feature that >> the renderer will support. Fortunately, many modern text applications >> will allow you to force the ccmp feature to be enabled - I have used >> such feature forcing with OpenType in LibreOffice and also in HTML, >> which renders accordingly in all the modern browsers I have tested - MS >> Edge on Windows 10, Firefox and, on iPhones, Safari. While the ccmp >> feature is enabled for the PUA in Firefox, it is disabled in MS Edge on >> Windows 10. >> >> 3) I believe AAT will soon be available for products using the HarfBuzz >> layout engine, so it is likely to become available on Firefox and >> LibreOffice. If AAT looks like a solution, you may need to research the >> attitudes of Chrome and OpenOffice, for I believe they have chosen not >> to support Graphite.
>> >> A totally different solution would be to recompile your application so >> that it believes that your diacritics are in the Arabic script. >> >> Richard. > From unicode at unicode.org Fri Jan 10 16:48:22 2020 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 10 Jan 2020 22:48:22 +0000 Subject: New Unicode Working Group: Message Formatting In-Reply-To: <5E18F2D7.20505@unicode.org> References: <5E18F2D7.20505@unicode.org> Message-ID: On 2020-01-10 9:55 PM, announcements at unicode.org wrote: > But until now we have not had a syntax for localizable message strings > standardized by Unicode. What is the difference between "localizable message strings" and "localized sentances"? Asking for a friend. From unicode at unicode.org Fri Jan 10 16:50:56 2020 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 10 Jan 2020 22:50:56 +0000 Subject: New Unicode Working Group: Message Formatting In-Reply-To: References: <5E18F2D7.20505@unicode.org> Message-ID: <27d358e8-3695-097c-f23f-043e76a24963@gmail.com> * sentences On 2020-01-10 10:48 PM, James Kass wrote: > > On 2020-01-10 9:55 PM, announcements at unicode.org wrote: >> But until now we have not had a syntax for localizable message >> strings standardized by Unicode. > > What is the difference between "localizable message strings" and > "localized sentances"? Asking for a friend. > From unicode at unicode.org Fri Jan 10 17:17:12 2020 From: unicode at unicode.org (Steven R. Loomis via Unicode) Date: Fri, 10 Jan 2020 15:17:12 -0800 Subject: New Unicode Working Group: Message Formatting In-Reply-To: References: <5E18F2D7.20505@unicode.org> Message-ID: James, A localizable message string is one similar to those given in the example: English: "The package will arrive at {time} on {date}." German: "Das Paket wird am {date} um {time} geliefert." The message string may contain any number of complete sentences, including zero ( "Arrival: {time}" ).
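An editorial illustration of why named placeholders matter in the English and German example above: the two translations mention the time and the date in opposite orders. This is a minimal sketch using Python's built-in string formatting, not whatever syntax the working group may eventually define; the `MESSAGES` table and the `format_message` helper are hypothetical names, and the date and time values are made up.

```python
# Hypothetical message table; the two strings are the ones quoted in the thread.
MESSAGES = {
    "en": "The package will arrive at {time} on {date}.",
    "de": "Das Paket wird am {date} um {time} geliefert.",
}


def format_message(locale: str, **args: str) -> str:
    """Fill the named placeholders of the message for the given locale."""
    return MESSAGES[locale].format(**args)


# Named placeholders let each translation put the arguments where its grammar needs them.
print(format_message("en", time="14:30", date="2020-01-11"))
print(format_message("de", time="14:30", date="2020-01-11"))

# Purely positional placeholders fix the argument order, so a translation that
# needs the opposite order silently receives the values in the wrong slots.
DE_POSITIONAL = "Das Paket wird am %s um %s geliefert."
print(DE_POSITIONAL % ("14:30", "2020-01-11"))  # time lands in the date slot
```

The caller passes the same arguments for every locale; only the message string decides where they appear.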
The Message Format Working Group is to define the *format* of the strings, not their *repertoire*. That is, should the string be "Arrival: %s" or "Arrival: ${date}" or "Arrival: {0}"? Does that answer your question? -- Steven R. Loomis | @srl295 | git.io/srl295 > El ene. 10, 2020, a las 2:48 p. m., James Kass via Unicode escribió: > > > On 2020-01-10 9:55 PM, announcements at unicode.org wrote: >> But until now we have not had a syntax for localizable message strings standardized by Unicode. > > What is the difference between "localizable message strings" and "localized sentences"? Asking for a friend. > From unicode at unicode.org Fri Jan 10 17:31:01 2020 From: unicode at unicode.org (James Kass via Unicode) Date: Fri, 10 Jan 2020 23:31:01 +0000 Subject: New Unicode Working Group: Message Formatting In-Reply-To: References: <5E18F2D7.20505@unicode.org> Message-ID: <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com> Yes, thank you, that answers the question. Format rather than repertoire. Please note, though, that the example given of a localizable message string is also an example of a localized sentence. On 2020-01-10 11:17 PM, Steven R. Loomis wrote: > James, > > A localizable message string is one similar to those given in the example: > English: "The package will arrive at {time} on {date}." > German: "Das Paket wird am {date} um {time} geliefert." > > The message string may contain any number of complete sentences, including zero ( "Arrival: {time}" ). > > The Message Format Working Group is to define the *format* of the strings, not their *repertoire*. That is, should the string be "Arrival: %s" or "Arrival: ${date}" or "Arrival: {0}"? > > > Does that answer your question? > > -- > Steven R. Loomis | @srl295 | git.io/srl295 > > > >> El ene. 10, 2020, a las 2:48 p.
m., James Kass via Unicode escribió: >> >> >> On 2020-01-10 9:55 PM, announcements at unicode.org wrote: >>> But until now we have not had a syntax for localizable message strings standardized by Unicode. >> What is the difference between "localizable message strings" and "localized sentences"? Asking for a friend. >> From unicode at unicode.org Sat Jan 11 13:37:50 2020 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 11 Jan 2020 19:37:50 +0000 (GMT) Subject: New Unicode Working Group: Message Formatting In-Reply-To: <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com> References: <5E18F2D7.20505@unicode.org> <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com> Message-ID: <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com> A person in England, who knows no German, wants to send the parcel to a person in Germany, who knows no English. The person in England wants to send a message about the delivery to the person in Germany. > English: "The package will arrive at {time} on {date}." The person wants to send the message by email. > German: "Das Paket wird am {date} um {time} geliefert." Where does the translation of the text take place please, and by whom or by which computer? During the actual transmission from the computer in England to the computer in Germany, is the text of the string in English, or German, or in a language-independent form please? ---- If the parcel were being sent from France to Germany by a person who knows only French, during the transmission of the message about the parcel, is the text of the string in French, or English, or German, or in a language-independent form please?
William Overington Saturday 11 January 2020 From unicode at unicode.org Sat Jan 11 14:42:47 2020 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 11 Jan 2020 21:42:47 +0100 Subject: New Unicode Working Group: Message Formatting In-Reply-To: <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com> References: <5E18F2D7.20505@unicode.org> <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com> <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com> Message-ID: You seem to have never seen how translation packages work and are used in common projects (not just CLDR, but you could find them as well in Wikimedia projects, or translation packages for a lot of open source packages). The purpose is to allow translating the UI of these applications into the language the user requests. Internally the application can use whatever representation it needs: it may be in any language or could be just an identifier; here this does not matter, as they are independent of the final translation rendered. In CLDR, identifiers are used (more or less based on simplified English, sometimes abbreviations or conventional codes). In typical .po(t) packages the identifiers are the strings in the source language from which the software was built and its strings extracted, to be replaced by calling an API. Various projects do not always use English as the source of their translation and even if this is the source, the strings themselves are not always the unique identifiers used. If you send your package and need to print a label for it, of course you'll print the label in a chosen language. Nothing forbids the printout from displaying both languages, i.e. two copies of the message translated in two languages (English or German in your example; just look at the printed notices you find in your purchase packages: the booklets frequently include multiple copies, one per language, often a dozen for products imported from China to Europe; even food is frequently labeled in several languages for international brands).
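An editorial sketch of the identifier-keyed lookup described above: the application only knows a message identifier and its arguments, and the language is chosen at rendering time, possibly more than once for the same message. Everything here is hypothetical (the `CATALOG` layout, the `delivery.eta` identifier, the `render` helper and the fallback rule); real packages such as gettext catalogs or CLDR data are organised differently.

```python
# Hypothetical catalog: message identifier -> locale -> translated template.
CATALOG = {
    "delivery.eta": {
        "en": "The package will arrive at {time} on {date}.",
        "de": "Das Paket wird am {date} um {time} geliefert.",
    },
}


def render(message_id: str, locale: str, fallback: str = "en", **params: str) -> str:
    """Look up a message by identifier and render it in the requested locale.

    Falls back to the `fallback` locale when no translation exists.
    """
    translations = CATALOG[message_id]
    template = translations.get(locale, translations[fallback])
    return template.format(**params)


# What travels between systems is only the identifier and the arguments.
payload = {"id": "delivery.eta", "params": {"time": "14:30", "date": "2020-01-11"}}

# The same payload can be rendered once per language, e.g. for a bilingual label.
for locale in ("en", "de", "fr"):  # "fr" has no entry here and falls back to English
    print(locale, render(payload["id"], locale, **payload["params"]))
```

Under these assumptions the question of which language is "in transit" does not arise: the payload carries no natural-language text at all.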
If needed, product descriptions or source and delivery addresses will be accessible via an online web app by printing a barcode or QR code on the label (they will be converted to a URI): a URI by itself has no language, it's also an identifier, allowing one to retrieve the texts in multiple languages or in the language of the user's choice. So your question makes no sense with the example you give. Le sam. 11 janv. 2020 à 21:21, wjgo_10009 at btinternet.com via Unicode < unicode at unicode.org> a écrit : > A person in England, who knows no German, wants to send the parcel to a > person in Germany, who knows no English. > > The person in England wants to send a message about the delivery to the > person in Germany. > > > English: "The package will arrive at {time} on {date}." > > The person wants to send the message by email. > > > German: "Das Paket wird am {date} um {time} geliefert." > > Where does the translation of the text take place please, and by whom or > by which computer? > > During the actual transmission from the computer in England to the > computer in Germany, is the text of the string in English, or German, or > in a language-independent form please? > > ---- > > If the parcel were being sent from France to Germany by a person who > knows only French, during the transmission of the message about the > parcel, is the text of the string in French, or English, or German, or > in a language-independent form please? > > William Overington > > Saturday 11 January 2020 > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 13 05:20:52 2020 From: unicode at unicode.org (Thomas Spehs (MonMap) via Unicode) Date: Mon, 13 Jan 2020 19:20:52 +0800 Subject: Geological symbols Message-ID: <000e01d5ca03$856f1560$904d4020$@monmap.mn> Hi, I would like to ask if there is any way to create geological "symbols" with Unicode such as: Q????, but with the two "1"s over each other, without a space. Thanks!
-------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 13 10:07:02 2020 From: unicode at unicode.org (Oren Watson via Unicode) Date: Mon, 13 Jan 2020 11:07:02 -0500 Subject: Geological symbols In-Reply-To: <000e01d5ca03$856f1560$904d4020$@monmap.mn> References: <000e01d5ca03$856f1560$904d4020$@monmap.mn> Message-ID: This is not possible in Unicode plain text as far as I can tell, since Unicode doesn't allow overstriking arbitrary characters over each other the way more advanced layout systems, e.g. LaTeX, do. It is however possible to engineer a font to arrange those characters like that by using aggressive kerning. On Mon, Jan 13, 2020 at 10:14 AM Thomas Spehs (MonMap) via Unicode < unicode at unicode.org> wrote: > Hi, I would like to ask if there is any way to create geological "symbols" > with Unicode such as: Q????, but with the two "1"s over each other, > without a space. Thanks! > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 13 10:11:10 2020 From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode) Date: Mon, 13 Jan 2020 17:11:10 +0100 Subject: Aw: Geological symbols In-Reply-To: <000e01d5ca03$856f1560$904d4020$@monmap.mn> References: <000e01d5ca03$856f1560$904d4020$@monmap.mn> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Jan 13 11:15:27 2020 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Mon, 13 Jan 2020 17:15:27 +0000 (GMT) Subject: New Unicode Working Group: Message Formatting In-Reply-To: References: <5E18F2D7.20505@unicode.org> Message-ID: <277c6d47.bc5.16f9fe73656.Webtop.55@btinternet.com> I notice that in the web page https://github.com/unicode-org/message-format-wg/issues/3 there is a request to add more features. One of those requested features is as follows > Inflections (genders, articles, declensions, etc.)

So I am wondering exactly which formats will be covered by the project and how those formats can be applied in various contexts, not necessarily only those initially considered.

William Overington

Monday 13 January 2020

From unicode at unicode.org Mon Jan 13 18:07:32 2020
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 14 Jan 2020 01:07:32 +0100
Subject: Geological symbols
In-Reply-To:
References: <000e01d5ca03$856f1560$904d4020$@monmap.mn>
Message-ID:

It is possible with some other markup languages, including HTML, by using ruby notation and other interlinear notations to create special vertical layouts inside a horizontal line. There are difficulties, however, caused by line wraps, which may occur before the vertical layout or even inside it for each stacked item, and by managing the line height for the whole line.

You could end up with the same problems as those found in mathematical formulas, or in composing Egyptian hieroglyphs or Visible Speech, for which a markup language has to be defined (with a convention, similar to an orthographic or typographic convention) in addition to the core characters that are used to build up the composition, and possibly some extra styling (to adjust the size of individual items, or to align them properly in the stack and fit them cleanly in the composition area, e.g. an ideographic square). Bidirectionality adds further difficulties.

Not all texts are purely linear (one-dimensional), and a linear representation is difficult to interpret without adding markup syntax inside the source text, and sometimes even extra symbols (or punctuation) in the linear composition that would not be needed in a true two-dimensional layout. Unicode does not encode characters for the second dimension and the layout, so it is up to markup languages (or orthographic conventions) to define the extra semantics and/or layout. A font alone cannot guess without these conventions, and even when these conventions are used, the assumptions made can sometimes lead to an incorrect layout.

From unicode at unicode.org Mon Jan 13 18:29:44 2020
From: unicode at unicode.org (Steven R. Loomis via Unicode)
Date: Mon, 13 Jan 2020 16:29:44 -0800
Subject: New Unicode Working Group: Message Formatting
In-Reply-To: <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com>
References: <5E18F2D7.20505@unicode.org>
 <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com>
 <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com>
Message-ID:

> On Jan 11, 2020, at 11:37 AM, wjgo_10009 at btinternet.com via Unicode wrote:
>
> A person in England, ...

As noted in the blog, the scope of this working group is a syntax for "adapting programs". It is not intended for individual communication between two persons.

> Where does the translation of the text take place please, and by whom or by which computer?

The question of when and how the message translation takes place is also out of scope for the Working Group. Mr. Verdy has given a great summary introduction to the process in a separate reply.

--
Steven R.
Loomis | @srl295 | git.io/srl295

From unicode at unicode.org Mon Jan 13 23:44:07 2020
From: unicode at unicode.org (via Unicode)
Date: Tue, 14 Jan 2020 13:44:07 +0800
Subject: AW: Geological symbols
In-Reply-To:
References: <000e01d5ca03$856f1560$904d4020$@monmap.mn>
Message-ID: <001901d5ca9d$a7fb7b10$f7f27130$@monmap.mn>

Thanks for your reply. I think LaTeX is actually not a good option for our purpose, because we want to create and disseminate datasets that are easy to use and do not require any software or special font installation. So we'll live with the slightly uglier version.

Anyway, thanks!

Thomas

From: "Jörg Knappen"
Sent: Tuesday, 14 January 2020 00:11
To: thomas at monmap.mn
Cc: unicode at unicode.org
Subject: Aw: Geological symbols

Hello Thomas,

Unicode delegates this (combined superscripts and subscripts) to higher-level markup languages or rich text editors. I don't know how widespread the use of LaTeX is among geologists, but notation like this is a perfect use case for LaTeX.

--Jörg Knappen

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Tue Jan 14 03:21:22 2020
From: unicode at unicode.org (Hans Åberg via Unicode)
Date: Tue, 14 Jan 2020 10:21:22 +0100
Subject: Geological symbols
In-Reply-To: <001901d5ca9d$a7fb7b10$f7f27130$@monmap.mn>
References: <000e01d5ca03$856f1560$904d4020$@monmap.mn>
 <001901d5ca9d$a7fb7b10$f7f27130$@monmap.mn>
Message-ID: <392AC1FB-B91E-4397-8E72-ECA8C2945740@telia.com>

For rendering, you might have a look at ConTeXt, because I recall it has an option whereby Unicode superscripts and subscripts can be displayed over each other without extra processing.

From unicode at unicode.org Tue Jan 14 07:38:39 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Tue, 14 Jan 2020 13:38:39 +0000 (GMT)
Subject: New Unicode Working Group: Message Formatting
In-Reply-To:
References: <5E18F2D7.20505@unicode.org>
 <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com>
 <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com>
Message-ID: <3965d3fe.14c6.16fa44716b9.Webtop.224@btinternet.com>

The reply from Mr Verdy has indeed been helpful, as has an offlist private reply from someone who has, thus far, not been a participant in this thread.

Mr Verdy wrote:

> You seem to have never seen how translation packages work and are used
> in common projects (not just CLDR, but you could find them as well in
> Wikimedia projects, or translation packages for lot of open source
> packages).

What seems to be the case to Mr Verdy is in fact the actual situation.

I do not satisfy the second of the two conditions of the invitation to join the working group. I am, in fact, retired and I have never worked in the i18n/l10n industry. Also, from the explanations it is not as close to my research interests as I had thought, and indeed hoped. I just do what I can on my research project from time to time using a home computer, a personal webspace hosted by an internet service provider, and some budget software, mainly High-Logic FontCreator and the Serif PagePlus desktop publishing package, together with the software bundled with Windows 10. Older people are often advised to try to keep the mind active, so my research activity at least does that. If the research itself has benefits more generally in making progress in the application of information technology, then that is an additional benefit.

One thing of which you might like to take account, and specifically "build-out" in computer formatting, is a tendency that can occur in some computer systems software, and that also occurred in everyday transactions before computers became widespread: not allowing a person to be recorded or listed with more than two initials before his or her surname. Some people even have a practice of not using more than two initials when the document before them, such as a letter or a form, specifically uses three or more initials. Common explanations are that "It's for the computer", "Two initials is enough to identify someone" and "Someone could have many names". Yet the second is not true, and the first is only because somewhere along the line someone has decided that that is how it is to be done. The third is true, but the fact that that is the person's name on his or her birth certificate is the legal fact of the matter, and so it needs to be properly accommodated in systems recording names. Also, the United Kingdom and United States format of a given name, one or more additional given names, then a surname is not suitable for some other cultures. I remember some registration forms for college courses that would ask for surname and forenames, with a panel for each, together with a printed note on every such form: "If your name cannot be expressed in that format, please write your whole name in the box labelled 'surname'".

However, with localization there are other issues. I seem to remember reading somewhere that people whose name is correctly expressed in a script other than Latin script often have a transliterated "Romanized form" of their name as well, for use on travel documents. So will your format system include provision for this please, such as by allowing both to be linked together in a document?

Another feature is that I have known people from various countries who have chosen to be known in everyday workplace situations by an English first name rather than their official given name, while using their original surname, perhaps transliterated. So it would be good if the name format accounts for that too please, in a manner that does not give the possible impression of that use being for some questionable purpose. Maybe a new term such as ChosenSocialName could be used for that.

An interesting facet of transliteration is that the name of a famous mathematician, whose name was properly written using Cyrillic characters, was transliterated into English as Chebyshev, whereas the polynomials named after him are each designated by the letter T. The transliteration of the name of the mathematician into German starts with a T rather than the C used in English. There was a short thread that explored this topic in this mailing list around the year 2000, not necessarily in the year 2000 itself, but I have not been able to locate it.

William Overington

Tuesday 14 January 2020

From unicode at unicode.org Tue Jan 14 10:23:18 2020
From: unicode at unicode.org (Nelson H. F.
Beebe via Unicode)
Date: Tue, 14 Jan 2020 09:23:18 -0700
Subject: New Unicode Working Group: Message Formatting
Message-ID:

William, this is off the Unicode list.

See

http://mathreader.livejournal.com/9239.html

for a list of 207 variants of Chebyshev's name.

-------------------------------------------------------------------------------
- Nelson H. F. Beebe                    Tel: +1 801 581 5254                  -
- University of Utah                    FAX: +1 801 581 4148                  -
- Department of Mathematics, 110 LCB    Internet e-mail: beebe at math.utah.edu  -
- 155 S 1400 E RM 233                       beebe at acm.org  beebe at computer.org  -
- Salt Lake City, UT 84112-0090, USA    URL: http://www.math.utah.edu/~beebe/ -
-------------------------------------------------------------------------------

From unicode at unicode.org Tue Jan 14 11:02:23 2020
From: unicode at unicode.org (Lorna Evans via Unicode)
Date: Tue, 14 Jan 2020 11:02:23 -0600
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To:
References: <20180517201255.5da51fa5@JRWUBU2>
Message-ID: <9dd234fb-a14b-d030-a76f-36ac90060e62@sil.org>

What are the combining marks supposed to look like? Are they your creation or do you have samples of usage? It is true that you will not likely get combining marks to work if either they or the base character are PUA. Adding the complexity of RTL makes the issue worse.

Lorna

On 1/10/2020 12:30 PM, dinar qurbanov via Unicode wrote:
> hello.
>
> you can browse to replies that are not quoted below from
> https://unicode.org/mail-arch/unicode-ml/y2018-m05/0039.html .
>
> where can i write some bug reports or feature requests in order to get
> custom diacritic marks automatically positioned at right place above
> and below arabic letters, and also without having to put beginning /
> middle / end forms of arabic letters manually, but using just "simple"
> arabic letter unicode codes. and, where should i submit bug reports
> for what, what is responsible for what.
>
> seems users of unicode should be able to use private use area like
> this, to develop their own arabic and other diacritics, not only latin
> / greek / cyrillic... though i am even not tried to make
> latin/cyrillic/greek custom diacritics yet... i used custom latin and
> cyrillic scripts, but i need not to develop custom diacritics, because
> there are plenty of ready diacritics to use with them.
>
>
> 2018-05-19 13:22 GMT+03:00, dinar qurbanov:
>> this is a test i made that time: http://tmf.org.ru/arabic.html . look
>> at second line. my custom mark is located too left on the most left
>> "B", and is located too right on the middle (that is of middle form of
>> B) and on the most right "B" (that is of starter form of B). it should
>> be located right above the below dot.
>>
>> - this was the problem that i could not solve.
>>
>> also there are problems that i could solve by using 1) rtl override
>> mark; 2) and using start, middle, end, separate B characters instead
>> of using simple arabic B, that would be easier. (you can see in the
>> example that that characters are used). (using different forms of
>> letter can also be achieved by using php or javascript, etc).
>>
>>
>> 2018-05-17 22:12 GMT+03:00 Richard Wordingham via Unicode:
>>> On Thu, 17 May 2018 09:49:55 +0300
>>> dinar qurbanov via Unicode wrote:
>>>
>>>> how to make custom combining diacritical marks for arabic letters?
>>>> should only font drivers and programs support it, or should also
>>>> unicode support it, for example, have special area for them?
>>>>
>>>> as i know, private use area can be used to make combining diacritical
>>>> marks for latin script without problems.
>>>>
>>>> but when i tried, several years ago, to make that for arabic script,
>>>> with fontforge, i had to use right to left override mark, and manually
>>>> insert beginning, middle, ending forms of arabic letters, and even
>>>> then, my custom marks were not located very properly above letters.
>>> I'm offering suggestions, but I don't know that they will work.
>>>
>>> The one thing that may help you is that these marks cannot appear in
>>> plain text. There are a number of things you need to do:
>>>
>>> 1) Persuade the renderer to treat your character as being a run in a
>>> single script. You might be able to do this by:
>>>
>>> a) Not having any lookups for the Arabic script.
>>>
>>> b) Using RLM to persuade the renderer that you have a right-to-left run.
>>>
>>> It is just possible that this may fail with OpenType fonts but work
>>> with Graphite or AAT fonts. If it works, you will then have to
>>> implement all the Arabic shaping yourself.
>>>
>>> 2) If OpenType fonts will treat the data as a single script run, you
>>> will need to ensure that there is an OpenType substitution feature that
>>> the renderer will support. Fortunately, many modern text applications
>>> will allow you to force the ccmp feature to be enabled - I have used
>>> such feature forcing with OpenType in LibreOffice and also in HTML,
>>> which renders accordingly in all the modern browsers I have tested - MS
>>> Edge on Windows 10, Firefox and, on iPhones, Safari. While the ccmp
>>> feature is enabled for the PUA in Firefox, it is disabled in MS Edge on
>>> Windows 10.
>>>
>>> 3) I believe AAT will soon be available for products using the HarfBuzz
>>> layout engine, so it is likely to become available on Firefox and
>>> LibreOffice. If AAT looks like a solution, you may need to research the
>>> attitudes of Chrome and OpenOffice, for I believe they have chosen not
>>> to support Graphite.
>>>
>>> A totally different solution would be to recompile your application so
>>> that it believes that your diacritics are in the Arabic script.
>>>
>>> Richard.
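
The difficulty with private-use marks described in the message above can be illustrated from the default character properties alone. The following is a minimal sketch using Python's standard `unicodedata` module. U+E000 is an arbitrary private-use code point chosen here as a hypothetical custom mark, and `describe` is a helper name introduced for the example. The values reported are only the defaults in the Unicode Character Database; a private agreement may assign other properties, but a renderer that does not know the agreement will fall back to these.

```python
import unicodedata


def describe(ch):
    """Return (general category, canonical combining class, bidi class) of ch."""
    return (
        unicodedata.category(ch),
        unicodedata.combining(ch),
        unicodedata.bidirectional(ch),
    )


PUA_MARK = "\ue000"  # hypothetical custom diacritic placed in the private use area
FATHA = "\u064e"     # ARABIC FATHA, an encoded Arabic vowel mark
BEH = "\u0628"       # ARABIC LETTER BEH, a base letter

for label, ch in [("PUA U+E000", PUA_MARK), ("FATHA", FATHA), ("BEH", BEH)]:
    print(label, describe(ch))
```

By default the private-use character is reported as a non-combining, left-to-right character, whereas the encoded fatha is a nonspacing mark that inherits the direction of its base. That is consistent with the need for a right-to-left override and with the marks not attaching to the Arabic letters as hoped.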

From unicode at unicode.org Tue Jan 14 12:05:59 2020
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Tue, 14 Jan 2020 19:05:59 +0100
Subject: New Unicode Working Group: Message Formatting
In-Reply-To: <3965d3fe.14c6.16fa44716b9.Webtop.224@btinternet.com>
References: <5E18F2D7.20505@unicode.org>
 <02a8e996-a86b-d865-d29a-1a22ac966f22@gmail.com>
 <69f57f26.13b5.16f961cda8c.Webtop.211@btinternet.com>
 <3965d3fe.14c6.16fa44716b9.Webtop.224@btinternet.com>
Message-ID:

People's names are NOT transliterated freely. It is up to each person to document his or her romanized name; it should not be invented by automatic processes. And frequently the (official) romanized name does not match the original name in another script: this is very frequent for Chinese people, as well as for trademarks. There are also common but informal names, not always official but commonly used in the press and media, and their orthography varies across countries and languages.

If these people are well known (notably historic personalities, or artists), they may have their own page in some Wikipedia and in Wikidata. There is no need to "translate" their names: you would use a database query to retrieve them (including the preferred or most frequent one, and the official one). In some countries several orthographies may be used, e.g. for streets named after people. These names are not translatable, except where the streets are locally multilingual: this is not a database of people's names but a geographic database for other purposes, and even though the names originate from people, they are still geographic names *derived* from people's names. For this you would still use placeholders in the messages, and the value of the placeholder may be queried in the relevant database for the relevant target language; variable forms of these names (e.g. genitives) may be found there but are not easily derived.

If these are geographic names, they may be transliterated, but there are competing standards for the transliteration of toponyms, so you will also need to tune your application to select the romanization system relevant for the target language, if the geographic database does not already contain an official or preferred romanization. (The international standards are language neutral, but they are not relevant for specific countries that have their own official terminology, or for the United Nations, which needs to cite the names in several official working languages. There is also a need for transliteration from Latin to other scripts.)

In any case, proper names have to be treated specially: nothing in a message format API can be used to select the effective replacement value of a placeholder. But the replacement may, or may not, specify alternate forms for correct formatting when multiple forms are possible (genitives, capitalisation, elisions and contextual mutations) for the same selected name coming from an external database. The MessageFormat API and translator tools should not have to manage the external databases, which will be "translated" separately, with enough forms relevant for their presentation and composition in larger messages.

Why does this group exist now in CLDR? Most probably because there are already difficulties in managing translations in the existing CLDR data (which is focused on a small part of what is translatable). CLDR is concerned with only a few geographic items: countries, some subnational regions, continents, and some cities used for time zones. But the main problem is the proliferation of variant forms in CLDR, added only for the few languages that need them, with no evident fallback to the common form used in most other languages, which do not need that distinction or do not need the same kind of distinctions (e.g. plural forms, grammatical gender or personal gender that do not always match, politeness/formal forms).
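
The placeholder mechanism discussed in this thread can be sketched in a few lines. This is only an illustration under stated assumptions, not the working group's syntax: the two templates are the English and German sentences quoted earlier in the thread, while the `TEMPLATES` table, the message identifier `parcel.arrival` and the `render` function are hypothetical names. Dates and times are passed as ready-made strings; a real system would also format them per locale.

```python
# Hypothetical per-locale message catalogue. Each translation decides for
# itself where the named placeholders go.
TEMPLATES = {
    "en": {"parcel.arrival": "The package will arrive at {time} on {date}."},
    "de": {"parcel.arrival": "Das Paket wird am {date} um {time} geliefert."},
}


def render(locale, message_id, **args):
    """Look up the template for the locale and fill in its named placeholders."""
    template = TEMPLATES[locale][message_id]
    return template.format(**args)


# What travels between systems can be the identifier plus the arguments,
# which is language-independent; the text is produced at display time.
payload = {"id": "parcel.arrival", "args": {"time": "14:30", "date": "2020-01-15"}}

for locale in ("en", "de"):
    print(render(locale, payload["id"], **payload["args"]))
```

Because the placeholders are named rather than positional, the German template can put the date before the time without any change to the calling code.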

Once again, I suggest you start contributing to a translation project and experiment with it before continuing. Look at the Wikimedia wikis (translation templates, the translation extension, and the companion Translatewiki.net wiki), Transifex, Google Translator, ResourceBundle and the formatting API in Java, .po/.pot files for Gettext in many open-source projects, the Facebook translation tool, the internationalization APIs in Windows, iOS and macOS, and the ICU library, which is the de facto base for CLDR...

From unicode at unicode.org Wed Jan 15 04:30:54 2020
From: unicode at unicode.org (dinar qurbanov via Unicode)
Date: Wed, 15 Jan 2020 13:30:54 +0300
Subject: how to make custom combining diacritical marks for arabic letters?
In-Reply-To: <9dd234fb-a14b-d030-a76f-36ac90060e62@sil.org>
References: <20180517201255.5da51fa5@JRWUBU2>
 <9dd234fb-a14b-d030-a76f-36ac90060e62@sil.org>
Message-ID:

"What are the combining marks supposed to look like?"

as you can see in http://tmf.org.ru/arabic.html , i have tested a reversed fatha. also i have ideas to make a reversed kasra, different reversed dammas, and vertical variants of them all, and maybe totally different diacritics, like caron and circumflex.
some of those ideas are already available in unicode. see https://en.wikipedia.org/wiki/Arabic_script_in_Unicode#Compact_table , line U+065x . i see there are reversed and inverted dammas, small v and inverted small v, and others, but probably they are not enough for me.

i have just read https://en.wikipedia.org/wiki/Arabic_diacritics and i have seen and remembered that there are different levels of arabic diacritics. consonant modifiers, "ijam", are closer to the main line, and diacritics for short vowels, harakat, are further away. also there is tashdid, which is usually between them by its distance... i would like to be able to make several more symbols to extend the short vowels.

"Are they your creation or do you have samples of usage?"

i have an idea to use arabic script for the tatar language (it is a turkic language), which is also usable for other languages, using harakat instead of full/long/"main line" vowel letters. this would make writing shorter, with the possibility of omitting some of the vowels...

there are only 3 short vowels and 3 long vowels in the arabic language. long vowels are written on the main line like consonants, short ones with diacritics. i have checked https://en.wikipedia.org/wiki/Uyghur_language and then https://en.wikipedia.org/wiki/Arabic_script#Special_letters , and as far as i know and can see, languages with arabic script use "whole" letters to represent their additional vowels, for example ?? , ? in the uyghur language. these are made using a diacritic, but the "ijam" kind of diacritic, the consonant modifier. logically, short vowel diacritics can still be put above or below them, though that has no usage in those languages, and it probably works in unicode (i.e. probably the consonant modifiers and the 3 short vowels do not collide if put together).

how many vowels i need for the tatar language: ?????, their "thin" pairs ?????, their "russian" pairs ????, and 2 "russian" vowels "?" and "?". so, i need 16 diacritics to put above or below consonant letters.

this idea of mine is not used anywhere, only in a short handwriting example. it is here: http://qdb.narod.ru/tattyazmagif/qaradaft07.gif . so, yes, this is my creation, a constructed script, and it is not developed completely, but just a sketch. so, i would like to use the private use area for that.

From unicode at unicode.org Fri Jan 17 07:03:00 2020
From: unicode at unicode.org (Michel Mariani via Unicode)
Date: Fri, 17 Jan 2020 14:03:00 +0100
Subject: Unihan variants information
In-Reply-To:
References:
Message-ID: <293F26FC-5057-419B-93F1-6004EF69AB14@ouvaton.org>

FYI, the "Unihan Variants" utility has recently been added to the open-source application Unicopedia Plus.

It provides both the linear and the structured information planned about one year ago.

I think that the graph view, available in SVG format, can be especially useful for spotting possible inconsistencies between variant properties...

HTH,

--Michel MARIANI

> I've developed an open-source, multi-platform desktop application called Unicode Plus, which is a set of utilities related to Unicode, Unihan and emoji.
> > The basic Unihan-related utilities are almost completed, and now I would like to add more useful information about the Unihan variants: > > 1. First option: "Linear Information" > > - A linear list of all the variants *related* to one given Unihan character would be displayed, similar to what can be found in Apple's Character Viewer (or Palette), or in the "Unihan Variant Dictionary" application. > > - Two sources of data could be merged: > > 1. The information provided by the "Variants table for Unicode" data file UniVariants.txt by Prof. Kōichi Yasuoka. > > 2. The information extracted from the relevant Unihan DB tag properties: kSemanticVariant, kSimplifiedVariant, kSpecializedSemanticVariant, kTraditionalVariant, kZVariant. > > - Discarding self-variants, assuming that Z-variants are somehow symmetrical, and possibly merge the different types of variants tags would result into independent sets of *related* Unihan characters. Accessing the info would then simply imply testing which set a given character belongs to, and omit the character itself for display. > > - This kind of information is most certainly user-friendly, however it lacks structural information about the relationships between the different variants. > > 2. Second option: "Structured Information" > > - This is probably more ambitious and challenging: ideally, the information could be displayed graphically as a diagram of characters joined by arrowed links, indicating the type of variant. It would support one-to-one, one-to-many and many-to-one relationships... > > > Any ideas, comments, suggestions are most welcome... > > -- Michel MARIANI -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: unihan-variants-turtle-screenshot.png Type: image/png Size: 127856 bytes Desc: not available URL: From unicode at unicode.org Fri Jan 17 10:34:40 2020 From: unicode at unicode.org (jenkins via Unicode) Date: Fri, 17 Jan 2020 09:34:40 -0700 Subject: [unihan] Unihan variants information In-Reply-To: <293F26FC-5057-419B-93F1-6004EF69AB14@ouvaton.org> References: <293F26FC-5057-419B-93F1-6004EF69AB14@ouvaton.org> Message-ID: <94C577BB-880A-49E8-8076-72BBB5E8BCD5@apple.com> Very impressive! Thank you for this. > On Jan 17, 2020, at 6:03 AM, Michel Mariani via Unihan wrote: > > FYI, the "Unihan Variants" utility has been recently added to the open-source application Unicopedia Plus . > It provides both the linear and structured informations planned about one year ago. > I think that the graph view available in SVG format can be especially useful to spot possible inconsistencies between variant properties... > HTH, > > --Michel MARIANI > > > >> I've developed an open-source, multi-platform desktop application called Unicode Plus , which is a set of utilities related to Unicode, Unihan and emoji. >> >> The basic Unihan-related utilities are almost completed, and now I would like to add more useful information about the Unihan variants: >> >> 1. First option: "Linear Information" >> >> - A linear list of all the variants *related* to one given Unihan character would be displayed, similar to what can be found in Apple's Character Viewer (or Palette), or in the "Unihan Variant Dictionary" application. >> >> - Two sources of data could be merged: >> >> 1. The information provided by the "Variants table for Unicode" data file UniVariants.txt by Prof. Kōichi Yasuoka. >> >> 2. The information extracted from the relevant Unihan DB tag properties: kSemanticVariant, kSimplifiedVariant, kSpecializedSemanticVariant, kTraditionalVariant, kZVariant.
>> >> - Discarding self-variants, assuming that Z-variants are somehow symmetrical, and possibly merge the different types of variants tags would result into independent sets of *related* Unihan characters. Accessing the info would then simply imply testing which set a given character belongs to, and omit the character itself for display. >> >> - This kind of information is most certainly user-friendly, however it lacks structural information about the relationships between the different variants. >> >> 2. Second option: "Structured Information" >> >> - This is probably more ambitious and challenging: ideally, the information could be displayed graphically as a diagram of characters joined by arrowed links, indicating the type of variant. It would support one-to-one, one-to-many and many-to-one relationships... >> >> >> Any ideas, comments, suggestions are most welcome... >> >> -- Michel MARIANI > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Jan 23 12:32:56 2020 From: unicode at unicode.org (Marius Spix via Unicode) Date: Thu, 23 Jan 2020 19:32:56 +0100 Subject: Stop words for CLDR Message-ID: <20200123193219.67a71018@spixxi> I wonder if there is any interest in adding stop words to CLDR? Stop words are ignored by natural language processing algorithms, with use cases like search engines, word clouds and text classification. There are already existing collections with stop words like [1] or [2] which could be used, but I believe that Unicode CLDR would be the best place for such lists.
Regards, Marius Spix [1] https://pypi.org/project/stop-words/ [2] https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip From unicode at unicode.org Sat Jan 25 12:41:27 2020 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 25 Jan 2020 18:41:27 +0000 Subject: Adding Experimental Control Characters for Tai Tham Message-ID: <20200125184127.0e8b5b43@JRWUBU2> This topic is very similar to the recent topic "How to make custom combining diacritical marks for arabic letters?". There is a suggestion that the encoding of Tai Tham syllables be changed (https://www.unicode.org/L2/L2019/19365-tai-tham-structure.pdf, by Martin Hosken), and there is a strong desire to experiment with it. However, unless it is to proscribe good rendering, it needs at least two extra 'control' characters, which have been suggested as: 1A8E TAI THAM SIGN INITIAL 1A8F TAI THAM SIGN FINAL These would follow a subscript character. In simple cases, they would indicate whether the subscript is part of the onset or part of the coda of a syllable. The idea that has been floated is that the experimentation be done by changing the renderer, which is invoked by various applications. However, there is the problem of script runs - these characters are not yet in the Tai Tham script, and most applications lack a mechanism for assigning PUA characters to a script. However, there is a set of inherited characters which in a Tai Tham context have not yet been assigned any meaning - the variation selectors. I have experimented with them, and at least in the older versions of the HarfBuzz renderer (near Version 1.2.7), they do not cause any problems with the implementation of the USE - no dotted characters arise, and they can interact in shaping as suggested by a font. How inappropriate would it be to usurp a pair of variation selectors for this purpose? 
For mnemonic purposes, I would suggest usurping FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL I can think of the following relevant factors: (a) It is a maxim of English law that a person intends the reasonable foreseeable consequences of his actions. By allowing grapheme cluster boundaries between script changes, the UTC can hardly complain loudly about inherited characters being usurped. (b) Most subscript consonants are defined by SAKOT plus a base consonant, and therefore the suggested control characters have the nature of variation sequences. The effect of these characters is, though, mostly on how other characters are positioned relative to them, rather than directly on the subscript characters themselves. (c) There are 7 subscript consonants that are represented by single characters: U+1A55 TAI THAM CONSONANT SIGN MEDIAL RA This seems not to need marking for position relative to the nucleus. If it did, the marking up of logical order ??????? /huai/ 'brook' as semi-visual order would not be so simple, as SIGN FINAL should not apply to the leftmost character, MEDIAL RA. U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA This will have to be excluded from the experiment. It is very rare as a final consonant, and I suspect its exclusion will have no effect on the experiment. U+1A57 TAI THAM CONSONANT SIGN LA TANG LAI This appears to be restricted to a single word, so its exclusion should not matter at all. U+1A5B TAI THAM CONSONANT SIGN HIGH RATHA OR LOW PA Bizarrely, L2-19/365 treats this as a consonant modifier! As the USE does not require consonant modifiers to be applied to the base consonant, this ought to have no adverse effects. The combination frequently acts as a single consonant trespassing on the territory of HIGH RATHA, but my suggestion that the sequence be encoded as a precomposed character was rejected. As far as I can tell, U+1A5B is always part of the phonetic onset.
As the only case where one might need these control characters would be an implausible contraction *?????? /rat t?a?/ logical order parallel to Lao contraction ????? /k?an wa?/ 'if' logical order undisambiguated semi-visual order , which for Lao is rendered differently to ????? /k?wa?k/ logical order . Now, the disambiguated semi-visual order encoding for *?????? is . This is consistent with the USE if SIGN FINAL is a variation selector, but is a seemingly needless flaw in L2-19/365 Section 5.1.1. U+1A5C TAI THAM CONSONANT SIGN MA This character seems only to occur immediately following akshara-initial MA, so I think there are no issues. U+1A5D TAI THAM CONSONANT SIGN BA This sign is of very limited occurrence in Northern Thai. In Lao, it can occur as the subscript of a base consonant acting as a mater lectionis, but I cannot see any scope for needing to mark the role of the mark for proper rendering. U+1A5E TAI THAM CONSONANT SIGN SA As this is a non-spacing mark principally used as a coda consonant, it seems unlikely that we would need to mark the role at the experimental stage. (d) This scheme does not address the representation of the sequences and . The best idea I have is the totally hacky sequences and . Richard. From unicode at unicode.org Wed Jan 29 17:31:14 2020 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Wed, 29 Jan 2020 15:31:14 -0800 Subject: Adding Experimental Control Characters for Tai Tham In-Reply-To: <20200125184127.0e8b5b43@JRWUBU2> References: <20200125184127.0e8b5b43@JRWUBU2> Message-ID: Richard, Given that those particular two variation selectors have already been given very specific semantics for emoji sequences, and would now be expected to occur *only* in emoji sequences: https://www.unicode.org/reports/tr51/#def_text_presentation_selector usurping them to do something unrelated would probably not be a good idea. For experimentation purposes, VS13 and VS14 would be safer.
--Ken On 1/25/2020 10:41 AM, Richard Wordingham via Unicode wrote: > How inappropriate would it be to usurp a pair of variation selectors > for this purpose? For mnemonic purposes, I would suggest usurping > > FE0E VARIATION SELECTOR-15 for *1A8E TAI THAM SIGN INITIAL > FE0F VARIATION SELECTOR-16 for *1A8F TAI THAM SIGN FINAL
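For anyone wanting to try the substitution discussed in this thread, here is a minimal Python sketch, assuming the experimental roles are carried by VS13 (U+FE0C) and VS14 (U+FE0D) as suggested in the reply, rather than by VS15/VS16. The code points U+1A8E and U+1A8F are only the proposed, unassigned values from the original message, and the helper names (to_experimental, from_experimental, describe) are hypothetical, not part of any existing tool.

```python
import unicodedata

# Proposed (currently unassigned) code points from the thread; placeholders only.
SIGN_INITIAL = "\u1A8E"  # *TAI THAM SIGN INITIAL (proposed, not encoded)
SIGN_FINAL = "\u1A8F"    # *TAI THAM SIGN FINAL (proposed, not encoded)

# Stand-ins for experimentation: VS13 and VS14, which carry no emoji semantics.
VS13 = "\uFE0C"
VS14 = "\uFE0D"

# Translation tables between the proposed characters and their stand-ins.
TO_EXPERIMENTAL = str.maketrans({SIGN_INITIAL: VS13, SIGN_FINAL: VS14})
FROM_EXPERIMENTAL = str.maketrans({VS13: SIGN_INITIAL, VS14: SIGN_FINAL})


def to_experimental(text: str) -> str:
    """Replace the proposed control characters with VS13/VS14."""
    return text.translate(TO_EXPERIMENTAL)


def from_experimental(text: str) -> str:
    """Map VS13/VS14 back to the proposed control characters."""
    return text.translate(FROM_EXPERIMENTAL)


def describe(ch: str) -> str:
    """Show code point, name (if assigned) and general category of a character."""
    name = unicodedata.name(ch, "<unassigned>")
    return f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})"


if __name__ == "__main__":
    # HIGH KA, SAKOT, HIGH KA, then the proposed SIGN FINAL after the subscript.
    sample = "\u1A20\u1A60\u1A20" + SIGN_FINAL
    for ch in to_experimental(sample):
        print(describe(ch))
```

The mapping is a plain one-to-one character substitution, so it can be reversed once a real encoding decision is made; it says nothing about whether a given renderer will keep the variation selector in the Tai Tham script run, which still has to be tested per renderer.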