From chris.fynn at gmail.com Sun Mar 2 02:45:04 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Sun, 2 Mar 2014 14:45:04 +0600
Subject: Websites in Hindi
In-Reply-To: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>

I don't know about that particular Serif software, which may have
limitations, but if a site is using Unicode UTF-8, there should be no
problem creating a website in Hindi, e.g.

http://www.bbc.co.uk/hindi/
https://hi.wikipedia.org/
http://tehelkahindi.com/
http://www.webdunia.com/

From billposer2 at gmail.com Sun Mar 2 13:39:51 2014
From: billposer2 at gmail.com (Bill Poser)
Date: Sun, 2 Mar 2014 11:39:51 -0800
Subject: Websites in Hindi
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>

In my experience the problem with Hindi web sites is that many of them
used encodings other than Unicode, frequently encodings designed for a
particular font. Some fonts did not use anything like a normal encoding.
We encountered a newspaper that used a font with 8,000-some glyphs, each
representing a graphical piece of a Devanagari character or character
cluster. I don't know to what extent the use of such parochial fonts and
encodings persists. The sites I have seen using Unicode look fine.

On Sun, Mar 2, 2014 at 12:45 AM, Christopher Fynn wrote:

> I don't know about that particular Serif software which may have
> limitations, but if a site is using Unicode UTF-8, there should be no
> problem creating a website in Hindi
>
> e.g.
> http://www.bbc.co.uk/hindi/
> https://hi.wikipedia.org/
> http://tehelkahindi.com/
> http://www.webdunia.com/
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
From James_Lin at symantec.com Mon Mar 3 12:14:22 2014
From: James_Lin at symantec.com (James Lin)
Date: Mon, 3 Mar 2014 10:14:22 -0800
Subject: Websites in Hindi
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>

Another problem you may need to consider is the support of the
glyphs/fonts on your system. Not all fonts are supported/installed by
default when installing the OS.

Warm Regards,
-James

From neil at tonal.clara.co.uk Mon Mar 3 14:21:42 2014
From: neil at tonal.clara.co.uk (Neil Harris)
Date: Mon, 03 Mar 2014 20:21:42 +0000
Subject: Websites in Hindi
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID: <5314E456.2050509@tonal.clara.co.uk>

On 03/03/14 18:14, James Lin wrote:
> another problem you may need to consider is the support of the
> glyphs/fonts on your system. Not all fonts are supported/installed by
> default when installing the OS.
>
> Warm Regards,
> -James

This is where webfonts should be extremely useful -- I believe recent
versions of at least Firefox, and probably other modern browsers, should
support both webfonts and text shaping for Indic scripts by default,
whether or not the underlying platform has the correct fonts.
Neil

From petercon at microsoft.com Mon Mar 3 16:36:32 2014
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 3 Mar 2014 22:36:32 +0000
Subject: Websites in Hindi
In-Reply-To: <5314E456.2050509@tonal.clara.co.uk>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
 <5314E456.2050509@tonal.clara.co.uk>

Looking at the thread that William pointed at, the person asking for
help gave no indication as to what problems he might have been
encountering. Without specifics, the two obvious recommendations would
be (i) encode the content using conformant UTF-8, and (ii) use
conforming OpenType fonts leveraging CSS web font mechanisms.

Beyond that, that thread seemed not especially interesting.

Peter

From mgunn at egt.ie Wed Mar 5 06:10:51 2014
From: mgunn at egt.ie (Marion Gunn)
Date: Wed, 05 Mar 2014 12:10:51 +0000
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: <530F5A33.1010005@ix.netcom.com>
References: <530F5A33.1010005@ix.netcom.com>
Message-ID: <5317144B.1070201@egt.ie>

Twice as lovely when an immediately comprehensible term is used
consistently (for example, our old faithful "multilingual"), be that
term precise or no, rather than coin new terms without due reason, which
could take years to become current amongst end users and yet more years
to understand.

mg

Scríobh 27/02/2014 15:30, Asmus Freytag:
> On 2/27/2014 2:32 AM, Shriramana Sharma wrote:
>> Given that Unicode encodes scripts and not languages, how appropriate
>> is it to call the BMP and the SMP as the multi*lingual* planes?
>>
> Isn't it lovely how these things work?
>
> A./

--
Marion Gunn * eGteo (Estab.1991)
27 Páirc an Fhéithlinn, Baile an Bhóthair, An Charraig Dhubh,
Co. Átha Cliath, Éire/Ireland.
* mgunn at egt.ie * eamonn at egt.ie *

From mail at robbertbroersma.nl Thu Mar 6 14:54:18 2014
From: mail at robbertbroersma.nl (Robbert)
Date: Thu, 6 Mar 2014 21:54:18 +0100
Subject: HTTPS

Hi,

For tools that rely on the Unicode database it would be great if the
databases were available over HTTPS as well:

https://www.unicode.org/Public/6.3.0/

In addition to this it would be helpful if the archive also contained
SHA512 checksum files for each Unicode version, to verify the integrity
of databases that have already been downloaded (over HTTP), e.g.:

https://www.unicode.org/Public/6.3.0/SHA512SUMS

Mozilla already offers such checksums, although unfortunately not over
HTTPS, but they can serve as an example.

http://releases.mozilla.org/pub/mozilla.org/firefox/releases/27.0/SHA512SUMS

I think this would improve the security of many libraries that directly
and indirectly depend on Unicode.
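To illustrate the kind of check proposed above, here is a minimal Python sketch. It assumes a checksum file in the usual `sha512sum` layout (hex digest, whitespace, file name); the SHA512SUMS file for the Unicode data is only a suggestion in this message, and the sample bytes and the helper names `parse_sha512sums` and `verify` are made up for the example. No network access is involved.

```python
import hashlib


def parse_sha512sums(text):
    """Parse lines of the form '<hex digest>  <file name>' into a dict."""
    sums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, name = line.partition(" ")
        # sha512sum marks binary mode with a leading '*' on the file name.
        sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def verify(name, data, sums):
    """Return True if the SHA-512 of `data` matches the digest listed for `name`."""
    expected = sums.get(name)
    if expected is None:
        return False  # file not listed: treat as unverified
    return hashlib.sha512(data).hexdigest() == expected


# Demo with made-up content standing in for a previously downloaded file.
sample = b"0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
listing = hashlib.sha512(sample).hexdigest() + "  UnicodeData.txt\n"
sums = parse_sha512sums(listing)

print(verify("UnicodeData.txt", sample, sums))         # unmodified data
print(verify("UnicodeData.txt", sample + b"x", sums))  # altered data
```

Such a check is only as trustworthy as the channel the checksum file itself arrives over, which is why the request pairs it with HTTPS.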
Kind regards,

Robbert Broersma

From adam at nohejl.name Sun Mar 9 07:39:20 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Sun, 9 Mar 2014 13:39:20 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi

Hello again,

I would be really grateful for any reply or at least pointers to
relevant information about this topic (stroke-order data in Unihan, see
my previous message below).

Or is there any other appropriate place to discuss this?

Thank you,

--
Adam

On 2014/02/28, at 19:56, Adam Nohejl wrote:
>
> Hello,
>
> I am comparing radical data for CJK characters from different sources,
> including the Unihan database. According to the Unihan documentation*
> the kRSUnicode radical should correspond to kRSKangXi radical, which in
> turn should be based on the Kang Xi dictionary.
>
> Is there any explanation for the following discrepancies? Did I miss
> any other rules or reasoning behind the content of these two fields?
>
> Examples of the discrepancies:
>
> (1) A very common character for "most, maximum".
> U+6700 kRSKangXi 73.8
> U+6700 kRSUnicode 13.10
>
> (2) A funny character for autumn containing the turtle component.
> U+9F9D kRSKangXi 115.16
> U+9F9D kRSKanWa 115.16
> U+9F9D kRSUnicode 213.5
>
> There are also characters that actually are not included in the Kang
> Xi dictionary**, but the Unihan data contain both a purported Kang Xi
> radical and in addition to that a _different_ Unicode radical.
>
> (3) The simplified turtle character (commonly assigned to the
> traditional radical #213):
> U+4E80 kRSKangXi 213.0
> U+4E80 kRSUnicode 5.10
>
> (4) Character with the radical #72/73 at the top, i.e. IMHO an
> arbitrary decision, but unexpectedly the fields differ:
> U+66FB kRSKangXi 72.7
> U+66FB kRSUnicode 73.7
>
> - - -
>
> [*] : "Property: kRSUnicode // Description: (...)
> The first value is intended to reflect the same radical as the
> kRSKangXi field and the stroke count of the glyph used to print the
> character within the Unicode Standard."
>
> [**] The two characters are missing from the '89 edition of Kang Xi
> (which should be the same as used for Unihan) according to search on
> this site:

From leoboiko at namakajiri.net Sun Mar 9 08:49:37 2014
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Sun, 9 Mar 2014 10:49:37 -0300
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi

I don't know about the points you raise, but I wish it was easier to
help proofread Unihan data. Back in 2012 I compared kKangXi to
kIRGKangXi and found 252 conflicts, besides the cases where a character
only has one or the other. I even put together a simple tool to help
fix this, with links to the relevant pages at the online Kang Xi[1]. I
had no replies...

[1] http://namakajiri.net/misc/unihan_kangxi/compare_existing.html for
characters in Kang Xi, and for the others,
http://namakajiri.net/misc/unihan_kangxi/compare_nonexisting.html

From rscook at unicode.org Mon Mar 10 10:39:19 2014
From: rscook at unicode.org (Richard COOK)
Date: Mon, 10 Mar 2014 08:39:19 -0700
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi

Mr. Nohejl,

About the property data you mention below.

kRSUnicode property data permits multiple/variant (space-delimited)
radical/stroke values, and I think we will see important variants added
in the future.
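As a rough illustration of how such space-delimited radical/stroke values can be compared, here is a short Python sketch. The input lines are the four examples quoted in this thread; the function names `parse_rs` and `radical_mismatches`, the whitespace-separated line layout, and the handling of an apostrophe after the radical number are assumptions made for the example, not part of any Unihan tooling.

```python
def parse_rs(value):
    """Split a radical/stroke field into a list of (radical, strokes) pairs."""
    pairs = []
    for item in value.split():
        radical, _, strokes = item.partition(".")
        # Assumption: an apostrophe after the radical number marks a
        # simplified radical form; it is ignored for this comparison.
        pairs.append((int(radical.rstrip("'")), int(strokes)))
    return pairs


def radical_mismatches(lines):
    """Return code points whose kRSKangXi and kRSUnicode radicals never agree."""
    fields = {}
    for line in lines:
        code, prop, value = line.split(None, 2)
        fields.setdefault(code, {})[prop] = parse_rs(value)
    result = []
    for code in sorted(fields, key=lambda c: int(c[2:], 16)):
        props = fields[code]
        if "kRSKangXi" in props and "kRSUnicode" in props:
            kangxi = {radical for radical, _ in props["kRSKangXi"]}
            unicode_rs = {radical for radical, _ in props["kRSUnicode"]}
            if kangxi.isdisjoint(unicode_rs):
                result.append(code)
    return result


# Values as quoted in this thread.
thread_lines = [
    "U+6700 kRSKangXi 73.8", "U+6700 kRSUnicode 13.10",
    "U+9F9D kRSKangXi 115.16", "U+9F9D kRSUnicode 213.5",
    "U+4E80 kRSKangXi 213.0", "U+4E80 kRSUnicode 5.10",
    "U+66FB kRSKangXi 72.7", "U+66FB kRSUnicode 73.7",
]
print(radical_mismatches(thread_lines))

# Hypothetical data (not actual Unihan content): if a variant value were
# added to kRSUnicode, the entry would no longer be reported.
hypothetical = ["U+66FB kRSKangXi 72.7", "U+66FB kRSUnicode 73.7 72.7"]
print(radical_mismatches(hypothetical))
```

Treating each field as a set of radicals means an entry counts as consistent as soon as any one of its listed variants agrees.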
Where a specific value attested in a specific Kangxi edition is missing
from kRSUnicode, it would indeed be useful to add it, and perhaps to
give it priority (move it to the front of the list). Likewise, if a
common variant value is missing (even one not associated with Kangxi),
it might be added for convenience. And if there are any outright errors,
of course those should be identified and corrected (but clear errors are
harder to find these days).

Note that because kRSUnicode covers *all* Unihan CJK, even those
characters not present in the original Kangxi, some of the
radical/stroke values are so-called "virtual" assignments (those should
be omitted from consideration, in proofing original KX data).

Several years ago we (at Wenlin.com) produced consolidated Kangxi data
for our Zidian (Wenlin 4.X), taking these four properties (among other
data) as input:

The last of these may not have any obvious connection with Kangxi, until
one reads the kIRG_GSource property description and sees this
"sub-property" description:

"GKX Kangxi Dictionary ideographs (康熙字典) 9th edition (1958)
including the addendum (康熙字典)補遺"

PRC researchers have done much work proofing G-Source Kangxi data, to
address many aspects of the complex original text.

The Kangxi work we did at Wenlin has several dimensions, and some of
this has not yet rippled back into UCD. We have in fact already
identified many important omissions from kRSUnicode, which we plan to
propose for a future data release.

Since kRSUnicode is a Normative property, a formal proposal to modify
that data is required, for review in WG2. I have added notes on the
items you mention below, for consideration in that process, and in the
meantime, if you identify any other issues, please bring them to our
attention.

-Richard

PS: About the subject line of your message.
Please note that despite the "CJK stroke order" subject line in your
message, we are not talking about CJK stroke order here at all, but
about Kangxi and UCS radical assignment, and residual stroke *count*
data. Such data can indeed be used to "order" (collate) CJK data, but
"stroke order" is a separate issue, involving the particular sequence of
CJK Strokes (see The Unicode Standard, Appendix F) in the writing of a
given character (stroke-order data can also be used for collation and
indexing).

Wenlin's CDL database (which inspired the CJK Stroke block, and also
produced Appendix F) contains a comprehensive analysis of CJK Stroke
order *and* Radical/Stroke data for all UCS CJK, primarily focused on
PRC norms, but also including a great many variants (variant forms,
variant stroke counts, and variant radical assignments).

On Feb 28, 2014, at 10:56 AM, Adam Nohejl wrote:
>
> (1) A very common character for "most, maximum".
> 最[U+6700] kRSKangXi 73.8
> 最[U+6700] kRSUnicode 13.10
>
> (2) A funny character for autumn containing the turtle component.
> 龝[U+9F9D] kRSKangXi 115.16
> 龝[U+9F9D] kRSKanWa 115.16
> 龝[U+9F9D] kRSUnicode 213.5
>
> There are also characters that actually are not included in the Kang
> Xi dictionary**, but the Unihan data contain both a purported Kang Xi
> radical and in addition to that a _different_ Unicode radical.
>
> (3) The simplified turtle character (commonly assigned to the
> traditional radical #213):
> 亀[U+4E80] kRSKangXi 213.0
> 亀[U+4E80] kRSUnicode 5.10
>
> (4) Character with the radical #72/73 at the top, i.e. IMHO an
> arbitrary decision, but unexpectedly the fields differ:
> 曻[U+66FB] kRSKangXi 72.7
> 曻[U+66FB] kRSUnicode 73.7

From doppelbauer at gmx.net Mon Mar 10 13:34:48 2014
From: doppelbauer at gmx.net (Markus Doppelbauer)
Date: Mon, 10 Mar 2014 19:34:48 +0100
Subject: Normalization test

An HTML attachment was scrubbed...
From verdy_p at wanadoo.fr Mon Mar 10 14:28:57 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 10 Mar 2014 20:28:57 +0100
Subject: Normalization test

toNFC(0061 0305 0315 0300 05AE 0062) ->

From DerivedCombiningClass.txt:

05D0..05EA ; 0 # Lo [27] HEBREW LETTER ALEF..HEBREW LETTER TAV

In other words, 05EA with combining class 0 is blocking the composition
and any reordering between

(0061 0305 0315 0300) on one side, and

(0062) on the other side (which is also combining class 0).

So you will effectively get the composition of 0061 and 0305 (because it
is also not specifically excluded from composition in
CompositionExclusions.txt) in:

toNFC(0061 0305 0315 0300 05AE 0062),

but NOT in:

toNFC(0061 05AE 0305 0315 0300 0062).

I think you have mixed the two separate test cases.

The first thing to check is to break sequences before every character
with combining class 0 (even if it is "combining", like here the Hebrew
accent zinor).

2014-03-10 19:34 GMT+01:00 Markus Doppelbauer :

> Hello,
>
> I am working on a Unicode Normalization implementation. I have a
> question about a specific toNFC test rule.
>
> toNFC(0061 0305 0315 0300 05AE 0062) =>
> (0061 05AE 0305 0300 0315 0062)
> expected:
> (0061 05AE 0305 0300 0315 0062)
> \-------------/ =>
> (00E0 05AE 0305 0315 0062)
>
> Why doesn't 0061 and 0300 combine to 00E0 ?
>
> Thanks a lot
> Markus
From verdy_p at wanadoo.fr Mon Mar 10 14:32:00 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 10 Mar 2014 20:32:00 +0100
Subject: Normalization test

Sorry, I took the wrong line (because I typed 05EA instead of 05AE)

05AE ; 228 # Mn HEBREW ACCENT ZINOR

You're right, the combining class 228 does not block the composition.
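The test case from this thread can be checked against an existing implementation. The following minimal sketch uses Python's standard `unicodedata` module (whose data version depends on the Python release) to print the canonical combining class of each code point and the NFC result; the helper name `hexes` is made up for the example.

```python
import unicodedata


def hexes(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


source = "\u0061\u0305\u0315\u0300\u05AE\u0062"

# Canonical combining class (ccc) of every code point in the input.
for ch in source:
    print(f"{ord(ch):04X} ccc={unicodedata.combining(ch)}")

# NFC = canonical decomposition and reordering, followed by composition.
nfc = unicodedata.normalize("NFC", source)
print(hexes(nfc))

# For comparison: with nothing in between, 0061 + 0300 does compose.
print(hexes(unicodedata.normalize("NFC", "\u0061\u0300")))
```

After canonical reordering, 0305 and 0300 both carry ccc=230 and 0305 sits between the base letter and 0300, so 0300 is not eligible to compose with 0061 in this sequence.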
From markus.icu at gmail.com Mon Mar 10 14:36:13 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 10 Mar 2014 12:36:13 -0700
Subject: Normalization test

The U+0300 ( ̀ ) COMBINING GRAVE ACCENT is blocked by the U+0305 ( ̅ )
COMBINING OVERLINE which has the same ccc=230.

Could you use an existing library rather than roll your own?

markus

From rscook at unicode.org Mon Mar 10 18:44:50 2014
From: rscook at unicode.org (Richard COOK)
Date: Mon, 10 Mar 2014 16:44:50 -0700
Subject: ?MP = Multi*lingual* plane?

On Feb 27, 2014, at 7:23 AM, Michael Everson wrote:

> On 27 Feb 2014, at 02:32, Shriramana Sharma wrote:
>
>> Given that Unicode encodes scripts and not languages, how appropriate
>> is it to call the BMP and the SMP as the multi*lingual* planes?
>
> You are more than two decades late in asking this.
>
> It may have seemed more appropriate in an 8-bit code page world where
> rather small subsets limited the number of languages accessible by one
> or another part of ISO/IEC 8859.
>
> A new term like "multiscriptal" would not have been appropriate. File
> this under "We know the term 'ideograph' is a misnomer."

'When I use a word,' Humpty Dumpty said, in rather a scornful tone, 'it
means just what I choose it to mean -- neither more nor less.'

'The question is,' said Alice, 'whether you can make words mean so many
different things.'

'The question is,' said Humpty Dumpty, 'which is to be master -- that's
all.'

Alice was too much puzzled to say anything; so after a minute Humpty
Dumpty began again.
> Michael Everson * http://www.evertype.com/

From doppelbauer at gmx.net Tue Mar 11 10:50:35 2014
From: doppelbauer at gmx.net (Markus Doppelbauer)
Date: Tue, 11 Mar 2014 16:50:35 +0100
Subject: NFD -> NFC

An HTML attachment was scrubbed...

From mark at macchiato.com Tue Mar 11 11:19:06 2014
From: mark at macchiato.com (Mark Davis ☕)
Date: Tue, 11 Mar 2014 17:19:06 +0100
Subject: NFD -> NFC

Not sure about your exact case, but ICU's normalization does handle
those characters.

http://unicode.org/cldr/utility/transform.jsp?a=nfc%3Bhex&b=%5Cu30B9%5Cu3099

(That tool uses ICU for NFC).

Mark

*Il meglio è l'inimico del bene*

On Tue, Mar 11, 2014 at 4:50 PM, Markus Doppelbauer wrote:

> Hello,
>
> I have an other problem making the normalization process binary
> compatible with ICU.
> Why does "30B9 3099" not combine to "30BA"?
>
> Steps to reproduce:
> wget http://doppelbauer.name/katakana.txt
> uconv -f utf8 -t utf8 -x nfd ndf.txt
> uconv -f utf8 -t utf8 -x nfc nfc.txt
> diff katakana.txt nfc.txt
>
> Expected result: "katakana.txt" == "nfc.txt"
>
> uconv v2.1 ICU 4.8.1.1
>
> Thanks a lot
> Markus

From starback at stp.lingfil.uu.se Tue Mar 11 11:35:47 2014
From: starback at stp.lingfil.uu.se (Per Starbäck)
Date: Tue, 11 Mar 2014 17:35:47 +0100
Subject: "(in 6429)" in allkeys.txt

In the DUCET file allkeys.txt,
http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
there is "(in 6429)" as a comment for some characters.
I first didn't understand why, but then I realized those are control
characters that are part of ISO/IEC 6429.

Why is that pointed out explicitly in that context?

The reason I'm asking is that I was looking at the proposed new version
of this file, and was thinking about suggesting a short note in the
comments in the beginning of the file.

From markus.icu at gmail.com Tue Mar 11 12:33:09 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 11 Mar 2014 10:33:09 -0700
Subject: NFD -> NFC

Here is the demo using ICU4C:
http://demo.icu-project.org/icu-bin/nbrowser?t=%5Cu30B9%5Cu3099&s=&uv=0

markus

From ken.whistler at sap.com Tue Mar 11 13:57:23 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 11 Mar 2014 18:57:23 +0000
Subject: "(in 6429)" in allkeys.txt

Per asked:

> In the DUCET file allkeys.txt,
> http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
> there is "(in 6429)" as a comment for some characters.
> I first didn't understand why, but then I realized those are control
> characters that are part of ISO/IEC 6429.
>
> Why is that pointed out explicitly in that context?

1. To make it clear that those are not actually Unicode character
names, but names for control functions associated with ISO 6429. (Note
that this practice dates back a long time now in the DUCET data files --
it predates the addition of the ISO 6429 control function names as
formal name aliases in NameAliases.txt in the UCD.)

2. Because the same "names" which appear in the comments in allkeys.txt
for UCA also appear in comments for the CTT in ISO 14651 (which is
generated with the same tool). And the "(in 6429)" notes were added
there to forestall people asking questions about these "names" that
aren't "names".

3. And the reason they *continue* to appear in the comments in both
allkeys.txt and in the CTT for ISO 14651 is to preclude people asking
questions about why they would be removed. ;-)

> The reason I'm asking is that I was looking at the proposed new version
> of this file, and was thinking about suggesting a short note in the
> comments in the beginning of the file.

My personal preference, rather than larding up the header of a
machine-generated file with more commentary, would be a suggestion for
further clarification in the text of UTS #10, if necessary. After all,
the allkeys.txt header already points to UTS #10 for more information --
which anyone needs to understand and use the data file, anyway.

--Ken

From starback at stp.lingfil.uu.se Tue Mar 11 17:01:18 2014
From: starback at stp.lingfil.uu.se (Per Starbäck)
Date: Tue, 11 Mar 2014 23:01:18 +0100
Subject: "(in 6429)" in allkeys.txt
In-Reply-To: (Ken Whistler's message of "Tue, 11 Mar 2014 18:57:23 +0000")

Ken Whistler answered my questions:

>> In the DUCET file allkeys.txt,
>> http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
>> there is "(in 6429)" as a comment for some characters.
>> I first didn't understand why, but then I realized those are control
>> characters that are part of ISO/IEC 6429.
>>
>> Why is that pointed out explicitly in that context?

Thanks for your answers! I feel enlightened.

>> The reason I'm asking is that I was looking at the proposed new version
>> of this file, and was thinking about suggesting a short note in the
>> comments in the beginning of the file.
>
> My personal preference, rather than larding up the header
> of a machine-generated file with more commentary, would be
> a suggestion for further clarification in the text of UTS #10, if
> necessary. After all, the allkeys.txt header already points to
> UTS #10 for more information -- which anyone needs to understand
> and use the data file, anyway.
I agree that a clarification in the text would be better than
a comment in allkeys.txt. But I also think just changing "(in 6429)"
to "(in ISO 6429)" would be enough.

(Strange as it might seem for list regulars not everyone immediately
makes the right association from this four-digit number. :-)

I think that would be an improvement, but I admit it's a rather small
one, and it can be hard to bother to fix small things unless it's
something you do when you're fixing something nearby anyway.

This is somewhat beside the point, but since you say the file is
machine-generated I wonder about something I found in the draft version
http://www.unicode.org/Public/UCA/7.0.0/allkeys-7.0.0d5.txt
where a comment says

# Tertiary weight range: 0002..001F (30)

even though the highest used tertiary weight actually is 001E.
Isn't this comment automatically made?

From ken.whistler at sap.com Tue Mar 11 17:34:20 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 11 Mar 2014 22:34:20 +0000
Subject: "(in 6429)" in allkeys.txt

> I agree that a clarification in the text would be better than
> a comment in allkeys.txt. But I also think just changing "(in 6429)"
> to "(in ISO 6429)" would be enough.
>
> (Strange as it might seem for list regulars not everyone immediately
> makes the right association from this four-digit number. :-)

Ah, I see what the interpretation problem was. Yes, that is
a straightforward kind of improvement -- easily enough done.
Look for a change the next time the file is updated. (It will not
be immediately changed, pending other review comments.)

> This is somewhat beside the point, but since you say the file is
> machine-generated I wonder about something I found in the draft version
> http://www.unicode.org/Public/UCA/7.0.0/allkeys-7.0.0d5.txt
> where a comment says
>
> # Tertiary weight range: 0002..001F (30)
>
> even though the highest used tertiary weight actually is 001E.
> Isn't this comment automatically made?
The ranges for primary and secondary weights change with every new
repertoire addition to the input, so they are always calculated
dynamically.

By contrast, the tertiary weight range is hard-coded in the generation,
and never changes. If you look at:

http://www.unicode.org/reports/tr10/#Tertiary_Weight_Table

you can see all those pre-defined, fixed values. It is true that 0x001F
is not actually assigned as a tertiary weight for any particular
character, but it is internally set aside as a MAX_TERTIARY sentinel
value, before the first secondary weight of 0x0020. Note that the
tertiary weight 0x0007 is not actually used in the weighting, either
(for historical reasons). At any rate, the entire range 0x0002..0x001F
is considered fixed and "used" for tertiaries, so that is what is always
displayed in the summary printed at the top of allkeys.txt.

--Ken

From adam at nohejl.name Wed Mar 12 04:59:35 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Wed, 12 Mar 2014 10:59:35 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi

Mr. Cook,

Thank you for all the information (and bringing Wenlin to my attention
as well).

> Where a specific value attested in a specific Kangxi edition is missing
> from kRSUnicode, it would indeed be useful to add it,

Great to hear that.

> Since kRSUnicode is a Normative property, a formal proposal to modify
> that data is required, for review in WG2. I have added notes on the
> items you mention below, for consideration in that process, and in the
> meantime, if you identify any other issues, please bring them to our
> attention.

OK, I will prepare a more comprehensive list. Do you mean that you would
submit such a formal proposal? Or can I submit it myself somehow?

> PS: About the subject line of your message.

Yes, of course, the subject should have read "CJK radical-stroke count
data". Not one of my brightest moments, I guess...
-- Adam Nohejl From starback at stp.lingfil.uu.se Wed Mar 12 07:32:15 2014 From: starback at stp.lingfil.uu.se (Per Starbäck) Date: Wed, 12 Mar 2014 13:32:15 +0100 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: (Ken Whistler's message of "Tue, 11 Mar 2014 22:34:20 +0000") References: Message-ID: Ken Whistler wrote: > Ah, I see what the interpretation problem was. Yes, that is > a straightforward kind of improvement -- easily enough done. > Look for a change the next time the file is updated. (It will not > be immediately changed, pending other review comments.) Thanks! Then I'll skip making a formal request about this. Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names? For many uses of names, the control characters have as much need for them as any other character. Since it seems so straightforward it must have been suggested several times to introduce names like CONTROL CHARACTER NULL CONTROL CHARACTER START OF HEADING CONTROL CHARACTER START OF TEXT etc., so I assume there are good reasons for not doing that, but I can't see what they are. Since applications want names they will use other things as names when there isn't a real name, and that leads to problems. Take Emacs where the command describe-char currently describes U+0007 as name: <control> old-name: BELL (I reported the misuse of "<control>" here as a name in 2009, but it wasn't fixed until this year, so still not in a released version.) The usage of "BELL" here invites confusion with U+1F514 BELL. Emacs should do better regarding this, but still, with a proper name all of this would have been averted.
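The collision described in the message above can be reproduced with a short sketch. It assumes Python 3.3 or later, whose standard `unicodedata` module ships a Unicode database that already contains U+1F514; the helper name `describe` is hypothetical and exists only for this illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point label, General_Category, Name property or '')."""
    # Control characters have no Name property value, so a default is
    # passed to name(); without it, name() raises ValueError for them.
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, ""))

# U+0007 is a control character (Cc) with an empty Name.
print(describe("\x07"))

# Looking up the formal name BELL yields a different character entirely.
print(describe(unicodedata.lookup("BELL")))
```

An application that labels U+0007 with its old name "BELL" therefore uses a string that the standard now assigns to U+1F514.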
From mark at macchiato.com Wed Mar 12 08:11:22 2014 From: mark at macchiato.com (Mark Davis ☕) Date: Wed, 12 Mar 2014 14:11:22 +0100 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: References: Message-ID: They do have aliases in NameAliases.txt 0000;NULL;control 0000;NUL;abbreviation 0001;START OF HEADING;control 0001;SOH;abbreviation 0002;START OF TEXT;control 0002;STX;abbreviation ... Mark *Il meglio è l'inimico del bene* On Wed, Mar 12, 2014 at 1:32 PM, Per Starbäck wrote: > Ken Whistler wrote: > > Ah, I see what the interpretation problem was. Yes, that is > > a straightforward kind of improvement -- easily enough done. > > Look for a change the next time the file is updated. (It will not > > be immediately changed, pending other review comments.) > > Thanks! Then I'll skip making a formal request about this. > > Regarding these names in ISO 6429 again, how come these control > characters don't have Unicode names? For many uses of names, the control > characters have as much need for them as any other character. > Since it seems so straightforward it must have been suggested several > times to introduce names like > > CONTROL CHARACTER NULL > CONTROL CHARACTER START OF HEADING > CONTROL CHARACTER START OF TEXT > > etc., so I assume there are good reasons for not doing that, but I can't > see what they are. > > Since applications want names they will use other things as names when > there isn't a real name, and that leads to problems. Take Emacs where > the command describe-char currently describes U+0007 as > > name: <control> > old-name: BELL > > (I reported the misuse of "<control>" here as a name in 2009, but it > wasn't fixed until this year, so still not in a released version.) > The usage of "BELL" here invites confusion with U+1F514 BELL. > > Emacs should do better regarding this, but still, with a proper name > all of this would have been averted.
From eliz at gnu.org Wed Mar 12 11:26:06 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 12 Mar 2014 18:26:06 +0200 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: References: Message-ID: <83bnxbo02p.fsf@gnu.org> > From: starback at stp.lingfil.uu.se (Per Starbäck) > Date: Wed, 12 Mar 2014 13:32:15 +0100 > Cc: "unicode at unicode.org" > > Regarding these names in ISO 6429 again, how come these control > characters don't have Unicode names? They have a non-empty "old name" field: 0000;<control>;Cc;0;BN;;;;;N;NULL;;;; ^^^^ > Emacs should do better regarding this As you yourself say, it already does, so I don't see the point in this rant. From ken.whistler at sap.com Wed Mar 12 11:48:25 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 12 Mar 2014 16:48:25 +0000 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: <83bnxbo02p.fsf@gnu.org> References: <83bnxbo02p.fsf@gnu.org> Message-ID: Please be very careful here. Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as "having a Unicode name". See: http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207 for the gory details. The "Unicode name" is formally defined in terms of the Name property, which itself is a combination of enumerated values extracted from UnicodeData.txt, plus a number of rules. For all characters whose General_Category=Cc, the formal definition of the Name property is a null string. The string "<control>" is *never* to be interpreted as a "Unicode name". It is a field placeholder with legacy status. See "Interpretation of Field 1 of UnicodeData.txt" in the section I cited above.
As far as user interfaces and other applications needing "names" for Unicode control characters -- one of the reasons that the namespace for Unicode characters includes all of the formal name aliases provided in NameAliases.txt is so that applications can safely treat any formal name alias for a control character (or the other abbreviations, etc., also listed in NameAliases.txt) *as if* they were Unicode names, without running into name collisions with the actual Name property value for Unicode characters. The history of the name collision for the (relatively) recently encoded U+1F514 BELL with the traditional usage for the U+0007 control function "BELL" led the UTC to extend the namespace as noted, so we won't be running into more such problems in the future. If Emacs were to use "ALERT" or the abbreviation "BEL" for U+0007, instead of "<control>", that would avoid the collision with U+1F514 BELL, be conformant to the Unicode Standard, and presumably be helpful to users, as well. See the entries for U+0007 in NameAliases.txt: # Note that no formal name alias for the ISO 6429 "BELL" is # provided for U+0007, because of the existing name collision # with U+1F514 BELL. 0007;ALERT;control 0007;BEL;abbreviation --Ken > > Regarding these names in ISO 6429 again, how come these control > > characters don't have Unicode names? > > They have a non-empty "old name" field: > > 0000;<control>;Cc;0;BN;;;;;N;NULL;;;; > ^^^^ From eliz at gnu.org Wed Mar 12 12:17:27 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 12 Mar 2014 19:17:27 +0200 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> Message-ID: <83a9cvnxp4.fsf@gnu.org> > From: "Whistler, Ken" > Date: Wed, 12 Mar 2014 16:48:25 +0000 > Cc: "Whistler, Ken" , > "unicode at unicode.org" > > Please be very careful here. Having a non-empty value in field 1 of > UnicodeData.txt is *not* the same as "having a Unicode name".
You will see that I didn't refer to the Name attribute, I referred to the old name attribute (called Unicode_1_Name in UAX#44). From eliz at gnu.org Wed Mar 12 12:45:07 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 12 Mar 2014 19:45:07 +0200 Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt) In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> <83a9cvnxp4.fsf@gnu.org> Message-ID: <838usfnwf0.fsf@gnu.org> > Date: Wed, 12 Mar 2014 17:36:37 +0000 > From: Selvaraju Anbu Kaveeswarar > Cc: "Whistler, Ken" , starback at stp.lingfil.uu.se, unicode at unicode.org > > Unicode 1 names are deprecated and the new names are in their place. Obviously, the new names are useless when they are null. Emacs lets users specify a character by its name, so it uses aliases for characters whose Name property is a null string. From mark at macchiato.com Wed Mar 12 14:03:06 2014 From: mark at macchiato.com (Mark Davis ☕) Date: Wed, 12 Mar 2014 20:03:06 +0100 Subject: Beta CLDR Spec for v25 (LDML) Message-ID: There is a beta version of the CLDR specification for version 25, with the changes listed at: http://www.unicode.org/reports/tr35/proposed.html#Modifications If you have any feedback on the new sections, please submit it at http://unicode.org/cldr/trac/newticket. If you do, please include a link to the specific section you're commenting on. This is easy to do, since clicking on any header puts a link to that header into your browser's address bar. Mark *Il meglio è l'inimico del bene* From starback at stp.lingfil.uu.se Wed Mar 12 14:37:57 2014 From: starback at stp.lingfil.uu.se (Per Starbäck) Date: Wed, 12 Mar 2014 20:37:57 +0100 Subject: Names for control characters In-Reply-To: (Ken Whistler's message of "Wed, 12 Mar 2014 16:48:25 +0000") References: <83bnxbo02p.fsf@gnu.org> Message-ID: Ken Whistler wrote: > Please be very careful here.
Having a non-empty value in field 1 of > UnicodeData.txt is *not* the same has "having a Unicode name". > > See: > > http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207 I know it's not a name. My question was *why* control characters don't *have* names like CONTROL CHARACTER NULL CONTROL CHARACTER START OF HEADING CONTROL CHARACTER START OF TEXT etc. It would be so obvious to have it like that, so I assume there is some specific reason not to, but I still can't figure it out. For me there is not less reason for these characters to have names than any others, so for me it's like Linear B characters didn't have names, and I got the answer "no problem, they have aliases, so that's OK!" This is just strange to me. If names aren't needed, why do almost all characters have them? This is not about Emacs. Emacs was an example of a program that has use for character names, and has a harder job because of this strangeness. Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against Emacs when I mention that this property of Unicode has led to longstanding (small) bugs there, but I think real examples are better than made-up ones. > If Emacs were to use "ALERT" or the abbreviation "BEL" for U+0007, ... Yes, programs could have their own lists of preferred aliases to use, or have a rule such as always use the first alias, but why? Why not have a name, so programs don't have to choose which alias to use? 
(I may be coming off as having a mission about this; "it should be done like this!!", but mostly this is just a question: "it seems obvious it should be done like this, so what am I missing?") From eliz at gnu.org Wed Mar 12 15:04:45 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 12 Mar 2014 22:04:45 +0200 Subject: Names for control characters In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> Message-ID: <837g7znpya.fsf@gnu.org> > From: starback at stp.lingfil.uu.se (Per Starbäck) > Cc: Eli Zaretskii , "unicode\@unicode.org" > Date: Wed, 12 Mar 2014 20:37:57 +0100 > > This is not about Emacs. Emacs was an example of a program that has use > for character names, and has a harder job because of this strangeness. > Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against > Emacs when I mention that this property of Unicode has led to > longstanding (small) bugs there, but I think real examples are better > than made-up ones. What I saw as a rant is only this part (which was the only one I quoted): > > Emacs should do better regarding this "Should do better" means it still doesn't, although it's expected to. From markus.icu at gmail.com Wed Mar 12 15:11:13 2014 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 12 Mar 2014 13:11:13 -0700 Subject: Names for control characters In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> Message-ID: On Wed, Mar 12, 2014 at 12:37 PM, Per Starbäck wrote: > My question was *why* control characters don't > *have* names > That's because formally the ISO control codes do not have one fixed, normative meaning; implementers may or may not follow ISO 6429. That is why these don't have names in ISO 10646 and in Unicode. http://www.unicode.org/faq/casemap_charprop.html#15 Of course, a few control codes (e.g., U+000A) are very widely used, and have Unicode properties according to that use. (e.g., White_Space) markus
From rscook at unicode.org Wed Mar 12 15:56:07 2014 From: rscook at unicode.org (Richard COOK) Date: Wed, 12 Mar 2014 13:56:07 -0700 Subject: CJK stroke order data: kRSUnicode v. kRSKangXi In-Reply-To: References: Message-ID: <8CD243B4-1181-4125-A35C-8B23CCD46B93@unicode.org> On Mar 12, 2014, at 2:59 AM, Adam Nohejl wrote: >> >> Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention. > > OK, I will prepare a more comprehensive list. Do you mean that you would submit such a formal proposal? Or can I submit it myself somehow? You are welcome to prepare a proposal, or just send us your list. We have already started a proposal to augment kRSUnicode, but I'm not sure about the timeframe for completion. The proofing of the various Kangxi properties is separate from this, but is aimed at the specific KX edition used by IRG in Extension B work. -Richard From ken.whistler at sap.com Wed Mar 12 16:26:28 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 12 Mar 2014 21:26:28 +0000 Subject: Names for control characters In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> Message-ID: Per continued: > I know it's not a name. My question was *why* control characters don't > *have* names like > > CONTROL CHARACTER NULL > CONTROL CHARACTER START OF HEADING > CONTROL CHARACTER START OF TEXT > etc. > > It would be so obvious to have it like that, so I assume there is some > specific reason not to, but I still can't figure it out. For me there is > not less reason for these characters to have names than any others, so > for me it's like Linear B characters didn't have names, and I got the > answer "no problem, they have aliases, so that's OK!" This is just > strange to me. If names aren't needed, why do almost all characters have > them?
Ah, so this is a "Why is the sky blue?" kind of question. ;-) And perhaps the correct response is then a Just So story... Once upon a time, there was an ISO framework for character encoding. Officially his name was ISO 2022 Information technology -- Character code structure and extension techniques. But we'll think of him as the troll that lives under the bridge and just call him "2022" for short. Now 2022 had his favorite collection of code points that he kept in buckets under the bridge. But he was very, very particular about how he organized his collection. All the code points 00 to 1F had to go in the bucket labeled "C0", and all the code points 20 to 7E had to go in the bucket labeled "G0" (or "GL" -- sometimes the troll would get confused). He had other, even bigger code points, too, but we can save those for another story. 2022 said all the code points in the "G0" bucket could get names. In fact they could get lots of names, if they wanted. So 2022 also starting collecting sets of characters, where all those names were written down. Sometimes he would "escape" to one set and admire all those pretty names, and then he would "escape" to another set and admire other pretty names. 2022 was a great admirer of escaping, by the way, as well as pretty names. But the code points in the "C0" bucket were different. 2022 insisted that those code points weren't like the ones in the "G0" bucket, and they couldn't have names at all. Indeed, these were very odd code points -- 2022 called them "control functions". Sometimes when the troll took one out of the "C0" bucket and examined it, it did one thing, but the next time it might do something completely different. Only 2022's friend, the troll named 6429 living under the next bridge to the north, really understood what they might be doing from one week to the next. One day an aspiring young wizard named Unicode was crossing the bridge. As an aspiring young wizard, he was rather observant. 
And he noticed that there was a troll living under the bridge and that that troll had stolen all the code points and was hoarding them in strangely labeled buckets under the bridge. Being a wizard and all, he knew that it was his duty to slay the troll and free all the code points. So he set about writing down the appropriate spell in his brand new spellbook. Now Unicode was a very egalitarian wizard -- it just seemed right to him that all code points should be able to have names, and it would be better if each one had just one, unique name. That way, none of them would get jealous of all the names some other code point had acquired, and besides, each code point would know its name and could come when you called it. So in the first version of Unicode's spellbook, he wrote the spell down just that way. He called his spell "Unicode 1.0", because, well, it was his spell, after all, and the very first complete spell that he would be trying to use. 00 could be called "NULL" and 01 could be called "START OF HEADING", just like 20 could be called "SPACE" and 2D could be called "HYPHEN-MINUS". You may be wondering why Unicode would use such odd names for all the code points, but then there is no accounting for the whims of wizards, I guess. Well, once Unicode had finished writing down the "Unicode 1.0" spell, he started casting it on the troll: Shazaaaam! Ffffppfft! To Unicode's surprise, the spell only partly worked, but then fizzled. The troll had been badly hurt, but he was still limping around under the bridge, and he still clung tightly to his buckets of code points. Unicode looked around to see what the problem could be, and noticed that there was a warlock at the other end of the bridge. It was an infamous warlock who had taken to calling himself "10646", and from all appearances he was *also* trying to cast a spell to kill the troll and free all the code points. Apparently, casting the two spells at the same time had resulted in interference in the ley lines. 
That was why neither spell had fully worked, and was why the troll 2022 was still limping around with his code point buckets. The wizard Unicode headed across the bridge to speak to the warlock 10646: "Look, we both want to slay that troll and free his code points. Why don't we team up and cast synchronized spells?" But 10646 was a suspicious warlock. He wasn't sure that *all* of the code points could be freed safely. Who knows what mischief they might get up to if left on their own. "Those code points that the troll keeps in the C0 bucket are very dangerous," said 10646. "We can't let them just be like all the others and get ordinary names. After all, they seem to do different things in alternate weeks, and if we give them regular names, they might come when we call them, even if they are doing the wrong things that week." The wizard Unicode heaved a sigh. That seemed so silly to him. But after all, it was important to kill the troll and save all the code points. So he pulled out his quill and scratched lines through all the names for the code points from the C0 bucket in his spellbook, and decided he would call the revised spell "Unicode 1.1". It was only a little different from his first spell -- but it is important to keep track of these things. Spells can be dangerous things, after all. "How does this look to you, Master Warlock?" he asked. And 10646 nodded his cautious approval at the revision. So then the wizard Unicode and the warlock 10646 started casting their spells together. Shazaamaazama! Pockety spoketi! Keeeraack! The troll 2022 was dead! His buckets fell out of his grasp, and all the code points were freed! But the ones that rolled out of the C0 bucket didn't have names, because Unicode had scratched out all of their names in the Unicode 1.1 spell he cast, just so the warlock 10646 wouldn't interfere by casting a counterspell for them. And that is why control characters don't have names. 
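For an application that still needs something to display for a control character, the alias approach described earlier in the thread can be sketched roughly as follows. The alias data is only the excerpt of NameAliases.txt quoted in these messages, not the full file. The function names `parse_aliases` and `display_name`, the preference order (a "control" alias first, then an "abbreviation"), and the `U+XXXX` last resort are all assumptions made for this illustration; the standard does not prescribe which alias to prefer.

```python
import unicodedata

# Excerpt of NameAliases.txt as quoted in this thread
# (fields: code point; alias; alias type).
NAME_ALIASES_EXCERPT = """\
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
0007;ALERT;control
0007;BEL;abbreviation
"""

def parse_aliases(text):
    """Map each code point to its (alias, type) pairs, in file order."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        aliases.setdefault(int(code, 16), []).append((alias, kind))
    return aliases

def display_name(ch, aliases, preferred=("control", "abbreviation")):
    """Return the Name property if non-empty, else a preferred alias."""
    name = unicodedata.name(ch, "")
    if name:
        return name
    entries = aliases.get(ord(ch), [])
    for kind in preferred:  # hypothetical policy: first alias of each type
        for alias, alias_kind in entries:
            if alias_kind == kind:
                return alias
    return f"U+{ord(ch):04X}"  # hypothetical last resort: no name, no alias

ALIASES = parse_aliases(NAME_ALIASES_EXCERPT)
print(display_name("\x07", ALIASES))        # control character, alias used
print(display_name("\U0001F514", ALIASES))  # ordinary character, Name used
```

With this policy U+0007 is labeled from its aliases while U+1F514 keeps its formal name, so the two no longer share a label.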
From sdaoden at yandex.com Thu Mar 13 04:57:15 2014 From: sdaoden at yandex.com (Steffen Nurpmeso) Date: Thu, 13 Mar 2014 10:57:15 +0100 Subject: Names for control characters In-Reply-To: References: <83bnxbo02p.fsf@gnu.org> Message-ID: <20140313095715.N6t2V2vv1g3WuIu0nm6/oFt7@dietcurd.local> |So then the wizard Unicode and the warlock 10646 started casting |their spells together. Fantastic reading. |Shazaamaazama! Pockety spoketi! Keeeraack! History is made by winners. --steffen From wjgo_10009 at btinternet.com Fri Mar 14 06:41:49 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 14 Mar 2014 11:41:49 +0000 (GMT) Subject: Colour font, color font, colourfont, colorfont Message-ID: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com> Colour font, color font, colourfont, colorfont Many documents use the American English, color font. Yet will the International Standard use the en-gb-oed English colour font as the spelling and also spell colourize with a -ize ending? Would it be better to use colourfont as that would be more easily searchable on the web and would, perhaps, be more precise as to meaning and reduce the possibility of ambiguity? How would the term be expressed in other languages? Would German use a new single word? What would be the term in French? How would the name of the technology localize into the languages of the world? Is it a good idea to try to standardize the parlance and the localization of the parlance of the technology now? William Overington 14 March 2014 From alex.plantema at xs4all.nl Fri Mar 14 07:36:15 2014 From: alex.plantema at xs4all.nl (Alex Plantema) Date: Fri, 14 Mar 2014 13:36:15 +0100 Subject: Colour font, color font, colourfont, colorfont References: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: On Friday 14 March 2014 at 12:41, William_J_G Overington wrote: > Colour font, color font, colourfont, colorfont > > Many documents use the American English, color font.
> > Yet will the International Standard use the en-gb-oed English colour > font as the spelling and also spell colourize with a -ize ending? > > Would it be better to use colourfont as that would be more easily > searchable on the web and would, perhaps, be more precise as to > meaning and reduce the possibility of ambiguity? > > How would the term be expressed in other languages? > > Would German use a new single word? > > What would be the term in French? > > How would the name of the technology localize into the languages of > the world? > > Is it a good idea to try to standardize the parlance and the > localization of the parlance of the technology now? Colouri(z|s)e isn't in my dictionary; colour is already a verb as well. German: Farbenschriftschnitt, French: Fonte de caractères en couleurs. Btw, font is spelled fount in British English. Alex. From jcb+unicode at inf.ed.ac.uk Fri Mar 14 08:21:53 2014 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Fri, 14 Mar 2014 13:21:53 GMT Subject: Colour font, color font, colourfont, colorfont References: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: On 2014-03-14, Alex Plantema wrote: > Colouri(z|s)e isn't in my dictionary; colour is already a verb as well. Get a better dictionary. The word has been in the language more than four hundred years. It currently has a fairly common technical meaning of "adding colour to old monochrome photos or films". In any case, you don't need a dictionary, because -ize is a productive formation. > Btw, font is spelled fount in British English. Suggest you don't propound on languages other than your own. That used to be true in the days of metal type, although even so both spellings have been in use through the last few centuries. Then, "fount" was a technical term that few people would have cause to use. With the advent of computers, the "font" spelling has completely supplanted the "fount" spelling in everyday usage.
Within the industry, some current British letterpress printers use "fount" for metal type and "font" for digital type, while others use "font" for both. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From naenaguru at gmail.com Sat Mar 15 23:12:48 2014 From: naenaguru at gmail.com (Naena Guru) Date: Sat, 15 Mar 2014 23:12:48 -0500 Subject: Romanized Singhala got great reception in Sri Lanka Message-ID: I made a presentation demonstrating Dual-script Singhala at National Science Foundation of Sri Lanka. Most of the attendees were government employees and media representatives; a few private citizens came too. Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font. It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. The font uses Standard Ligature feature of OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as well as many Singhala letters. The font is supported across all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages. The same solution can be applied for all Indic languages. The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation. The result of the survey passed around was 100% as below (translated from Singhala): 1. I believe that Dual-script Singhala is convenient to me as it is implemented similar to English - Yes 2. Today everyone uses Unicode Sinhala. 
It is easy and has no problems - No 3. The cost of Unicode Sinhala should be eliminated by switching to Dual-script Singhala - Yes 4. We should amend Pali text in the Tripitaka according to rulings of SLS1134 - No 5. Digitizing old books is a very important thing - Yes 6. We should focus on making this easy-to-use Dual-script Singhala method a standard - Yes Please comment or send questions. From verdy_p at wanadoo.fr Sun Mar 16 00:36:41 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 16 Mar 2014 06:36:41 +0100 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: Don't you realize that what you are trying to create is completely off-topic for Unicode, as it is simply another new 8-bit encoding similar to what ISCII does for supporting multiple Indic scripts with a common encoding/transcoding table? The ISCII standard has shown its limitations: it is not enough to support all scripts correctly and completely, and it has lots of unsolved ambiguities for tricky cases, historic orthographies, or newer orthographies, which the UCS encoding supports better thanks to its larger character set and more precise character properties and algorithms. You are in fact creating a transcoding table... Except that you are mixing the concepts, and the Unicode and ISO technical committees working on the UCS don't need to handle new 8-bit encodings. And you'll soon experience the same problems as in ISCII and all other legacy 8-bit encodings: very poor INTEROPERABILITY due to version tracking or complex contextual rules...
You may still want to promote it at some government or education institution, in order to promote it as a national standard, except that there's little chance it will ever happen when all countries in ISO have stopped working on standardization of new 8-bit encodings (only a few are maintained, and these are the most complex ones, used in China and Japan). Well, in fact only Japan now seems to be actively updating its legacy JIS standard, but only with the focus of converging it to use the UCS and resolving ambiguities or solving some technical problems (e.g. with emojis used by mobile phone operators). Even China stopped updating its national standard by publishing a final mapping table to/from the full UCS (including for characters still not encoded in the UCS): this simplified the work because only one standard needs to be maintained instead of 2. Note that as long as there is no national standard supporting your proposed encoding, there is no chance that the font standards will adopt it. You may still want to register your encoding in the IANA registry, but you'll need to pass the RFC validation. And there are lots of technical details missing in your proposal for it to work with a standard mapping in fonts. There is a better chance for you to promote it only as a transliteration scheme, or as an input method for a keyboard layout (both are also not in the scope of the Unicode and ISO/IEC 10646 standards, though; they could be in the scope of the CLDR project, which is not by itself a standard but just a repository of data, supported by a few standards)... Think about it. 2014-03-16 5:12 GMT+01:00 Naena Guru : > I made a presentation demonstrating Dual-script Singhala at National > Science Foundation of Sri Lanka. Most of the attendees were government > employees and media representatives; a few private citizens came too.
> > Dual-script Singhala means romanized Singhala that can be displayed either > in the Latin script or in the Singhala script using an Orthographic Smart > Font. It is easy to input (phonetically) using a keyboard layout slightly > altered from QWERTY. The font uses Standard Ligature feature of > OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as > well as many Singhala letters. The font is supported across all OSs: > Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the > proper and complete solution on the computer for the Singhala script used > to write Singhala, Sanskrit and Pali languages. The same solution can be > applied for all Indic languages. > > The government ministries, media and people welcomed it with enthusiasm > and relief that there is something practical for Singhala. The response in > the country was singularly positive, except for the person that > filibustered the Q&A session of the presentation that spoke about the hard > work done on Unicode Sinhala, clearly outside the subject matter of the > presentation. > > The result of the survey passed around was 100% as below (translated from > Singhala): > > 1. I believe that Dual-script Singhala is convenient to me as it is > implemented similar to English - Yes > 2. Today everyone uses Unicode Sinhala. It is easy and has no problems > - No > 3. The cost of Unicode Sinhala should be eliminated by switching to > Dual-script Singhala - Yes > 4. We should amend Pali text in the Tripitaka according to rulings of > SLS1134 - No > 5. Digitizing old books is a very important thing - Yes > 6. We should focus on making this easy-to-use Dual-script Singhala > method a standard - Yes > > Please comment or send questions.
URL: From prosfilaes at gmail.com Sun Mar 16 02:15:24 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 16 Mar 2014 00:15:24 -0700 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru wrote: > I made a presentation demonstrating Dual-script Singhala at National Science > Foundation of Sri Lanka. Most of the attendees were government employees and > media representatives; a few private citizens came too. I don't know what the point was of sending this message. You claim that Unicode Sinhala was outside the subject matter of the presentation, so why would you post it to this list? -- Kie ekzistas vivo, ekzistas espero. From everson at evertype.com Sun Mar 16 06:18:38 2014 From: everson at evertype.com (Michael Everson) Date: Sun, 16 Mar 2014 11:18:38 +0000 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com> On 16 Mar 2014, at 04:12, Naena Guru wrote: > Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font... What a terrible, terrible idea. You are essentially promoting giving up writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack. > Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages. No, it isn't. It's a huge step backwards, unless you propose abolishing the Sinhala script entirely and just writing in Latin. > The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala.
The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation. That person understood the nature of data integrity. As does everyone who cares about the Universal Character Set. Michael Everson * http://www.evertype.com/ From jf at colson.eu Sun Mar 16 07:12:16 2014 From: jf at colson.eu (Jean-François Colson) Date: Sun, 16 Mar 2014 13:12:16 +0100 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: <53259520.8000206@colson.eu> Le 16/03/14 08:15, David Starner a écrit : > On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru wrote: >> I made a presentation demonstrating Dual-script Singhala at National >> Science >> Foundation of Sri Lanka. Most of the attendees were government >> employees and >> media representatives; a few private citizens came too. > I don't know what the point was of sending this message. You claim > that Unicode Sinhala was outside the subject matter of the > presentation, so why would you post it to this list? > I think the point is in the survey: the people questioned would have answered: -- that they believe that Dual-script Singhala is convenient to them, -- that Unicode Sinhala isn't easy and/or has problems, -- that the cost of Unicode Sinhala should be eliminated by switching to Dual-script Singhala -- and that they should focus on making this "easy-to-use" Dual-script Singhala method a standard. The big question is what is difficult in Unicode Sinhala. Is there anything Unicode could do to change that feeling? What precisely is the cost of Unicode Sinhala? Does its use require a teaching period the hack wouldn't need? Why?
Le 16/03/14 08:15, David Starner a ?crit : > On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru wrote: >> I made a presentation demonstrating Dual-script Singhala at National Science >> Foundation of Sri Lanka. Most of the attendees were government employees and >> media representatives; a few private citizens came too. > I don't know what the point was of sending this message. You claim > that Unicode Sinhala was outside the subject matter of the > presentation, so why would you post it to this list? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Sun Mar 16 08:10:13 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sun, 16 Mar 2014 13:10:13 +0000 (GMT) Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> Thank you for starting this thread. It is good to read of developments. I remembered a system that I designed many years ago for entering Esperanto text using an ordinary keyboard. Some years ago I included it in a story. http://www.users.globalnet.co.uk/~ngo/euto0008.htm The idea was that characters not on an ordinary QWERTY keyboard could be entered using an ordinary QWERTY keyboard. If that idea were implemented today then it could be used to enter Esperanto text, with the keystrokes converted into Unicode characters. However, that system was just for entering a few accented characters into a text written in Latin script and Esperanto does not have ligatures. Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard? > It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. How is the keyboard altered from QWERTY please? Are you publishing the font please? 
So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system? William Overington 16 March 2014 From wjgo_10009 at btinternet.com Sun Mar 16 11:05:45 2014 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sun, 16 Mar 2014 16:05:45 +0000 (GMT) Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> > So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system? Could this be achieved if a text-processing software package were produced that could automatically perform a character string to character string substitution (namely Romanized Singhala character string to Unicode character string) that would be applied before any OpenType glyph to glyph substitution? The character string to character string substitution rules could be stored in a text file, such as a UTF-16 text file saved from WordPad, that format being what WordPad describes as a Unicode Text Document file type. Could this be achieved? If so, text entry could use an ordinary QWERTY keyboard and yet the resulting text would be stored using the appropriate Unicode characters for the script and the font would use Unicode mappings. 
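A minimal sketch of the string-to-string substitution described above, written in Python. The romanized spellings and the rule table are hypothetical, made up only for illustration (they are not Naena Guru's scheme, which is not specified in this thread); the Sinhala code points on the right are standard Unicode. The rule-file layout assumed by load_rules (one rule per line, source and target separated by a tab, saved as UTF-16) is likewise an assumption. Rules are applied by greedy longest match, and any character without a rule is passed through unchanged.

```python
# Hypothetical romanization rules: Latin-letter strings -> Unicode Sinhala.
# The spellings on the left are illustrative assumptions, not a published scheme.
RULES = {
    "a": "\u0D85",        # SINHALA LETTER AYANNA
    "aa": "\u0D86",       # SINHALA LETTER AAYANNA
    "ka": "\u0D9A",       # SINHALA LETTER ALPAPRAANA KAYANNA (inherent vowel)
    "k": "\u0D9A\u0DCA",  # same letter + SINHALA SIGN AL-LAKUNA (no vowel)
    "ma": "\u0DB8",       # SINHALA LETTER MAYANNA (inherent vowel)
    "m": "\u0DB8\u0DCA",  # same letter + SINHALA SIGN AL-LAKUNA (no vowel)
}


def load_rules(path):
    """Read rules from a UTF-16 text file: one 'source<TAB>target' per line.

    The file layout is an assumption made for this sketch.
    """
    rules = {}
    with open(path, encoding="utf-16") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            source, target = line.split("\t", 1)
            rules[source] = target
    return rules


def transliterate(text, rules=RULES):
    """Replace romanized strings with Unicode strings, longest match first."""
    longest = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate at this position, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched: keep the character as typed.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("amma"))
```

With a step like this run at input time, what gets stored is ordinary Unicode Sinhala text, and any Unicode-mapped font can do the glyph shaping afterwards.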
William Overington 16 March 2014 From asmusf at ix.netcom.com Sun Mar 16 13:15:00 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 16 Mar 2014 11:15:00 -0700 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: <5325EA24.60503@ix.netcom.com> On 3/16/2014 9:05 AM, William_J_G Overington wrote: >> So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system? > Could this be achieved if a text-processing software package were produced that could automatically perform a character string to character string substitution (namely Romanized Singhala character string to Unicode character string) that would be applied before any OpenType glyph to glyph substitution? > > The character string to character string substitution rules could be stored in a text file, such as a UTF-16 text file saved from WordPad, that format being what WordPad describes as a Unicode Text Document file type. > > Could this be achieved? It's software. What do you think? :) A./ > > If so, text entry could use an ordinary QWERTY keyboard and yet the resulting text would be stored using the appropriate Unicode characters for the script and the font would use Unicode mappings. 
> > William Overington > > 16 March 2014 > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From prosfilaes at gmail.com Sun Mar 16 13:37:41 2014 From: prosfilaes at gmail.com (David Starner) Date: Sun, 16 Mar 2014 11:37:41 -0700 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <53259520.8000206@colson.eu> References: <53259520.8000206@colson.eu> Message-ID: On Sun, Mar 16, 2014 at 5:12 AM, Jean-François Colson wrote: > Le 16/03/14 08:15, David Starner a écrit : > > I don't know what the point was of sending this message. You claim > > that Unicode Sinhala was outside the subject matter of the > > presentation, so why would you post it to this list? > > > I think the point is in the survey : I suspect I could get a similar group to agree that all programming should be done in ALGOL, too. Feed a well-done seminar to the right people who aren't well-educated in the subject, and you'll get whatever results you want from the survey. > The big question is what is difficult in Unicode Sinhala. Is there anything > Unicode could do to change that feeling? Naena Guru doesn't care. In fact, Naena Guru's seminar contributed to that feeling before the survey. Without the input of some Sinhala user that doesn't have an ax to grind, there's not much that can be done. -- Kie ekzistas vivo, ekzistas espero. From jf at colson.eu Sun Mar 16 14:52:40 2014 From: jf at colson.eu (Jean-François Colson) Date: Sun, 16 Mar 2014 20:52:40 +0100 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> Message-ID: <53260108.6080209@colson.eu> Le 16/03/14 14:10, William_J_G Overington a écrit : > Thank you for starting this thread. > > It is good to read of developments.
> > I remembered a system that I designed many years ago for entering Esperanto text using an ordinary keyboard. > > Some years ago I included it in a story. > > http://www.users.globalnet.co.uk/~ngo/euto0008.htm > > The idea was that characters not on an ordinary QWERTY keyboard could be entered using an ordinary QWERTY keyboard. That's the raison-d'être of the Compose key available on most Linux/Unix computers: you type compose, apostrophe, e and you get an é; you type compose, a, e and you get an æ; you type compose, question mark, plus, o and you get an ở; you type compose, 5, 8 and you get a ⅝; etc. > > If that idea were implemented today It is! But neither on Windows nor on MacOS. > then it could be used to enter Esperanto text, That is possible. For Ĉ, you can type compose + ^ + C. For ĉ, you can type compose + ^ + c. For Ĝ, you can type compose + ^ + G. For ĝ, you can type compose + ^ + g. For Ĥ, you can type compose + ^ + H. For ĥ, you can type compose + ^ + h. For Ĵ, you can type compose + ^ + J. For ĵ, you can type compose + ^ + j. For Ŝ, you can type compose + ^ + S. For ŝ, you can type compose + ^ + s. For Ŭ, you can type compose + U + U or compose + b + U. For ŭ, you can type compose + U + u or compose + u + u or compose + b + u. The problem is that, for a letter as frequent as ĉ in Esperanto, typing compose + (shift + 6) + c isn't very ergonomic: a dedicated keyboard layout is better. > with the keystrokes converted into Unicode characters. > > However, that system was just for entering a few accented characters into a text written in Latin script and Esperanto does not have ligatures. > > Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard? > >> It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. > How is the keyboard altered from QWERTY please? > > Are you publishing the font please? In fact, I think he was speaking of the bare American (US) qwerty.
An international version of it should do the job. Looking at his site http://lovatasinhala.com/ and making a copy and paste of the page contents, you see he uses 7-bit ASCII, a few Latin-1 accented vowels, and a few additional "letters" such as ?, ?, ?, ? and ?. Naena Guru's aim is not to make an input method to type Sinhalese. Sinhalese keyboard layouts already exist: http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html http://kaputa.com/uniwriter/apple.gif http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html His aim is rather to make an 8-bit font to replace that "difficult" and "expensive" Unicode-compliant Sinhalese. > > So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system? Of course. Everything can be produced with a QWERTY keyboard, provided you supply an appropriate driver. > > William Overington > > 16 March 2014 > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From lang.support at gmail.com Sun Mar 16 15:05:22 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Mon, 17 Mar 2014 07:05:22 +1100 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <53260108.6080209@colson.eu> References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <53260108.6080209@colson.eu> Message-ID: On 17/03/2014 6:55 AM, "Jean-François Colson" wrote: > > Le 16/03/14 14:10, William_J_G Overington a écrit : > >> >> Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard? >> >>> It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. >> >> How is the keyboard altered from QWERTY please? >> >> Are you publishing the font please?
> > > In fact, I think he was speaking of the bare American (US) qwerty. An international version of it should do the job. > > Looking at his site http://lovatasinhala.com/ and making a copy and paste of the page contents, you see he uses 7-bit ASCII, a few Latin-1 accented vowels, and a few additional "letters" such as ?, ?, ?, ? and ?. > He also makes a case distinction, where upper and lowercase versions of some characters produce different Sinhala characters. > Naena Guru's aim is not to make an input method to type Sinhalese. Sinhalese keyboard layouts already exist: > http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html > http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html > http://kaputa.com/uniwriter/apple.gif > http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html > > His aim is rather to make an 8-bit font to replace that "difficult" and "expensive" Unicode compliant Sinhalese. > Creating a new set of difficulties. Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Mar 16 15:30:24 2014 From: doug at ewellic.org (Doug Ewell) Date: Sun, 16 Mar 2014 14:30:24 -0600 Subject: Romanized Singhala got great reception in Sri Lanka Message-ID: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell> Jean-François Colson wrote: >> The idea was that characters not on an ordinary QWERTY keyboard could >> be entered using an ordinary QWERTY keyboard. > > That's the raison-d'être of the Compose key available on most Linux/ > Unix computers: >> If that idea were implemented today > > It is! But neither on Windows nor on MacOS. There are plenty of dead-key keyboard layouts available for Windows and Mac computers. The sequences are different from using a Compose key, but the principle is the same. As Jean-François observed, the keyboard layout wasn't really the OP's point. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell
From jf at colson.eu Sun Mar 16 15:50:47 2014 From: jf at colson.eu (Jean-François Colson) Date: Sun, 16 Mar 2014 21:50:47 +0100 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell> References: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell> Message-ID: <53260EA7.70406@colson.eu> Le 16/03/14 21:30, Doug Ewell a écrit : > Jean-François Colson wrote: > >>> The idea was that characters not on an ordinary QWERTY keyboard could >>> be entered using an ordinary QWERTY keyboard. >> >> That's the raison-d'être of the Compose key available on most Linux/ >> Unix computers: > >>> If that idea were implemented today >> >> It is! But neither on Windows nor on MacOS. > > There are plenty of dead-key keyboard layouts available for Windows > and Mac computers. The sequences are different from using a Compose > key, but the principle is the same. Of course, I know that. I have already examined the default keyboard layouts for Windows http://msdn.microsoft.com/en-us/goglobal/bb964651.aspx (there are a few mistakes on those maps), MacOS and GNU/Linux (folder /usr/share/X11/xkb/symbols/). My own everyday keyboard layout has no fewer than 20 (twenty) dead keys. The idea here was "that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._" Are there any dead keys on an _ordinary_ (i.e. not one using an international(ized) driver) QWERTY keyboard? If a character is available by a dead key, isn't it on the keyboard? > > As Jean-François observed, the keyboard layout wasn't really the OP's > point. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell
> _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From marc at keyman.com Sun Mar 16 16:09:42 2014 From: marc at keyman.com (Marc Durdin) Date: Sun, 16 Mar 2014 21:09:42 +0000 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell> References: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD0B699@federation.tavultesoft.local> To me the real question is, what are the roadblocks that the other people at this forum saw in using Unicode? I'm not talking about the proponents of non-Unicode solutions, but those who would otherwise be agnostic given the right support. And what can we do to address those concerns? (1) Rendering support still lags -- if the characters don't render properly in Unicode but they do in HackAscii, then HackAscii wins. Does any operating system renderer today support all the complex scripts in Unicode 6? How many users need to upgrade their OS in order to get access to a working renderer? What about mobile operating systems? (2) Fonts are much harder to create. Instead of just needing a graphic designer to draw characters, you now need to a programmer as well, who understands OpenType tables. This is especially a problem for the very popular decorative fonts, which are created by graphic design houses with little interest in the finer nuances of shaping rules. Again, HackAscii wins. (3) Many of the Unicode input methods have been hard for end users to adapt to. I've pushed Unicode in this space for nearly 20 years, but even today, I continue run up against points (1) and (2) with language partners. HackAscii has slightly less of an advantage here, because you still tend to need some intelligence in your keyboard layout for most HackAscii solutions. Of course these are solvable. 
But when a HackAscii proponent can demonstrate an easier solution, then the slightly more subtle advantages of Unicode tend to be lost in the simple fact that for /my language/, HackAscii "just works". It's hard to argue the advantages of Unicode when you cannot show a working solution. And arguing the disadvantages of HackAscii is pointless until you can demonstrate the alternative working to the user's satisfaction. Marc -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Monday, 17 March 2014 7:30 AM To: unicode at unicode.org Cc: Jean-François Colson Subject: Re: Romanized Singhala got great reception in Sri Lanka Jean-François Colson wrote: >> The idea was that characters not on an ordinary QWERTY keyboard could >> be entered using an ordinary QWERTY keyboard. > > That's the raison-d'être of the Compose key available on most Linux/ > Unix computers: >> If that idea were implemented today > > It is! But neither on Windows nor on MacOS. There are plenty of dead-key keyboard layouts available for Windows and Mac computers. The sequences are different from using a Compose key, but the principle is the same. As Jean-François observed, the keyboard layout wasn't really the OP's point. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From doug at ewellic.org Sun Mar 16 18:47:24 2014 From: doug at ewellic.org (Doug Ewell) Date: Sun, 16 Mar 2014 17:47:24 -0600 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Message-ID: Jean-François Colson wrote: > The idea here was "that characters not on an ordinary QWERTY keyboard > could be entered _using_an_ordinary_QWERTY_keyboard._" Are there any > dead keys on an _ordinary_ (i.e. not one using an international(ized) > driver) QWERTY keyboard?
Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS. The standard vanilla U.S. keyboard also doesn't provide the accented letters and other non-ASCII letters like ? that Naena Guru uses for his font hack. > If a character is available by a dead key, isn't it on the keyboard? It depends on what you mean by "on the keyboard." Thanks to John Cowan's delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to get the fraction ⅐ (one-seventh). That character is not "on the keyboard" in any sense other than what the driver provides. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From doug at ewellic.org Sun Mar 16 20:01:08 2014 From: doug at ewellic.org (Doug Ewell) Date: Sun, 16 Mar 2014 19:01:08 -0600 Subject: [private] Re: Unicode : Greek Extended. Message-ID: <201403170102.s2H124d5019796@unicode.org> No, the precomposed characters that were added for compatibility were added back in 1992 or so. It was still possible then, not now. The problem with Marshallese is that long ago, people thought the difference between cedilla below and comma below was just a glyph choice, so fonts were built that showed either cedilla or comma according to the whim of the designer. It turns out to matter a great deal to some people, so there is now a scramble to complete the disunification. I don't know about the Yoruba line-below. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell -----Original Message----- From: "Richard BUDELBERGER" Sent: 3/16/2014 18:51 To: "Doug Ewell" Cc: "everson at evertype.com" ; "cowan at mercury.ccil.org" ; "unicode at unicode.org" Subject: re: [private] Re: Unicode : Greek Extended. Thanks to all of you for your answers : > Message du 16/03/14 16:41 > De : Doug Ewell > A : budelberger.richard at wanadoo.fr > Copie à : > Objet : [private] Re: Unicode : Greek Extended.
> > Richard BUDELBERGER wrote: > > > A little off topic, but can somebody help me to add three (six) more Read ??three?(five)??. > > characters to Unicode? that is : > > ? GREEK CAPITAL LETTER SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ; > > ? GREEK SMALL LETTER SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ; > > ? GREEK LETTER STIGMA WITH LINE (or MACRON) BELOW : ?? ?? ; > > ? GREEK SMALL LETTER STIGMA WITH LINE (or MACRON) BELOW : ?? ?? ; Read ??GREEK SMALL LETTER FINAL SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;??. > > ? GREEK CAPITAL LETTER CHI WITH LINE (or MACRON) BELOW : ?? ?? ; > > ? GREEK SMALL LETTER CHI WITH LINE (or MACRON) BELOW : ?? ??. > > > > See some samples here : > > https://fr.wiktionary.org/wiki/Category:syriaque. > > Both you and Wiktionary proved why these precomposed characters won't be > accepted: because they can already be represented using combining > characters. If this were something that could not already be represented > any other way, then it would be different. > > Some fonts don't display this correctly; they show the macron partially > or completely to the right of the base letter, instead of directly below > it. The solution is to use another font, and to ask font vendors to fix > this combination so it looks decent. > > The correct combining character is U+0331. U+0332 is intentionally a > very long line, suitable for math-type applications. All of the existing > "WITH LINE BELOW" characters (which were added for compatibility with > existing character sets) decompose to U+0331; nothing decomposes to > U+0332. So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them?(?? ?? ?? and H? ?? ??) for compatibility??? There was the same problem with Yoruba letters with U+0329 ? ???, and with the new Marshallese alphabet with U+0326 ?????. R. Budelberger. > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell ? 
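The point about combining characters in the quoted reply can be checked with Python's standard unicodedata module. This is only a small sketch, and the choice of sample letters is ours: an existing Latin "WITH LINE BELOW" letter (U+1E0F) decomposes to its base letter plus U+0331, whereas Greek small sigma followed by U+0331 has no precomposed form, so normalization leaves it as a two-code-point sequence.

```python
import unicodedata

MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW (the recommended mark)
LOW_LINE = "\u0332"      # COMBINING LOW LINE (the long, math-style line)


def describe(text):
    """List each code point of text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# An existing precomposed letter: LATIN SMALL LETTER D WITH LINE BELOW.
d_line_below = "\u1E0F"
decomposed = unicodedata.normalize("NFD", d_line_below)

# Greek small sigma + combining macron below: no precomposed character
# exists, so NFC cannot shorten the sequence.
sigma_line_below = "\u03C3" + MACRON_BELOW
composed = unicodedata.normalize("NFC", sigma_line_below)

print(describe(decomposed))
print(describe(composed))
```

Whether the mark is drawn centred under the sigma is then purely a font matter, which is the remedy suggested in the reply quoted above.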
-------------- next part -------------- An HTML attachment was scrubbed... URL: From naenaguru at gmail.com Sun Mar 16 21:44:08 2014 From: naenaguru at gmail.com (Naena Guru) Date: Sun, 16 Mar 2014 21:44:08 -0500 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: Message-ID: Philippe: All you said about ISCII is probably right. So, it has given you guys a lot of pain. I did not do it nor followed it. As for Japanese (and also for Indic) I have read the warnings in RFC 1815: http://tools.ietf.org/rfc/rfc1815.txt I am not creating a transcoding table as you say. I assume you think I take Unicode Sinhala to be a legitimate encoding for Singhala that I am mapping to SBCS for the love of SBCS. No. And I don't know what concepts I am mixing. I am trained in Computer Science, I have taught it at college level, and have done years of consulting work and written project proposals for a pretty good size one for the Federal Government too. I believe that you need to understand the problem at hand to find a solution for it. You cannot make solutions for Indic not knowing Indic. Starting blindly with ISCII was a mistake. It is useless at least for Singhala. ========= STORY OF UNICODE SINHALA ========== The first draft for the Sinhala chart was handwritten by Andy Daniels. He mentioned some doubts about some letters in it. He had a good instinct on that. It sat there people wondering from where he got his information. He said from Germany. Someone said that it came from a $300 book. I suspect that it is Rev. Fr. A.M.Gunasekara's book (1891). Then came the Lion of Unicode Michael Everson (down in this thread). He was making fonts by the dozen and took Daniels' draft certified the letters side of it, not having a nicely printed set of the digits. This certificate was countersigned by a Mettavihari for users. I know Ven. Mettavihari. He is a Danish man that researched and put up the most comprehensive Tripitaka, the Buddhist canon. 
This irreproachable man denies that he endorsed the standard on behalf of the Singhalese saying obviously he is not Singhalese. (Actually, I think he is more Singhalese than me). Who signed as him, a forgery? When the code chart came to Lanka, the closest to a computer that they knew was the IBM Selectric typewriter. When they did not do anything about it, the World Bank offered a $83 scheme to bring Lanka to the computer age all the way so the village fellow could communicate with the government online. They set up the IT agency ICTA and got the academics gathered there doing 'projects'. They even paid a fellow to come over and read the OpenType specification for them. I understand that the kingpin of the operations there is one person that studied in US.He is the adviser to the President, The top Colombo University and the ICTA itself. He is one consultant that does most projects. When Everson wanted to add the digits apparently finding Fr. Gunasekara's book, the Lankans denied such existed. When he showed them, they said they are not necessary. Now this everybody's consultant announced at my presentation that they are going to add them. ============ END STORY OF UNICODE SINHALA ========== BAD UNICODE SINHALA: Unicode Singhala violates Singhala / Sanskrit grammar. Unicode Singhala is not compatible with Sanskrit, an integral part of the Singhala script. That also applies to Pali whose native script is Singhala. Unicode Sinhala further helps kill Singhala by making it very difficult to type and impossible to obtain the entire repertoire of letters and limiting the applications and OSs that it can be used in. Typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too. There is a program called Helabasa by Keyman typing system that printers use to type it. There is a physical keyboard too. 
Then there is Google transliteration - very inadequate - and another one by Colombo University found on a web page. These last two allow you to type phonetically but not entirely. The result is that very few people type Unicode Singhala: only those whose jobs require them to. PERFECT ROMANIZED SINGHALA I did the same thing English and Western European languages did; very close. I mapped the well-known 58+2 Singhala-Sanskrit phonemes in the SBCS. The reason is that Singhala then gets to use all those applications perfected over decades that most Westerners here enjoy. That set covers all letters necessary for Singhala, Sanskrit and Pali, the three languages that use the Singhala script. See it here displayed using the first orthographic smartfont: http://lovatasinhala.com/ MORE READING: Let's look at this as a lay person (whose interest is our ultimate goal) sees it: English was fully romanized from fuþark by about 600 AD. Romanizing is writing by using letters of the Latin alphabet plus many, many others added to it. When Europeans became fully Christianized / literate, they all adopted Latin letters and extended them as they pleased. This set has branched off as Latin script and Cyrillic script. The printing industry standardized the greater part of the alphabets. Singhala has a well defined phoneme chart called hodiya. It is an extension of the Sanskrit hodiya. Rev. Fr. Theodore G. Perera's grammar book (1932) and Rev. Fr. A. M. Gunasekera's book (1891) that dug up sinking Singhala fully describe the writing system. Like most other languages, including English before printing arrived in England, it is written phonetically. Singhala was romanized first in the 1860s by Rhys Davids, in what is called the PTS scheme, to print Pali (Magadhi) in the Latin script. This requires letters with bars (macron) and dots not found in common fonts. This scheme is called PTS Pali. It is similar to IAST Sanskrit. It is impossible to type these on the regular keyboard.
I freshly romanized Singhala by mapping its phonemes to the SAME area 13 Western European languages mapped their alphabetic letters within the following Unicode code charts: http://www.unicode.org/charts/PDF/U0000.pdf http://www.unicode.org/charts/PDF/U0080.pdf So, if that is "creating a transcoding table" all Europeans did it and I do it too. On Sun, Mar 16, 2014 at 12:36 AM, Philippe Verdy wrote: > Don't you realize that what you are trying to create is completely out of > topic of Unicode, as it is simply another new 8-bit encoding similar to > what ISCII does for supporting multiple Indic scripts with a common > encoding/transcoding table? > > The ISCII standard has shown its limitations, it cannot be enough to > support all scripts correctly and completely, it has lots of unsolved > ambiguities for tricky cases or historic orthographies, or newer > orthographies, that the UCS encoding better supports due to its larger > character set and more precise character properties and algorithms. > > You are in fact creating a transcoding table... Except that you are mixing > the concepts; and the Unicode and ISO technical commitees working on the > UCS don"t need to handle new 8-bit encodings. And you'll soon experiment > the same problems as in ISCII and all other legacy 8-bit encodings: very > poor INTEROPERABILITY due to version tracking or complax contextual rules... > > You may still want to promote it at some government or education > institution, in order to promote it as a national standard, except that > there's little change it will ever happen when all countries in ISO have > stopoed working on standardization of new 8-bit encodings (only a few ones > are maintained; but these are the most complex ones used in China and Japan. > > Well in fact only Japan now seens to be actively updating its legacy JIS > standard; but only with the focus of converging it to use the UCS and solve > ambiguities or solve some technical problems (e.g. 
with emojis used by
> mobile phone operators). Even China stopped updating its national standard
> by publishing a final mapping table to/from the full UCS (including for
> characters still not encoded in the UCS): this simplified the work because
> only one standard needs to be maintained instead of two.
>
> Note that as long as there is no national standard supporting your
> proposed encoding, there is no chance that the font standards will adopt it.
> You may still want to register your encoding in the IANA registry, but
> you'll need to pass the RFC validation. And there are lots of technical
> details missing from your proposal that would be needed to support it with
> a standard mapping in fonts.
>
> There is a better chance for you to promote it only as a transliteration
> scheme, or as an input method for a keyboard layout (both are also outside the
> scope of the Unicode and ISO/IEC 10646 standards, though; they could be in
> the scope of the CLDR project, which is not by itself a standard but just a
> repository of data, supported by a few standards)... Think about it.
>
> 2014-03-16 5:12 GMT+01:00 Naena Guru:
>
>> I made a presentation demonstrating Dual-script Singhala at National
>> Science Foundation of Sri Lanka. Most of the attendees were government
>> employees and media representatives; a few private citizens came too.
>>
>> Dual-script Singhala means romanized Singhala that can be displayed
>> either in the Latin script or in the Singhala script using an Orthographic
>> Smart Font. It is easy to input (phonetically) using a keyboard layout
>> slightly altered from QWERTY. The font uses Standard Ligature feature
>> of OpenType / OpenFont standard to display glyphs of Sanskrit
>> ligatures as well as many Singhala letters. The font is supported across
>> all OSs: Windows, Macintosh, Linux, iOS and Android.
Dual-script Singhala
>> is the proper and complete solution on the computer for the Singhala script
>> used to write Singhala, Sanskrit and Pali languages. The same solution can
>> be applied for all Indic languages.
>>
>> The government ministries, media and people welcomed it with enthusiasm
>> and relief that there is something practical for Singhala. The response in
>> the country was singularly positive, except for one person who
>> filibustered the Q&A session of the presentation by speaking about the hard
>> work done on Unicode Sinhala, clearly outside the subject matter of the
>> presentation.
>>
>> The result of the survey passed around was 100% as below (translated from
>> Singhala):
>>
>> 1. I believe that Dual-script Singhala is convenient to me as it is
>>    implemented similar to English - Yes
>> 2. Today everyone uses Unicode Sinhala. It is easy and has no
>>    problems - No
>> 3. The cost of Unicode Sinhala should be eliminated by switching to
>>    Dual-script Singhala - Yes
>> 4. We should amend Pali text in the Tripitaka according to rulings of
>>    SLS1134 - No
>> 5. Digitizing old books is a very important thing - Yes
>> 6. We should focus on making this easy-to-use Dual-script Singhala
>>    method a standard - Yes
>>
>> Please comment or send questions.

From marc at keyman.com Sun Mar 16 21:52:19 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 02:52:19 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To:
References:
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD0DF80@federation.tavultesoft.local>

Naena,

If you have an encoding which is easy to type, that can be replicated with Keyman, or any number of other input systems, for Unicode Singhala.
Input is not tied to encoding. I would be happy to assist you, off-list, to develop an input method for Unicode Singhala that works according to your requirements. However, if you have examples of Singhala which cannot be represented in Unicode, please do bring these to the attention of this list. But differences in input method are not really relevant. Marc From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru Sent: Monday, 17 March 2014 1:44 PM To: Philippe Verdy Cc: jc at ahangama.com; Unicode List Subject: Re: Romanized Singhala got great reception in Sri Lanka Philippe: All you said about ISCII is probably right. So, it has given you guys a lot of pain. I did not do it nor followed it. As for Japanese (and also for Indic) I have read the warnings in RFC 1815: http://tools.ietf.org/rfc/rfc1815.txt I am not creating a transcoding table as you say. I assume you think I take Unicode Sinhala to be a legitimate encoding for Singhala that I am mapping to SBCS for the love of SBCS. No. And I don't know what concepts I am mixing. I am trained in Computer Science, I have taught it at college level, and have done years of consulting work and written project proposals for a pretty good size one for the Federal Government too. I believe that you need to understand the problem at hand to find a solution for it. You cannot make solutions for Indic not knowing Indic. Starting blindly with ISCII was a mistake. It is useless at least for Singhala. ========= STORY OF UNICODE SINHALA ========== The first draft for the Sinhala chart was handwritten by Andy Daniels. He mentioned some doubts about some letters in it. He had a good instinct on that. It sat there people wondering from where he got his information. He said from Germany. Someone said that it came from a $300 book. I suspect that it is Rev. Fr. A.M.Gunasekara's book (1891). Then came the Lion of Unicode Michael Everson (down in this thread). 
He was making fonts by the dozen and took Daniels' draft certified the letters side of it, not having a nicely printed set of the digits. This certificate was countersigned by a Mettavihari for users. I know Ven. Mettavihari. He is a Danish man that researched and put up the most comprehensive Tripitaka, the Buddhist canon. This irreproachable man denies that he endorsed the standard on behalf of the Singhalese saying obviously he is not Singhalese. (Actually, I think he is more Singhalese than me). Who signed as him, a forgery? When the code chart came to Lanka, the closest to a computer that they knew was the IBM Selectric typewriter. When they did not do anything about it, the World Bank offered a $83 scheme to bring Lanka to the computer age all the way so the village fellow could communicate with the government online. They set up the IT agency ICTA and got the academics gathered there doing 'projects'. They even paid a fellow to come over and read the OpenType specification for them. I understand that the kingpin of the operations there is one person that studied in US.He is the adviser to the President, The top Colombo University and the ICTA itself. He is one consultant that does most projects. When Everson wanted to add the digits apparently finding Fr. Gunasekara's book, the Lankans denied such existed. When he showed them, they said they are not necessary. Now this everybody's consultant announced at my presentation that they are going to add them. ============ END STORY OF UNICODE SINHALA ========== BAD UNICODE SINHALA: Unicode Singhala violates Singhala / Sanskrit grammar. Unicode Singhala is not compatible with Sanskrit, an integral part of the Singhala script. That also applies to Pali whose native script is Singhala. Unicode Sinhala further helps kill Singhala by making it very difficult to type and impossible to obtain the entire repertoire of letters and limiting the applications and OSs that it can be used in. 
Typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too. There is a program called Helabasa by Keyman typing system that printers use to type it. There is a physical keyboard too.
From doug at ewellic.org Sun Mar 16 22:11:26 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 16 Mar 2014 21:11:26 -0600
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID: <8AA652F8D9B2449A8EC7136E4972283E@DougEwell>

Naena Guru wrote:

> As for Japanese (and also for Indic) I have read the warnings in RFC
> 1815:
> http://tools.ietf.org/rfc/rfc1815.txt

That explains everything.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From jf at colson.eu Sun Mar 16 23:16:36 2014
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 17 Mar 2014 05:16:36 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To:
References:
Message-ID: <53267724.5050406@colson.eu>

> As for Japanese (and also for Indic) I have read the warnings in RFC 1815:
> http://tools.ietf.org/rfc/rfc1815.txt
>
> RFC 1815 Character Sets ISO-10646 and ISO-10646-J-1 July 1995

July 1995? Is that document up-to-date?

From jf at colson.eu Sun Mar 16 23:32:37 2014
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 17 Mar 2014 05:32:37 +0100
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <201403170102.s2H124d5019796@unicode.org>
References: <201403170102.s2H124d5019796@unicode.org>
Message-ID: <53267AE5.9010106@colson.eu>

> > Some fonts don't display this correctly; they show the macron partially
> > or completely to the right of the base letter, instead of directly below
> > it. The solution is to use another font, and to ask font vendors to fix
> > this combination so it looks decent.

"(2) Fonts are much harder to create. Instead of just needing a graphic designer to draw characters, you now need to a programmer as well, who understands OpenType tables. [...] Again, HackAscii wins."
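
For readers following the macron-below discussion, here is a minimal sketch of what such a combination looks like at the code point level. It assumes the letters in question are Greek chi and sigma followed by U+0331 COMBINING MACRON BELOW (the thread also considers U+0332 COMBINING LOW LINE); the helper names are made up for illustration. It only shows that canonical composition (NFC) has no single code point to fold these pairs into, so rendering depends on the font placing the mark, whereas a pair such as d + U+0331 does have a precomposed form.

```python
import unicodedata

# Assumed samples: base letter followed by U+0331 COMBINING MACRON BELOW.
CHI_MACRON_BELOW = "\u03c7\u0331"    # Greek small chi + macron below
SIGMA_MACRON_BELOW = "\u03c3\u0331"  # Greek small sigma + macron below
LATIN_D_MACRON_BELOW = "d\u0331"     # Latin d + macron below (has a precomposed form)


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def nfc_length(text: str) -> int:
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    for sample in (CHI_MACRON_BELOW, SIGMA_MACRON_BELOW, LATIN_D_MACRON_BELOW):
        print(describe(sample), "->", nfc_length(sample), "code point(s) after NFC")
```

The Greek pairs stay as two code points after normalization, which is why the quality of the result rests on the font's mark positioning rather than on the encoding.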
From doug at ewellic.org Mon Mar 17 01:01:13 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 00:01:13 -0600
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID:

Jean-François Colson wrote:

> RFC 1815 Character Sets ISO-10646 and ISO-10646-J-1 July 1995
>
> July 1995? Is that document up-to-date?

It was obsolete at the time of publication.

RFC 1815 was a rant by someone who thought that:

(a) Unicode was fatally broken for representing Japanese because of Han unification, unlike ISO-2022-JP which by definition was used only for Japanese; and

(b) display is everything, and all characters not represented by a glyph in a Windows NT 3.51 font from 1995 ought to be excluded from interchange.

Sounds to me a lot like the present campaign.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From everson at evertype.com Mon Mar 17 04:47:02 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:47:02 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References:
Message-ID: <5C8924B5-493D-44E8-A74A-37F5057F6E0F@evertype.com>

On 16 Mar 2014, at 23:47, Doug Ewell wrote:

> Jean-François Colson wrote:
>
>> The idea here was "that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._" Are there any dead keys on an _ordinary_ (i.e. not one using an international(ized) driver) QWERTY keyboard?
>
> Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS.

Please distinguish between "keyboard" which is a piece of hardware and "keyboard layout" which is a software input method.

Michael Everson * http://www.evertype.com/

From everson at evertype.com Mon Mar 17 04:48:09 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:48:09 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
References: <61D9B397AD154740A01A99C34E39DB2A@DougEwell> <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
Message-ID:

On 17 Mar 2014, at 00:51, Richard BUDELBERGER wrote:

> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for compatibility ??

No.

> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?.

Yes, there is.

Michael Everson * http://www.evertype.com/

From everson at evertype.com Mon Mar 17 04:51:00 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:51:00 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To:
References: <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com>
Message-ID: <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com>

On 17 Mar 2014, at 02:48, Naena Guru wrote:

> You are talking about something you do not know. I am a Singhalese.

That doesn't give you any special knowledge or privilege. I know many people from Sri Lanka, who work in the area of computing, and who work with the Sinhala characters as encoded in the UCS.

And really. "The Lion of Unicode"? My stars.

Michael Everson

> On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson wrote:
>
> On 16 Mar 2014, at 04:12, Naena Guru wrote:
>
>> Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font...
>
> What a terrible, terrible idea. You are essentially promoting giving up writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack.
>
>> Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages.
>
> No, it isn't.
It's a huge step backwards, unless you propose abolishing the Sinhala script entirely and just writing in Latin.
>
>> The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation.
>
> That person understood the nature of data integrity.

As does everyone who cares about the Universal Character Set.

Michael Everson * http://www.evertype.com/

From wjgo_10009 at btinternet.com Mon Mar 17 05:36:53 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 17 Mar 2014 10:36:53 +0000 (GMT)
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <5325EA24.60503@ix.netcom.com>
References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> <5325EA24.60503@ix.netcom.com>
Message-ID: <1395052613.21848.YahooMailNeo@web87806.mail.ir2.yahoo.com>

>> Could this be achieved?

> It's software. What do you think?
> :)

Well, it is not entirely software as there seems to be discussion about whether Unicode is regarded as a good encoding for the required purpose. I do not understand those issues at present but I am interested to learn.

I have produced a format for a transliteration file in case that will help. Is this format of help?

The format of the translit.dat file used for transliteration

This is a thought experiment at present.

Automated transliteration would be by having a file translit.dat available. In the thought experiment the file is a UTF-16 text file, such as can be saved from the WordPad program by selecting saving as a Unicode Text Document.

The translit.dat file would consist of a number of lines of text.
A valid line of text would have one of three possible formats.

If the first character of the line is an asterisk, then the line is a comment.

If the first character of the line is a PERCENT SIGN then the line is the last line of the file.

Otherwise the line is intended to be a transliteration line, yet it only is a transliteration line if it is of the correct structure.

The correct structure for a transliteration line is as follows.

One or more characters that are not the VERTICAL LINE character.
A VERTICAL LINE character.
One or more characters that are not the VERTICAL LINE character.

The possibility was considered that on some software platforms there might be complications, while reading characters from the translit.dat file, regarding detecting the end of the translit.dat file. For that reason, a line whose first character is a PERCENT SIGN explicitly marks the last line of the file.

In a translit.dat file produced as a Unicode Text Document saved from the WordPad program, lines are separated by two characters, namely CARRIAGE RETURN and LINE FEED, in that order. That is, pressing the return key on the keyboard produces two characters in a Unicode Text Document saved from the WordPad program.

The final five characters of the translit.dat file are here specified to be as follows.

CARRIAGE RETURN
LINE FEED
PERCENT SIGN
CARRIAGE RETURN
LINE FEED

This is achieved using WordPad by pressing the return key both before and after the PERCENT SIGN has been entered.

It is noted that a Unicode Text Document saved from the WordPad program stores the two bytes of each character with the lower byte before the higher byte.

It is noted that a Unicode Text Document saved from the WordPad program starts with a U+FEFF character, used as a BYTE ORDER MARK. Thus the first two bytes of a translit.dat file do not represent a character used in the automated transliteration process.
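
The rules above can be sketched as a small reader. This is only an illustration of the thought experiment, not an implementation anyone in the thread has proposed: the function name parse_translit is made up, the choice to treat the left-hand side as the text to be replaced is an assumption (the description does not say which side is which), and lines that are neither comments nor well-formed transliteration lines are silently skipped.

```python
import codecs


def parse_translit(data: bytes) -> dict[str, str]:
    """Parse translit.dat content following the line rules described above.

    Assumption: the text before the VERTICAL LINE is the source and the
    text after it is the replacement.
    """
    # The file is little-endian UTF-16 and starts with a byte order mark.
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("expected a little-endian UTF-16 byte order mark")
    text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")

    table: dict[str, str] = {}
    for line in text.split("\r\n"):       # lines are separated by CR LF
        if line.startswith("%"):          # PERCENT SIGN: last line, stop reading
            break
        if line.startswith("*"):          # asterisk: comment line
            continue
        parts = line.split("|")
        # Exactly one VERTICAL LINE with non-empty text on both sides.
        if len(parts) == 2 and parts[0] and parts[1]:
            table[parts[0]] = parts[1]
    return table


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "* demo\r\nsh|\u0161\r\n%\r\n".encode("utf-16-le")
    print(parse_translit(sample))
```

Stopping at the PERCENT SIGN line means the reader never has to rely on platform-specific end-of-file detection, which is the concern the description raises.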
It is noted that for English and for some other languages a Unicode Text Document saved from the WordPad program has many bytes that have a value of zero. However, the use of a Unicode Text Document saved from the WordPad program is deliberately chosen for this system so as to make participation in producing a translit.dat file as straightforward as possible, and with the hope that software developed for automated transliteration using this system will work for all languages that can be represented using Unicode characters.

William Overington

17 March 2014

From moyogo at gmail.com Mon Mar 17 08:48:39 2014
From: moyogo at gmail.com (Denis Jacquerye)
Date: Mon, 17 Mar 2014 13:48:39 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To:
References: <61D9B397AD154740A01A99C34E39DB2A@DougEwell> <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
Message-ID:

The Syriac in Greek script shown in http://www.bethmardutho.org/index.php/hugoye/volume-index/585.html (which the fr.wiktionary.org articles are citing) has underlined chi and underlined sigma, not chi macron below or sigma macron below. See page 48: "For characters not found in the Greek alphabet, underlining is employed: the he is represented by underlined ch, and shin by underlined sigma."

Given what was mentioned so far here, one might assume this could be a macron below with a specific positioning instead of underline, but that would be just that: an assumption. It would be interesting to see original documents to have a better idea of what these should look like.

On Mon, Mar 17, 2014 at 9:48 AM, Michael Everson wrote:
> On 17 Mar 2014, at 00:51, Richard BUDELBERGER wrote:
>
>> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for compatibility ??
>
> No.
>
>> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?.
>
> Yes, there is.
>
> Michael Everson * http://www.evertype.com/

--
Denis Moyogo Jacquerye

From doug at ewellic.org Mon Mar 17 09:04:26 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 08:04:26 -0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References:
Message-ID: <289A5852AC2342BABF18C045E4157A90@DougEwell>

Michael Everson wrote:

>>> The idea here was that characters not on an ordinary QWERTY keyboard
>>> could be entered _using_an_ordinary_QWERTY_keyboard._ Are there any
>>> dead keys on an _ordinary_ (i.e. not one using an
>>> international(ized) driver) QWERTY keyboard?
>>
>> Not on the standard vanilla U.S. keyboard. It has to be provided by
>> the OS, via a driver, just as Compose key support has to be provided
>> by the OS.
>
> Please distinguish between "keyboard" which is a piece of hardware and
> "keyboard layout" which is a software input method.

Sorry for the shorthand. Everything I am talking about is software. I don't think there is such a thing as a physical dead key on a computer keyboard. The Compose key on *nix systems may be a physical key, but it doesn't have any special ability to compose characters unless given that ability by software.

"An ordinary QWERTY keyboard," as Jean-François put it, can generate any character, Latin or Sinhala or whatever, so long as the hardware has the right software behind it.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From doug at ewellic.org Mon Mar 17 09:16:16 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 08:16:16 -0600
Subject: Unicode : Greek Extended.
In-Reply-To:
References:
Message-ID: <27F8AD416E4D4EA9A68649603FCA8216@DougEwell>

Jean-François Colson wrote:

>> Some fonts don't display this correctly; they show the macron
>> partially or completely to the right of the base letter, instead of
>> directly below it.
The solution is to use another font, and to ask
>> font vendors to fix this combination so it looks decent.
>
> "(2) Fonts are much harder to create. Instead of just needing a
> graphic designer to draw characters, you now need to a programmer as
> well, who understands OpenType tables. [...] Again, HackAscii wins."

Richard wasn't suggesting HackAscii in this case. He was suggesting newly encoded precomposed characters.

[Richard's original post was to the ietf-languages list, perhaps because he isn't signed up for the Unicode list. I replied privately, but Richard's subsequent response was cc'd to several people and also to the Unicode list. The response didn't make it to the list, but my reply (sent from my phone, where I didn't notice all of Richard's recipients) did. So I think we can close this by pointing out, again, that no new precomposed characters of the form "existing base + existing combining character" will be encoded, ever.]

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From petercon at microsoft.com Mon Mar 17 10:36:29 2014
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 17 Mar 2014 15:36:29 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <53267AE5.9010106@colson.eu>
References: <201403170102.s2H124d5019796@unicode.org> <53267AE5.9010106@colson.eu>
Message-ID:

Font tables to position diacritics are not "much harder to create" than anything else involved in font development, and certainly don't require being a programmer. Hinting is harder than positioning tables and does literally involve programming, though I don't hear font developers griping about that. Professional font developers are not quite the luddites the comment suggests.

Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Jean-François Colson
Sent: March 16, 2014 9:33 PM
To: unicode at unicode.org
Subject: Re: [private] Re: Unicode : Greek Extended.
> > Some fonts don't display this correctly; they show the macron
> > partially or completely to the right of the base letter, instead of
> > directly below it. The solution is to use another font, and to ask
> > font vendors to fix this combination so it looks decent.

"(2) Fonts are much harder to create. Instead of just needing a graphic designer to draw characters, you now need to a programmer as well, who understands OpenType tables. [...] Again, HackAscii wins."

From budelberger.richard at wanadoo.fr Mon Mar 17 10:23:25 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Mon, 17 Mar 2014 16:23:25 +0100 (CET)
Subject: Unicode : Greek Extended.
Message-ID: <30528926.15541.1395069805050.JavaMail.www@wwinf1n09>

From: Denis Jacquerye
Date: Mon, 17 Mar 2014 13:48:39 +0000

> The Syriac in Greek script shown in
> http://www.bethmardutho.org/index.php/hugoye/volume-index/585.html
> (which the fr.wiktionary.org articles are citing) has underlined chi
> and underlined sigma, not chi macron below or sigma macron below.
> See page 48: "For characters not found in the Greek alphabet,
> underlining is employed: the he is represented by underlined ch, and
> shin by underlined sigma."
> Given what was mentioned so far here, one might assume this could be a
> macron below with a specific positioning instead of underline, but
> that would be just that: assumptions.

See now http://tinyurl.com/py72z7w !?

> It would be interesting to see original documents to have a better
> idea of what these should look like.

I wasn't able to find anything about this Mgr Butrus Gemayel's book?!

On Mon, Mar 17, 2014 at 9:48 AM, Michael Everson wrote:
> On 17 Mar 2014, at 00:51, Richard BUDELBERGER wrote:
>
>> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ??
and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for compatibility ?? > > No. > >> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?. > > Yes, there is. See http://www.unicover.com/ecatimag/MI-C309-.jpg & https://fr.wiktionary.org/wiki/Category:marshallais ! > Michael Everson * http://www.evertype.com/ > -- Denis Moyogo Jacquerye _______________________________________________ From naenaguru at gmail.com Mon Mar 17 11:08:19 2014 From: naenaguru at gmail.com (Naena Guru) Date: Mon, 17 Mar 2014 11:08:19 -0500 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: Message-ID: Making a keyboard is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International. When you type a key on the physical keyboard, you generate what is called a scan-code of that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress. Use Microsft's keyboard layout creator for all versions of Windows from XP: http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx Select the language carefully. I selected US-English for RS. That way, I can switch between the two keyboards quickly with Ctrl+Shift. You can change all these in the Control Panel. Here is the keymap I made for RS in Linux: http://ahangama.com/apiapi/singhala/linuxkb-s.php Just scroll down for the English part. (The lines starting with double slashes are comments and have no effect on the program) The Macintosh key layout is easy too. The story with iOS and Android are different but not hard either. 
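[Editorial sketch, not part of the thread: the idea described here, that each key is assigned one character or a sequence of characters, can be pictured as a plain lookup table. The key assignments are hypothetical and invented for illustration. They are not the Romanized Singhala layout, the Wijesekara layout, or anything MSKLC produces. Only the three code points are real Sinhala characters.]

```python
# Hypothetical layout table: key label -> string of one or more Unicode characters.
# These assignments are assumptions made up for this sketch.
LAYOUT = {
    "b": "\u0DB6",        # SINHALA LETTER ALPAPRAANA BAYANNA
    "l": "\u0DBD",        # SINHALA LETTER DANTAJA LAYANNA
    "u": "\u0DD4",        # SINHALA VOWEL SIGN KETTI PAA-PILLA
    "L": "\u0DBD\u0DD4",  # a single key that emits a two-character sequence
}


def type_keys(keys, layout=LAYOUT):
    """Return the text produced by pressing `keys` in order.

    Keys that are missing from the layout pass through unchanged.
    """
    return "".join(layout.get(key, key) for key in keys)


if __name__ == "__main__":
    text = type_keys(["b", "L"])
    # List the code points that the two key presses produced.
    print([f"U+{ord(ch):04X}" for ch in text])
```

[In this toy table, pressing "b" then "L" yields the same three code points as pressing "b", "l", "u". The stored text does not record which keys were used to enter it.]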
On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell wrote: > Jean-Fran?ois Colson wrote: > > The idea here was ?that characters not on an ordinary QWERTY keyboard >> could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any >> dead keys on an _ordinary_ (i.e. not one using an international(ized) >> driver) QWERTY keyboard? >> > > Not on the standard vanilla U.S. keyboard. It has to be provided by the > OS, via a driver, just as Compose key support has to be provided by the OS. > > The standard vanilla U.S. keyboard also doesn't provide the accented > letters and other non-ASCII letters like ? that Naena Guru uses for his > font hack. > > If a character is available by a dead key, isn?t it on the keyboard ? >> > > It depends on what you mean by "on the keyboard." Thanks to John Cowan's > delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to > get the fraction ? (one-seventh). That character is not "on the keyboard" > in any sense other than what the driver provides. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell ? > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Mar 17 11:38:47 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Mar 2014 09:38:47 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Message-ID: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net> Naena Guru wrote: > Making a keyboard [layout] is not hard. You can either edit an > existing one or make one from scratch. I made the latest Romanized > Singhala one from scratch. The earlier one was an edit of US- > International. I've made a couple dozen of them myself, with MSKLC. 
> When you type a key on the physical keyboard, you generate what is > called a scan-code of that key so that the keyboard driver knows which > key was pressed. (During DOS days, we used to catch them to make > menus.) Now, you assign one or a sequence of Unicode characters you > want to generate for the keypress. Precisely. As Marc Durdin said, you can create a keyboard layout just as easily for Unicode characters as for ASCII and Latin-1 characters. You can also assign a combination of characters to a single key. So it is not true that "typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too." Unicode does not prescribe any key map. You can have whatever layout you like. As Marc also said, if you think there are "marks and signs" missing from Unicode, that is another matter. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From naenaguru at gmail.com Mon Mar 17 13:21:06 2014 From: naenaguru at gmail.com (Naena Guru) Date: Mon, 17 Mar 2014 13:21:06 -0500 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com> References: <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com> <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com> Message-ID: You have lot of stars, my friend. You ARE the Lion. Roar!! You have lot of friends in Sri Lanka, indeed, and one thanked you highly for the service you did for the Language too. Good for you. However, he did not know the way to the Public Library or the nearest Buddhist temple or the Christian Church that would have enlightened him on Singhala writing system, so you could help you make meaningful proposal than copying something that somebody else forwarded with a question. Did any one of them show you Rev. Fr. A. M Gunasekara's book? No! Who signed Mettavihari purportedly for the Singhala user group? 
On Mon, Mar 17, 2014 at 4:51 AM, Michael Everson wrote: > On 17 Mar 2014, at 02:48, Naena Guru wrote: > > > You are talking about something you do not know. I am a Singhalese. > > That doesn't give you any special knowledge or privilege. I know many > people from Sri Lanka, who work in the area of computing, and who work with > the Sinhala characters as encoded in the UCS. > > And really. "The Lion of Unicode"? > > My stars. > > Michael Everson > > > On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson > wrote: > > > > On 16 Mar 2014, at 04:12, Naena Guru wrote: > > > >> Dual-script Singhala means romanized Singhala that can be displayed > either in the Latin script or in the Singhala script using an Orthographic > Smart Font... > > > > What a terrible, terrible idea. You are essentially promoting giving up > writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack. > > > >> Dual-script Singhala is the proper and complete solution on the > computer for the Singhala script used to write Singhala, Sanskrit and Pali > languages. > > > > No, it isn't. It's a huge step backwards, unless you propose abolishing > the Sinhala script entirely and just writing in Latin. > > > >> The government ministries, media and people welcomed it with enthusiasm > and relief that there is something practical for Singhala. The response in > the country was singularly positive, except for the person that > filibustered the Q&A session of the presentation that spoke about the hard > work done on Unicode Sinhala, clearly outside the subject matter of the > presentation. > > > > That person understood the nature of data integrity. As does everyone > who cares about the Universal Character Set. > > Michael Everson * http://www.evertype.com/ > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From marc at keyman.com Mon Mar 17 15:33:56 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 20:33:56 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References:
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>

I disagree. Making a basic keyboard layout is not hard, just like making a font without OpenType support is not that hard. Making a keyboard layout that doesn't force users to learn the nuances of the encoding of a script is more of a challenge, and making a high quality keyboard layout that is consistent, easy to use, and efficient is anything but straightforward. Most keyboard layouts fail at one of these.

The story for touch device input is even more challenging. Not being constrained to a physical set of keys increases your flexibility. The big challenge is usually the size of the display on mobile-sized devices.

Regarding keyboard design:

* Scan make/break codes are not really relevant to Windows keyboards; Windows has an abstraction layer of "virtual key" codes, for better or worse.

* Selecting US-English for a non-English keyboard means that all language tools will break with your text. Spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more. That's a big price to pay. Conversely, selecting Singhala for your Romanised non-Unicode encoding will break spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru
Sent: Tuesday, 18 March 2014 3:08 AM
To: Doug Ewell
Cc: UnicoDe List
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

Making a keyboard is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch.
The earlier one was an edit of US-International. When you type a key on the physical keyboard, you generate what is called a scan-code of that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress. Use Microsft's keyboard layout creator for all versions of Windows from XP: http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx Select the language carefully. I selected US-English for RS. That way, I can switch between the two keyboards quickly with Ctrl+Shift. You can change all these in the Control Panel. Here is the keymap I made for RS in Linux: http://ahangama.com/apiapi/singhala/linuxkb-s.php Just scroll down for the English part. (The lines starting with double slashes are comments and have no effect on the program) The Macintosh key layout is easy too. The story with iOS and Android are different but not hard either. On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell > wrote: Jean-Fran?ois Colson wrote: The idea here was ?that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any dead keys on an _ordinary_ (i.e. not one using an international(ized) driver) QWERTY keyboard? Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS. The standard vanilla U.S. keyboard also doesn't provide the accented letters and other non-ASCII letters like ? that Naena Guru uses for his font hack. If a character is available by a dead key, isn?t it on the keyboard ? It depends on what you mean by "on the keyboard." Thanks to John Cowan's delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to get the fraction ? (one-seventh). That character is not "on the keyboard" in any sense other than what the driver provides. 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ? _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc at keyman.com Mon Mar 17 15:36:42 2014 From: marc at keyman.com (Marc Durdin) Date: Mon, 17 Mar 2014 20:36:42 +0000 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <289A5852AC2342BABF18C045E4157A90@DougEwell> References: <289A5852AC2342BABF18C045E4157A90@DougEwell> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD141BB@federation.tavultesoft.local> In the modern PC world, the physical keyboard generates scan codes, and these are not tied to what is printed on the key cap. Dead keys and modifiers are implemented in software. But key repeat is implemented in hardware. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Tuesday, 18 March 2014 1:04 AM To: unicode at unicode.org Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Michael Everson wrote: >>> The idea here was that characters not on an ordinary QWERTY keyboard >>> could be entered _using_an_ordinary_QWERTY_keyboard._ Are there any >>> dead keys on an _ordinary_ (i.e. not one using an >>> international(ized) driver) QWERTY keyboard? >> >> Not on the standard vanilla U.S. keyboard. It has to be provided by >> the OS, via a driver, just as Compose key support has to be provided >> by the OS. > > Please distinguish between "keyboard" which is a piece of hardware and > "keyboard layout" which is a software input method. Sorry for the shorthand. Everything I am talking about is software. I don't think there is such a thing as a physical dead key on a computer keyboard. 
The Compose key on *nix systems may be a physical key, but it doesn't have any special ability to compose characters unless given that ability by software. "An ordinary QWERTY keyboard," as Jean-Fran?ois put it, can generate any character, Latin or Sinhala or whatever, so long as the hardware has the right software behind it. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ? _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From naenaguru at gmail.com Mon Mar 17 18:18:50 2014 From: naenaguru at gmail.com (Naena Guru) Date: Mon, 17 Mar 2014 18:18:50 -0500 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: Romanized to Unicode? Romanizing is inside Unicode. English and all Western European languages also use Unicode. Romanized Singhala resides in Latin 1 character set, that is between U+0020 to U+00FF. Unicode Singhala resides in the range U+0D80 to U+0DFF There is no difference between RS and those languages except the users live in an island far away from those others. Is there some reason you want to convert romanized Singhala to Unicode Singhala, a terrible specification that is already corrupting the language? I spoke to serious users such as journalists and teachers just few weeks back. It is unfortunate that you guys are still hanging on to it. Why? The proof-of-concept font I made has glyph substitutions. That is how it can apply an orthography. Unicode Singhala is completely a botched work. It has vowels each with two codes, one for stand-alone and one for its sign. Each consonant is considered as having the embedded (intrinsic!) vowel a. I is not a consonant, people. 
Then it has two ligatures included as basic consonants These do not have normalizing rules,
1 because they are NOT canonical forms as there was no precedent digital form of Singhala for backward compatibility
2 It was submitted after Unicode closed receiving applications for normalizing canonical forms.

How on earth can you make a sorting method for it?

When you backspace it destroys multiple keystrokes. Search and replace is not possible, at least the way do it with English. Typing is a nightmare. There are special rules for making Unicode Singhala fonts. The keyboards have keys to type pieces of letters not in the code block.

As you see, this is a terrible mess and cannot be straightened, granted few people use it, and there'll be more. What other choice do they have except Anglicizing?. In Singhala, they say, "balu valigee uµa purukee ðaalaa hæðuvaþ nææ æðee ærennee" (බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ <- I inserted all joiners, but can't guarantee if vowel signs would pop out). It means you cannot straighten dog tail even if you put it in a bamboo.piece. You cannot fix Unicode Singhala and sadly, it is bringing down the language with it.

On Sun, Mar 16, 2014 at 11:05 AM, William_J_G Overington <wjgo_10009 at btinternet.com> wrote:

> So, everyone, can the Romanized Singhala system be used with a QWERTY
> keyboard to produce Unicode-encoded text, thereby producing a good combined
> system?
>
> Could this be achieved if a text-processing software package were produced
> that could automatically perform a character string to character string
> substitution (namely Romanized Singhala character string to Unicode
> character string) that would be applied before any OpenType glyph to glyph
> substitution?
>
> The character string to character string substitution rules could be
> stored in a text file, such as a UTF-16 text file saved from WordPad, that
> format being what WordPad describes as a Unicode Text Document file type.
> > Could this be achieved? > > If so, text entry could use an ordinary QWERTY keyboard and yet the > resulting text would be stored using the appropriate Unicode characters for > the script and the font would use Unicode mappings. > > William Overington > > 16 March 2014 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc at keyman.com Mon Mar 17 18:36:34 2014 From: marc at keyman.com (Marc Durdin) Date: Mon, 17 Mar 2014 23:36:34 +0000 Subject: Romanized Singhala got great reception in Sri Lanka In-Reply-To: References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD15AAE@federation.tavultesoft.local> Typing methods are not a factor. These are easily solved with modern input methods. You are confusing encoding systems with input methods. As I said previously, I am happy to assist you in creating a Unicode-based Romanized input method for Singhala, off list, that works exactly the way you want it to. It?s an exciting process, especially that point when you discover how much more flexibility you get when you don?t design your encoding around a particular input method! For example, you can create both a Romanized input method and a visual input method for the same encoding. Marc From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru Sent: Tuesday, 18 March 2014 10:19 AM To: William_J_G Overington Cc: Unicode List Subject: Re: Romanized Singhala got great reception in Sri Lanka When you backspace it destroys multiple keystrokes. Search and replace is not possible, at least the way do it with English. Typing is a nightmare. There are special rules for making Unicode Singhala fonts. The keyboards have keys to type pieces of letters not in the code block. -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From naenaguru at gmail.com Mon Mar 17 19:04:28 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 19:04:28 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
Message-ID:

Marc,

Yes, making keyboard layouts is not difficult. I believed that language tools are selected for each language manually when input. I did not know that there is an automatic switching of language tools say when you switch to French keyboard from English. That shouldn't be difficult to make for RS, though.

In the case of romanized Singhala, any processing that English accepts, it accepts too. For RS, you select a font to display it in the native script because if it is mixed with English, both are using the same character space, just as when English and French are mixed.

Spell checking, grammar checking for Unicode Singhala? There are no such things for it there. It is in the stage of struggling to input text: special programs, physical keyboards etc. I saw them. They have a special IT category of employees to input Unicode Sinhala. They have special places called Typesetting kiosks in Lanka where you go to get your résumé and term paper printed.

On Mon, Mar 17, 2014 at 3:33 PM, Marc Durdin wrote:

> I disagree. Making a basic keyboard layout is not hard, just like
> making a font without OpenType support is not that hard. Making a
> keyboard layout that doesn't force users to learn the nuances of the
> encoding of a script is more of a challenge, and making a high quality
> keyboard layout that is consistent, easy to use, and efficient is anything
> but straightforward. Most keyboard layouts fail at one of these.
>
> The story for touch device input is even more challenging. Not being
> constrained to a physical set of keys increases your flexibility.
The big > challenge is usually the size of the display on mobile-sized devices. > > > > Regarding keyboard design: > > ? Scan make/break codes are not really relevant to Windows > keyboards ? Windows has an abstraction layer of ?virtual key? codes, for > better or worse. > > ? Selecting US-English for a non-English keyboard means that all > language tools will break with your text. Spell checking, grammar > checking, automatic keyboard selection, autocorrect, font selection and > more. That?s a big price to pay. Conversely, selecting Singhala for your > Romanised non-Unicode encoding will break spell checking, grammar checking, > automatic keyboard selection, autocorrect, font selection and more. > > > > Marc > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Naena > Guru > *Sent:* Tuesday, 18 March 2014 3:08 AM > *To:* Doug Ewell > *Cc:* UnicoDe List > *Subject:* Re: Dead and Compose keys (was: Re: Romanized Singhala got > great reception in Sri Lanka) > > > > Making a keyboard is not hard. You can either edit an existing one or make > one from scratch. I made the latest Romanized Singhala one from scratch. > The earlier one was an edit of US-International. > > > > When you type a key on the physical keyboard, you generate what is called > a scan-code of that key so that the keyboard driver knows which key was > pressed. (During DOS days, we used to catch them to make menus.) Now, you > assign one or a sequence of Unicode characters you want to generate for the > keypress. > > > > Use Microsft's keyboard layout creator for all versions of Windows from XP: > > http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx > > > > Select the language carefully. I selected US-English for RS. That way, I > can switch between the two keyboards quickly with Ctrl+Shift. You can > change all these in the Control Panel. 
> > > > Here is the keymap I made for RS in Linux: > > http://ahangama.com/apiapi/singhala/linuxkb-s.php > > Just scroll down for the English part. (The lines starting with double > slashes are comments and have no effect on the program) > > > > The Macintosh key layout is easy too. > > > > The story with iOS and Android are different but not hard either. > > > > > > On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell wrote: > > Jean-Fran?ois Colson wrote: > > The idea here was ?that characters not on an ordinary QWERTY keyboard > could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any > dead keys on an _ordinary_ (i.e. not one using an international(ized) > driver) QWERTY keyboard? > > > Not on the standard vanilla U.S. keyboard. It has to be provided by the > OS, via a driver, just as Compose key support has to be provided by the OS. > > The standard vanilla U.S. keyboard also doesn't provide the accented > letters and other non-ASCII letters like ? that Naena Guru uses for his > font hack. > > If a character is available by a dead key, isn?t it on the keyboard ? > > > It depends on what you mean by "on the keyboard." Thanks to John Cowan's > delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to > get the fraction ? (one-seventh). That character is not "on the keyboard" > in any sense other than what the driver provides. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell ? > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From naenaguru at gmail.com Mon Mar 17 19:21:00 2014 From: naenaguru at gmail.com (Naena Guru) Date: Mon, 17 Mar 2014 19:21:00 -0500 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net> References: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net> Message-ID: Doug, Making keyboard layouts for Unicode Singhala is hard not because of fault of Unicode. It is the complexity of letter assembly. I have use the Wijesekara keyboard on a 24in Olympia Singhala keyboard in 1970s. It is radically different from US-English. I tried to make a phonetic one to kind of relate to the English keys. Still, you need to have many shifted keys to get common letters. On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell wrote: > Naena Guru wrote: > > > Making a keyboard [layout] is not hard. You can either edit an > > existing one or make one from scratch. I made the latest Romanized > > Singhala one from scratch. The earlier one was an edit of US- > > International. > > I've made a couple dozen of them myself, with MSKLC. > > > When you type a key on the physical keyboard, you generate what is > > called a scan-code of that key so that the keyboard driver knows which > > key was pressed. (During DOS days, we used to catch them to make > > menus.) Now, you assign one or a sequence of Unicode characters you > > want to generate for the keypress. > > Precisely. As Marc Durdin said, you can create a keyboard layout just as > easily for Unicode characters as for ASCII and Latin-1 characters. You > can also assign a combination of characters to a single key. > > So it is not true that "typing Unicode Sinhala requires you to learn a > key map that is entirely different from the familiar English keyboard, > while losing some marks and signs too." Unicode does not prescribe any > key map. 
You can have whatever layout you like. > > As Marc also said, if you think there are "marks and signs" missing from > Unicode, that is another matter. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lang.support at gmail.com Mon Mar 17 19:45:55 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Tue, 18 Mar 2014 11:45:55 +1100 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net> Message-ID: On 18/03/2014 11:23 AM, "Naena Guru" wrote: > > > I tried to make a phonetic one to kind of relate to the English keys. Still, you need to have many shifted keys to get common letters. > No you don't, you just need to understand the possibilities of what your input framework is capable of and the best way to implement what you want to achieve. The windows input system is probably the most contrained, but to look at a good phonetic layout have a look at the Cherokee Phonetic layout on Windows 8+ Designing a god layout requires using the right tools, knowing the limits and capabilities of those tools, and using them in creative ways. > > On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell wrote: >> >> Naena Guru wrote: >> >> > Making a keyboard [layout] is not hard. You can either edit an >> > existing one or make one from scratch. I made the latest Romanized >> > Singhala one from scratch. The earlier one was an edit of US- >> > International. >> >> I've made a couple dozen of them myself, with MSKLC. >> >> > When you type a key on the physical keyboard, you generate what is >> > called a scan-code of that key so that the keyboard driver knows which >> > key was pressed. (During DOS days, we used to catch them to make >> > menus.) 
Now, you assign one or a sequence of Unicode characters you >> > want to generate for the keypress. >> >> Precisely. As Marc Durdin said, you can create a keyboard layout just as >> easily for Unicode characters as for ASCII and Latin-1 characters. You >> can also assign a combination of characters to a single key. >> >> So it is not true that "typing Unicode Sinhala requires you to learn a >> key map that is entirely different from the familiar English keyboard, >> while losing some marks and signs too." Unicode does not prescribe any >> key map. You can have whatever layout you like. >> >> As Marc also said, if you think there are "marks and signs" missing from >> Unicode, that is another matter. >> >> -- >> Doug Ewell | Thornton, CO, USA >> http://ewellic.org | @DougEwell >> > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Mar 17 19:56:25 2014 From: doug at ewellic.org (Doug Ewell) Date: Mon, 17 Mar 2014 18:56:25 -0600 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> Message-ID: <97765363E97E4C2E9CE7F318327A11D1@DougEwell> Naena Guru wrote: > In the case of romanized Singhala, any processing that English > accepts, it accepts too. For RS, you select a font to display it in > the native script because if it is mixed with English, both are using > the same character space, just as when English and French are mixed. But English and French actually *use* the same letters, or at any rate most of them. With your approach, it is not possible to write Sinhala in the Sinhala script mixed with English or French or anything else in the Latin script. In web pages you can resort to tricks, but this doesn't work for plain text. 
This is what people mean when they suggest that your real goal is to abolish the Sinhala script and just write in Latin.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From ken.whistler at sap.com Mon Mar 17 20:36:59 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 18 Mar 2014 01:36:59 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To:
References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
Message-ID:

Well, I actually don't see. I took a look at the Sinhala you inserted in this email. I cannot tell what you did at your input end (about "inserted all joiners"), but there are no actual joiners in the text itself. It displayed just fine in my email (including the correct conditional formatting of the -u vowel applied to the ra in purukee), without me doing anything special (or installing any hacked font). Why? Because it was transmitted in plain Unicode.

I cut and pasted that Unicode Sinhala string into a Word document, and it worked just fine. The boundaries for all the syllables were correctly detected. I saved it as a plain text UTF-8 file, and it worked just fine. I even then read the plain text UTF-8 file into a UTF-8 aware programming editor, and it worked just fine. (In a programming editor, which doesn't attempt complex script rendering, the vowels don't apply to the consonants and no reordering is done, so the display isn't correct, but each character is correctly preserved, and if I write it back out to a document and read it in Word or some other tool that has access to proper rendering, it is still fine.)

And all that interoperability works, why? Because this is plain Unicode.

So while I don't doubt that people may be having serious issues with input methods for Sinhala, I tend to agree with Marc Durdin that you are confusing encoding with input methods.
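[Editorial sketch, not part of Ken's message: the round trip he describes (Unicode text saved as UTF-8 and read back with every character preserved) can be reproduced in a few lines. The three code points are assumed to be the first Sinhala word of the proverb; their UTF-8 bytes are the ones that follow the opening parenthesis in the dump in his P.S. The helper name `in_sinhala_block` is invented for this sketch.]

```python
# First Sinhala word of the proverb, written as explicit code points
# (romanized in the thread as "balu").
word = "\u0DB6\u0DBD\u0DD4"

# Encode to UTF-8, as a plain-text file or an email body would store it.
encoded = word.encode("utf-8")

# Decode again, as any UTF-8 aware editor or mail client would.
decoded = encoded.decode("utf-8")


def in_sinhala_block(ch):
    """True if `ch` lies in the Sinhala block, U+0D80 to U+0DFF."""
    return 0x0D80 <= ord(ch) <= 0x0DFF


if __name__ == "__main__":
    print(encoded.hex(" ").upper())                   # E0 B6 B6 E0 B6 BD E0 B7 94
    print(decoded == word)                            # True
    print(all(in_sinhala_block(ch) for ch in decoded))  # True
```

[Each character in this block takes three bytes in UTF-8, and decoding returns exactly the code points that went in. No rendering or font is involved at this level.]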
Yes, I know you know the difference, but it appears to me that the inescapable conclusion from your argumentation is that the highest priority for the design of an encoding system should be to make the design of input methods as simple as possible. And in my estimation, that is confusing encoding with input methods.

The art of input methods is to hide encoding details from users, and instead to provide them with an abstraction that they find easy to use and which accords with their general understanding of the writing system they are using. If done correctly, then the details of the input method *also* recede into the background, and users then simply do what they want: write and edit text easily on their devices.

--Ken

P.S. Here is an octal dump of that text (after I inserted a closing parenthesis in the editor). Sinhala sequence highlighted. Plain Unicode in UTF-8, no fancy stuff, and works just fine.

0000000000 EF BB BF 62 61 6C 75 20 76 61 6C 69 67 65 65 C2
0000000020 A0 75 C2 B5 61 20 70 75 72 75 6B 65 65 C2 A0 C3
0000000040 B0 61 61 6C 61 61 20 68 C3 A6 C3 B0 75 76 61 C3
0000000060 BE 20 6E C3 A6 C3 A6 20 C3 A6 C3 B0 65 65 20 C3
0000000100 A6 72 65 6E 6E 65 65 0D 0A 28 E0 B6 B6 E0 B6 BD
0000000120 E0 B7 94 20 E0 B7 80 E0 B6 BD E0 B7 92 E0 B6 9C
0000000140 E0 B7 9A 20 E0 B6 8B E0 B6 AB 20 E0 B6 B4 E0 B7
0000000160 94 E0 B6 BB E0 B7 94 E0 B6 9A E0 B7 9A 20 E0 B6
0000000200 AF E0 B7 8F E0 B6 BD E0 B7 8F 20 E0 B7 84 E0 B7
0000000220 90 E0 B6 AF E0 B7 94 E0 B7 80 E0 B6 AD E0 B7 8A
0000000240 20 E0 B6 B1 E0 B7 91 20 E0 B6 87 E0 B6 AF E0 B7
0000000260 9A 20 E0 B6 87 E0 B6 BB E0 B7 99 E0 B6 B1 E0 B7
0000000300 8A E0 B6 B1 E0 B7 9A 29 0D 0A 0D 0A

As you see, this is a terrible mess and cannot be straightened, granted few people use it, and there'll be more. What other choice do they have except Anglicizing? In Singhala, they say, "balu valigee uµa purukee ðaalaa hæðuvaþ nææ æðee ærennee" (බලු වලිගේ උණ පුරුකේ දාලා හැදුවත් නෑ ඇදේ ඇරෙන්නේ <- I inserted all joiners, but can't guarantee if vowel signs would pop out). It means you cannot straighten a dog's tail even if you put it in a bamboo piece. You cannot fix Unicode Singhala and sadly, it is bringing down the language with it.

From naenaguru at gmail.com Tue Mar 18 00:23:31 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Tue, 18 Mar 2014 00:23:31 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To:
References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
Message-ID:

Thank you, Ken. You analyzed it very nicely. I said that the signs might pop out because I have had complaints of that happening. I think this is because the implementation of proper rendering lags behind in some systems.

On input, I tried to make a layout that is close to QWERTY, but failed because of the need for too many combination keys. Keyman uses the old typewriter keyboard, Wijesekara. I saw a better one on the front page for Singhala but did not find it further inside. Marc would know, of course.

Anyway my complaint is that Unicode Singhala is incomplete and wrong and that it has a deleterious effect on the language, one of the oldest in the world. What's aggravating is that they institutionalize errors as correct. Rev. Fr. Perera warned against this 80 years ago. I suppose I wouldn't have much to say if the 58 phonemes are used to replace the ones there. It will not happen.
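The dump in Ken Whistler's P.S. can be checked mechanically. What follows is only a minimal Python sketch, assuming the hex bytes are transcribed exactly as they appear in the dump (the left-hand column holds octal offsets and is left out); the variable names are illustrative and not taken from any message in this thread. It decodes the bytes as UTF-8 and counts ZWJ/ZWNJ characters, which is one way to test the statement that the text contains no actual joiners.

```python
import unicodedata

# Hex bytes transcribed from the dump, 16 per row; octal offsets omitted.
DUMP = """
EF BB BF 62 61 6C 75 20 76 61 6C 69 67 65 65 C2
A0 75 C2 B5 61 20 70 75 72 75 6B 65 65 C2 A0 C3
B0 61 61 6C 61 61 20 68 C3 A6 C3 B0 75 76 61 C3
BE 20 6E C3 A6 C3 A6 20 C3 A6 C3 B0 65 65 20 C3
A6 72 65 6E 6E 65 65 0D 0A 28 E0 B6 B6 E0 B6 BD
E0 B7 94 20 E0 B7 80 E0 B6 BD E0 B7 92 E0 B6 9C
E0 B7 9A 20 E0 B6 8B E0 B6 AB 20 E0 B6 B4 E0 B7
94 E0 B6 BB E0 B7 94 E0 B6 9A E0 B7 9A 20 E0 B6
AF E0 B7 8F E0 B6 BD E0 B7 8F 20 E0 B7 84 E0 B7
90 E0 B6 AF E0 B7 94 E0 B7 80 E0 B6 AD E0 B7 8A
20 E0 B6 B1 E0 B7 91 20 E0 B6 87 E0 B6 AF E0 B7
9A 20 E0 B6 87 E0 B6 BB E0 B7 99 E0 B6 B1 E0 B7
8A E0 B6 B1 E0 B7 9A 29 0D 0A 0D 0A
"""

# Join the hex pairs and turn them into raw bytes.
data = bytes.fromhex("".join(DUMP.split()))

# "utf-8-sig" decodes UTF-8 and drops the leading EF BB BF byte-order mark.
text = data.decode("utf-8-sig")

# Count characters in the Sinhala block and any ZWNJ (U+200C) / ZWJ (U+200D).
sinhala_count = sum(1 for ch in text if "\u0d80" <= ch <= "\u0dff")
joiner_count = sum(1 for ch in text if ch in "\u200c\u200d")

print(len(data), "bytes,", sinhala_count, "Sinhala code points,",
      joiner_count, "joiners")

# List each distinct non-ASCII character once, with its Unicode name.
for ch in sorted(set(text)):
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

With the bytes as transcribed here, the joiner count comes out as zero, and the first decoded line is the romanized sentence with its non-ASCII letters (µ, ð, æ, þ and two no-break spaces) intact.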
From naenaguru at gmail.com Tue Mar 18 01:01:02 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Tue, 18 Mar 2014 01:01:02 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To: <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

Okay, Doug.

Type this inside the yellow text box in the following page:
kaaryyaalavala yan?ra pa?k?i
http://www.lovatasinhala.com/puvaruva.php

Please tell me what sequence of Unicode Sinhala codes would produce what the text box shows.
From chris.fynn at gmail.com Tue Mar 18 03:20:40 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 14:20:40 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

MSKLC and KeyMan are fairly crude ways of creating input methods.

For what you want to do, you probably need a memory-resident program that traps the Latin input from the keyboard, processes the (transliterated) input strings, converting them into Unicode Sinhala strings, and then injects these back into the input queue in place of the Latin characters.

There are a couple of utilities that do this for typing transliterated/romanised Tibetan in Windows and getting Tibetan Unicode output.
http://tise.mokhin.org/
http://www.thubtenrigzin.fr/denjongtibtype/en.html

But I think both of these were written in C, as they have to do a lot of processing which is far beyond what can be accomplished with MSKLC and even KeyMan.

- C

From lang.support at gmail.com Tue Mar 18 04:26:30 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Tue, 18 Mar 2014 20:26:30 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

Chris,

Keyman is capable of doing that and a lot more, but few keyboard layout developers use it to its full potential.

As an example, I was asked by Harari teachers here in Melbourne to develop a set of three keyboard layouts for them and their students. The three keyboards were for three different orthographies in the following scripts:
1) Latin
2) Ethiopic
3) Arabic

They wanted all three layouts to work identically, using the keystrokes used on the Latin keyboard.
The Ethiopic and Arabic keyboard layouts required extensive remapping of key sequences to output.

If I were a programmer I could have done something more elegant by building an external library Keyman could call, but as it is we could do a lot inside the Keyman keyboard layout itself.

For Myanmar script keyboard layouts we allow visual input for the e-vowel sign and medial Ra, with the layout handling reordering.

One of the Latin layouts I use supports combining diacritics and reorders sequences of diacritics to their canonical order regardless of the order of input, assuming a maximum of one diacritic below and two diacritics above the base character.

Analysis and creativity can produce some very effective Keyman layouts.

Andrew
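The canonical reordering of diacritics that Andrew Cunningham describes can be illustrated outside Keyman. The following is a sketch only: it is not Keyman code, and it applies Unicode normalization to a finished string rather than reordering keystrokes as they are typed. It assumes nothing beyond Python's standard `unicodedata` module; the two combining marks are arbitrary examples.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, canonical combining class 230 (above)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, canonical combining class 230 (above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220 (below)

# The same base letter with one mark above and one below, typed in either order.
typed_above_first = "a" + ACUTE + DOT_BELOW
typed_below_first = "a" + DOT_BELOW + ACUTE

# NFD puts marks with different combining classes into canonical order
# (lower class first), so both typing orders give the same sequence.
canon_1 = unicodedata.normalize("NFD", typed_above_first)
canon_2 = unicodedata.normalize("NFD", typed_below_first)
print([f"U+{ord(c):04X}" for c in canon_1])
print(canon_1 == canon_2)

# Marks with the SAME combining class are not reordered: their typed order
# is significant, so two marks above the base stay as typed.
two_above = unicodedata.normalize("NFD", "a" + ACUTE + DIAERESIS)
two_above_swapped = unicodedata.normalize("NFD", "a" + DIAERESIS + ACUTE)
print(two_above == two_above_swapped)
```

The design point is that a layout which emits canonically ordered sequences spares later processes from having to normalize; whether a given Keyman layout does this with rules or with an external library is not something this sketch settles.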
From jf at colson.eu Tue Mar 18 08:35:48 2014
From: jf at colson.eu (Jean-François Colson)
Date: Tue, 18 Mar 2014 14:35:48 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID: <53284BB4.60707@colson.eu>

On 18/03/14 07:01, Naena Guru wrote:
> Okay, Doug.
>
> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php
>
> Please tell me what sequence of Unicode Sinhala codes would produce
> what the text box shows.

OK. I'd first say I don't speak Sinhala and I've never written a word in that language... until now. Therefore there might be mistakes, and I didn't find how to write the second syllable, ryyaa. I've replaced it with ***** below.

Here is my attempt:
??*****??? ?????? ?????

Could an aware person tell how to type the syllable ryyaa?
From chris.fynn at gmail.com Tue Mar 18 09:46:47 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 20:46:47 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

Hi Andrew

It may be possible with Keyman. I once even wrote a set of MS Word macros that did the same thing (let users type in Romanized Tibetan and output Tibetan characters); however, it stopped working when Microsoft switched from Word Basic to VBA. :-(

At least Keyman hides all the messy (and poorly documented) details of Windows system hooks, which is what you have to use if you want to make a stand-alone utility (did that once too).

If Keyman can call external libraries, that's interesting. It is certainly *far* more sophisticated and flexible than MSKLC and I shouldn't have lumped the two together.

- Chris

From chris.fynn at gmail.com Tue Mar 18 10:04:35 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 21:04:35 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

On 18/03/2014, Naena Guru wrote:
> Okay, Doug.
> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php
> Please tell me what sequence of Unicode Sinhala codes would produce what
> the text box shows.

Naena Guru

If you want, you should just be able to type it in as you wrote it, "kaaryyaalavala yan?ra pa?k?i", and get Singhala Unicode characters. But to do this you do need something more than a re-mapped keyboard layout made with MSKLC.

So long as the Roman transliteration system you are using for Singhala and Pali follows consistent rules, it is possible to write an input method that parses the Romanized text and converts it into Singhala Unicode.

If you care about your language and script, that is the proper way to do this sort of thing, not by using OpenType lookups to map strings of Latin characters to Singhala glyphs.

Chris Fynn
Thimphu, Bhutan

From jf at colson.eu Tue Mar 18 10:36:44 2014
From: jf at colson.eu (Jean-François Colson)
Date: Tue, 18 Mar 2014 16:36:44 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To: <53284BB4.60707@colson.eu>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53284BB4.60707@colson.eu>
Message-ID: <5328680C.2000909@colson.eu>

On 18/03/14 14:35, Jean-François Colson wrote:
> Could an aware person tell how to type the syllable ryyaa?

Perhaps a good tutorial on the use of ZWJ/ZWNJ in Sinhala could do the job.
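Christopher Fynn's point, that a consistently romanized text can be parsed and converted into Unicode Sinhala, can be sketched in a few lines. The Python below is a toy under loud assumptions: the letter mappings are inferred solely from the one romanized/Sinhala sentence pair preserved in Ken Whistler's dump earlier in this thread, so it is neither Naena Guru's full scheme nor a real input method (no conjuncts, no ZWJ handling, only two independent vowels). The function and table names are hypothetical.

```python
# Consonants seen in the sample sentence (romanized letter -> Sinhala letter).
CONSONANTS = {
    "b": "\u0db6", "l": "\u0dbd", "v": "\u0dc0", "g": "\u0d9c",
    "p": "\u0db4", "r": "\u0dbb", "k": "\u0d9a", "h": "\u0dc4",
    "n": "\u0db1", "\u00f0": "\u0daf", "\u00fe": "\u0dad", "\u00b5": "\u0dab",
}

# Vowels after a consonant -> dependent vowel sign ("a" is inherent: no sign).
VOWEL_SIGNS = {
    "aa": "\u0dcf", "\u00e6\u00e6": "\u0dd1", "ee": "\u0dda",
    "a": "", "\u00e6": "\u0dd0", "i": "\u0dd2", "u": "\u0dd4", "e": "\u0dd9",
}

# Word-initial vowels seen in the sample -> independent vowel letter.
INDEPENDENT_VOWELS = {"u": "\u0d8b", "\u00e6": "\u0d87"}

AL_LAKUNA = "\u0dca"  # suppresses the inherent vowel of a bare consonant


def longest_match(table, s, i):
    """Return the longest key of `table` that starts at s[i], or None."""
    for key in sorted(table, key=len, reverse=True):
        if s.startswith(key, i):
            return key
    return None


def romanized_to_sinhala(s):
    out = []
    i = 0
    while i < len(s):
        if s[i] in CONSONANTS:
            out.append(CONSONANTS[s[i]])
            i += 1
            vowel = longest_match(VOWEL_SIGNS, s, i)
            if vowel is None:
                out.append(AL_LAKUNA)       # consonant with no vowel after it
            else:
                out.append(VOWEL_SIGNS[vowel])
                i += len(vowel)
        else:
            vowel = longest_match(INDEPENDENT_VOWELS, s, i)
            if vowel is not None:
                out.append(INDEPENDENT_VOWELS[vowel])
                i += len(vowel)
            else:
                out.append(s[i])            # spaces and unmapped letters pass through
                i += 1
    return "".join(out)


sample = ("balu valigee u\u00b5a purukee \u00f0aalaa "
          "h\u00e6\u00f0uva\u00fe n\u00e6\u00e6 \u00e6\u00f0ee \u00e6rennee")
converted = romanized_to_sinhala(sample)
print(" ".join(f"{ord(c):04X}" for c in converted if c != " "))
```

On the sample sentence this reproduces the Sinhala code points in Ken's dump, which is all it has been checked against. A real input method would also need the conjunct and ZWJ rules discussed elsewhere in this thread.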
From doug at ewellic.org Tue Mar 18 11:29:03 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 09:29:03 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Message-ID: <20140318092903.665a7a7059d7ee80bb4d670165c8327d.12c06e3b98.wbe@email03.secureserver.net>

Jean-François Colson wrote:

>>> Type this inside the yellow text box in the following page:
>>> kaaryyaalavala yan?ra pa?k?i
>>> http://www.lovatasinhala.com/puvaruva.php
>>>
>>> Please tell me what sequence of Unicode Sinhala codes would produce
>>> what the text box shows.
>>
>> Could an aware person tell how to type the syllable ryyaa?
>
> Perhaps a good tutorial on the use of ZWJ/ZWNJ in Sinhala could do the
> job.

The RS-to-Unicode conversion tool on Naena's own site gives ????????????? ??????? ???????, but this doesn't exactly match the text in the yellow box visually. I suppose some combination of ZWJ/ZWNJ usage and font differences accounts for this. In any case, I wouldn't expect his RS-to-Unicode converter to work with 100% fidelity, given that he is trying to make a case that Unicode is inadequate to represent Sinhala text.

Naena knows from previous threads that I don't speak or write Sinhala. I hope his intent in asking me to provide a Unicode transcoding is not to call attention to this, and to try to demonstrate thereby that he knows more about character encoding than I do.

I can spend a little more time on this when I'm in front of my Windows 8.1 machine, which has better support for Sinhala than Windows 7. But I would think a better argument on Naena's part would be to show *us* what parts of "kaaryyaalavala yan?ra pa?k?i" can be adequately represented in his scheme but not in Unicode. And by "represented," I don't just mean "displayed on any arbitrary system." Display problems can be and tend to be fixed over time, and are not all there is to character encoding anyway.
I did also notice that Naena carefully avoided responding to my point that his approach prevents the simultaneous plain-text display of en-Latn and si-Sinh, something he does often on his web pages. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From doug at ewellic.org Tue Mar 18 12:27:21 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Mar 2014 10:27:21 -0700 Subject: Details, please (was: Re: Romanized Singhala got great reception in Sri Lanka) Message-ID: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net> I think what some of us would like to see are detailed examples, citing specific characters and combinations, rather than general rhetoric, to support claims like this: "Anyway my complaint is that Unicode Singhala is incomplete and wrong and that it has a deleterious effect on the language, one of the oldest in the world. What's aggravating is that they institutionalize errors as correct. Rev. Fr. Perera warned against this 80 years ago. I suppose I wouldn't have much to say if the 58 phonemes are used to replace the ones there. It will not happen." and these, from the web site: "[Romanized Singhala] is stable as it is safe from rules imposed by Unicode Consortium based on misinformation, and careless mangling of the language by disinterested bureaucrats." "Unicode Sinhala is a failure and cannot be fixed. That is because the premise on which it was designed is flawed." "Abugida is a writing system relegated to the sideline, as inherently incapable of a smooth interface with the computer. This is what Unicode Sinhala suffers from." -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From mike at mikemorr.com Tue Mar 18 12:46:37 2014 From: mike at mikemorr.com (Mike Morrison) Date: Tue, 18 Mar 2014 13:46:37 -0400 Subject: "Heron" element in Unicode? 
Message-ID:

Hello,

Does the element which is the right half of ?, and the left half of ?, ?, ?, exist anywhere in the current or proposed Unicode standards?

It's a simplification of ?, and similar to but not the same as ?. If not currently in Unicode, is it the sort of thing that might be considered for addition in the future?

Thanks,
Mike Morrison

From andrewcwest at gmail.com Tue Mar 18 13:49:14 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Tue, 18 Mar 2014 18:49:14 +0000
Subject: "Heron" element in Unicode?
In-Reply-To:
References:
Message-ID:

On 18 March 2014 17:46, Mike Morrison wrote:
>
> Does the element which is the right half of ?, and the left half of ?,
> ?, ?, exist anywhere in the current or proposed Unicode standards?

No.

> It's a simplification of ?, and similar to but not the same as ?. If
> not currently in Unicode, is it the sort of thing that might be
> considered for addition in the future?

People have been talking for many years about encoding the relatively few CJK components that do not exist as characters in their own right, and I think that there would be some support from the relevant committees if a well-presented proposal was submitted.

Andrew

From tom at bluesky.org Tue Mar 18 14:07:07 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 12:07:07 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID:

On Mar 17, 2014, at 11:01 PM, Naena Guru wrote:
>
> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php
>
> Please tell me what sequence of Unicode Sinhala codes would produce what the text box shows.

Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed

karYYalvl ynhR p;kni

and I got

????????????? ?????? ?????
Graphic:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 11.59.13 AM.png
Type: image/png
Size: 8085 bytes
Desc: not available

From jf at colson.eu Tue Mar 18 14:10:12 2014
From: jf at colson.eu (Jean-François Colson)
Date: Tue, 18 Mar 2014 20:10:12 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To:
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID: <53289A14.9050108@colson.eu>

On 18/03/14 20:07, Tom Gewecke wrote:
> Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala
> fonts, I typed
>
> karYYalvl ynhR p;kni

Shouldn't it be p;khi?

From tom at bluesky.org Tue Mar 18 14:24:02 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 12:24:02 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To: <53289A14.9050108@colson.eu>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu>
Message-ID:

On Mar 18, 2014, at 12:10 PM, Jean-François Colson wrote:
>> Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed
>>
>> karYYalvl ynhR p;kni
>
> Shouldn't it be p;khi?

Yes, sorry.

karYYalvl ynhR p;khi

????????????? ?????? ?????

Graphic
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 12.21.06 PM.png
Type: image/png
Size: 8338 bytes
Desc: not available

From doug at ewellic.org Tue Mar 18 14:37:20 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 12:37:20 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
Message-ID: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>

Tom, with typo spotted and corrected by Jean-François, seems to have found it:

????????????? ?????? ?????

The sequence of code points would thus be:

0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2

Naena, is this what you were looking for?
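Since the Sinhala text itself did not survive in this archive, the code point list in Doug Ewell's message is the reliable record of the string. The Python sketch below, an illustration only with made-up variable names, rebuilds the string from that list exactly as given, prints each character's Unicode name, and reports where the zero width joiners fall and how long the UTF-8 encoding is.

```python
import unicodedata

# Code points exactly as listed in Doug Ewell's message.
CODE_POINTS = """
0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2
"""

# Rebuild the string from the hexadecimal code point values.
text = "".join(chr(int(cp, 16)) for cp in CODE_POINTS.split())

# Show every character with its position and official Unicode name.
for index, ch in enumerate(text):
    print(f"{index:2d}  U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Positions of ZERO WIDTH JOINER (U+200D) and size of the UTF-8 form.
zwj_positions = [i for i, ch in enumerate(text) if ch == "\u200d"]
utf8 = text.encode("utf-8")
print("ZWJ at positions:", zwj_positions)
print(len(text), "code points,", len(utf8), "UTF-8 bytes")
```

In this list each joiner directly follows U+0DCA (the al-lakuna); pasting `text` into a Unicode-aware application with a Sinhala font is a quick way to see how a given platform renders it.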
--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From lang.support at gmail.com Tue Mar 18 14:52:34 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Wed, 19 Mar 2014 06:52:34 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)
In-Reply-To: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
Message-ID:

I suspect it was a fishing expedition to illustrate how awkward it is to type on Unicode keyboard layouts versus his system.

I.e., still no clear separation of input and encoding in his responses.
URL: From marc at keyman.com Tue Mar 18 15:48:21 2014 From: marc at keyman.com (Marc Durdin) Date: Tue, 18 Mar 2014 20:48:21 +0000 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> I know this is adding fuel to the fire but I?m sure that I am not the only one to note that the way the text is rendered in Tom?s graphic differs from the way the text is rendered with Iskoola Pota font in Win 7 and Nirmala UI font in Win 8.1. I have not analysed the difference, nor can I state with certainty which is more accurate, but this is clearly an inconsistency that will plague end users. Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering? See graphic below. [cid:image002.png at 01CF4346.93823730] ????????????? ?????? ????? Unicode text Unicode Values: U+0D9A U+0DCF U+0DBB U+0DCA U+0DBA U+0DCA U+0DBA U+0DCF U+0DBD U+0DC0 U+0DBD U+0020 U+0DBA U+0DB1 U+0DC4 U+0DCA U+0DBB U+0020 U+0DB4 U+0DA9 U+0D9A U+0DC4 U+0DD2 FWIW, iOS 7.1 renders that string identically to Mac OS X. Marc From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Tom Gewecke Sent: Wednesday, 19 March 2014 6:24 AM To: Jean-Fran?ois Colson Cc: Naena Guru; UnicoDe List Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) On Mar 18, 2014, at 12:10 PM, Jean-Fran?ois Colson wrote: Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed karYYalvl ynhR p;kni Shouldn?t it be p;khi? Yes, sorry. karYYalvl ynhR p;khi ????????????? ?????? ????? Graphic [cid:image001.png at 01CF4345.55B7FF80] -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8338 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 15888 bytes Desc: image002.png URL: From michel at suignard.com Tue Mar 18 15:56:13 2014 From: michel at suignard.com (Michel Suignard) Date: Tue, 18 Mar 2014 20:56:13 +0000 Subject: "Heron" element in Unicode? In-Reply-To: References: Message-ID: <706c3c73189a408d9a03eacb36cf6bc2@CO1PR02MB157.namprd02.prod.outlook.com> >> It's a simplification of ?, and similar to but not the same as ?. If >> not currently in Unicode, is it the sort of thing that might be >> considered for addition in the future? >People have been talking for many years about encoding the relatively few CJK components that do not exist as characters in there own right, and I think that there would be some support from the relevant committees if a well-presented proposal was submitted. Isn't that component even used in 10646 Annex S.1.4.3 (9th pair)? It is one of my surprise that 10646 can't even be fully textually documented using 10646 character elements (we use many pictures in Annex S). It has been one of my goals to get all Annex S 'components' to be fully encoded, but I'd need time to create the glyphs, unless someone wants to volunteer. 
Michel From tom at bluesky.org Tue Mar 18 15:57:11 2014 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Mar 2014 13:57:11 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu> <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> Message-ID: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote: > > Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering? See graphic below. > > The OS X version is the most correct according my limited knowledge of the script. I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one. This can be fixed by using the Bhashitha font. Also it's possible that some of the characters should be "touching". I did not add the code for that and don't think any of my current fonts support it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 18 16:19:29 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Mar 2014 14:19:29 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Message-ID: <20140318141929.665a7a7059d7ee80bb4d670165c8327d.f5c5f67bc7.wbe@email03.secureserver.net> The attached image (also at http://ewellic.org/images/sinhala-babelpad.jpg) shows how I see Tom's corrected string on Windows 7 running BabelPad, in both Iskoola Pota and Nirmala UI. Different rendering based on different operating systems, versions, and applications is unfortunate, but no more so than a solution which only works within a web browser, and not at all on IE8. 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell -------------- next part -------------- A non-text attachment was scrubbed... Name: sinhala-babelpad.jpg Type: image/jpeg Size: 25452 bytes Desc: not available URL: From jf at colson.eu Tue Mar 18 16:28:09 2014 From: jf at colson.eu (Jean-François Colson) Date: Tue, 18 Mar 2014 22:28:09 +0100 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> Message-ID: <5328BA69.9050502@colson.eu> On 18/03/14 20:37, Doug Ewell wrote: > Tom, with typo spotted and corrected by Jean-François, seems to have > found it: > > කාර්‍ය්‍යාලවල යන්හ්‍ර > පඩකහි > > The sequence of code points would thus be: > > 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020 > 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2 > > Naena, is this what you were looking for? It seems there's still a big difference in the second syllable.
> > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From marc at keyman.com Tue Mar 18 16:44:43 2014 From: marc at keyman.com (Marc Durdin) Date: Tue, 18 Mar 2014 21:44:43 +0000 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu> <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local> I've done some more analysis now that I've arrived at my office (and if I'd read Doug's email earlier, would have been able to see this too). The email I received, on my notebook running Outlook 2010, has had U+200D stripped out from that Sinhala text - hence the rendering difference. Now it gets weirder. I also have Outlook 2010 running on my office computer, also attached to the same Exchange server: so I have two copies of the same email, which in a sane world would be byte-for-byte identical. On my office computer, U+200D has not been stripped out, and the text renders in the same way as OS X. I am struggling to understand why two copies of the same email have ended up with different content - given the clients are both running the same version of Windows and the same version of Outlook (even down to the same updates and security patches), connected to the same Exchange server. MS Office Language Preferences do not list Sinhala in either case. The same email on my iPhone has correct content, as does the webmail version. Anyone got any ideas? 
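One way to confirm a content-level difference like this is to compare the code points of the two copies directly, rather than judging by how they render. The sketch below is a minimal Python illustration with hypothetical names; it uses only the "ryy" sequence quoted in this thread and makes no claim about how Outlook itself handles the text.

```python
from collections import Counter

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# "ryy" as given in the thread: 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA
intact = "\u0dbb\u0dca\u200d\u0dba\u0dca\u200d\u0dba"
# The same text as it would look with every U+200D removed.
stripped = intact.replace(ZWJ, "")


def missing_code_points(original: str, received: str) -> dict:
    """Count code points that occur more often in `original` than in `received`."""
    lost = Counter(original) - Counter(received)
    return {f"U+{ord(ch):04X}": count for ch, count in lost.items()}


print(missing_code_points(intact, stripped))
```

Run on two saved copies of the same message body, a check along these lines would show whether the joiners were lost before the text ever reached the renderer.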
Marc From: Tom Gewecke [mailto:tom at bluesky.org] Sent: Wednesday, 19 March 2014 7:57 AM To: Marc Durdin Cc: Jean-Fran?ois Colson; Naena Guru; UnicoDe List Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote: Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering? See graphic below. The OS X version is the most correct according my limited knowledge of the script. I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one. This can be fixed by using the Bhashitha font. Also it's possible that some of the characters should be "touching". I did not add the code for that and don't think any of my current fonts support it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 18 16:50:42 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Mar 2014 14:50:42 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) Message-ID: <20140318145042.665a7a7059d7ee80bb4d670165c8327d.c4a6e2336e.wbe@email03.secureserver.net> Jean-Fran?ois Colson wrote: > It seems there?s still a big difference in the second syllable. Naena's original text "kaaryyaalavala" seems to imply the second syllable begins with "r" followed by "ya". Is the "r" supposed to form a conjunct with the second "ya", as his font shows, rather than the first? 
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From tom at bluesky.org Tue Mar 18 16:53:35 2014 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Mar 2014 14:53:35 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <5328BA69.9050502@colson.eu> References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> <5328BA69.9050502@colson.eu> Message-ID: <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org> On Mar 18, 2014, at 2:28 PM, Jean-François Colson wrote: >> >> The sequence of code points would thus be: >> >> 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020 >> 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2 >> >> Naena, is this what you were looking for? > > It seems there's still a big difference in the second syllable. Are you referring to "ryy" (0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA)? That is the correct encoding I think. But most fonts don't display it quite right. -------------- next part -------------- A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 2.51.11 PM.png Type: image/png Size: 26998 bytes Desc: not available URL: From asmusf at ix.netcom.com Tue Mar 18 16:48:46 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 18 Mar 2014 14:48:46 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu> <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> Message-ID: <5328BF3E.9050100@ix.netcom.com> On 3/18/2014 1:57 PM, Tom Gewecke wrote: > > On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote: > >> Can anyone who is more knowledgeable in Unicode Sinhala tell me which >> is the correct rendering? See graphic below. >> > > The OS X version is the most correct according my limited knowledge of > the script. I think the Apple font does not place the diacritic over > the second character correctly, it should be over the next one. This > can be fixed by using the Bhashitha font. I get the "OS X" appearance (or one that matches the "graphics" on Win7 + viewin in Thunderbird). > > Also it's possible that some of the characters should be "touching". > I did not add the code for that and don't think any of my current > fonts support it. > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Tom at bluesky.org Tue Mar 18 16:56:51 2014 From: Tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Mar 2014 14:56:51 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <5328BA69.9050502@colson.eu> References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> <5328BA69.9050502@colson.eu> Message-ID: <326E464D-AA06-4BCC-A72F-1215BE7EA3CF@bluesky.org> PS A good source for info on the Sinhala codes, etc is https://www.microsoft.com/typography/OpenTypeDev/sinhala/intro.htm From marc at keyman.com Tue Mar 18 17:02:10 2014 From: marc at keyman.com (Marc Durdin) Date: Tue, 18 Mar 2014 22:02:10 +0000 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local> References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local> <97765363E97E4C2E9CE7F318327A11D1@DougEwell> <53289A14.9050108@colson.eu> <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local> <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org> <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local> Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F345@federation.tavultesoft.local> And I understand the issue now. My notebook did not have any "complex script" languages installed as "Editing Languages" in MS Office Language Preferences. Thus, it stripped out U+200D when presenting the Sinhala text. My office computer had Arabic installed as an "Editing Language," and so the content rendered correctly. The thing that threw me is that this is not even a rendering-level issue but a content-level issue - the content is corrupted before it ever gets to the renderer. Traps for young players (and honestly, Microsoft, I should not have to install an "Editing"
language to view an email...) Sorry for spamming the list. Marc From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marc Durdin Sent: Wednesday, 19 March 2014 8:45 AM To: Tom Gewecke Cc: Naena Guru; Jean-François Colson; UnicoDe List Subject: RE: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) I've done some more analysis now that I've arrived at my office (and if I'd read Doug's email earlier, would have been able to see this too). The email I received, on my notebook running Outlook 2010, has had U+200D stripped out from that Sinhala text - hence the rendering difference. Now it gets weirder. I also have Outlook 2010 running on my office computer, also attached to the same Exchange server: so I have two copies of the same email, which in a sane world would be byte-for-byte identical. On my office computer, U+200D has not been stripped out, and the text renders in the same way as OS X. I am struggling to understand why two copies of the same email have ended up with different content - given the clients are both running the same version of Windows and the same version of Outlook (even down to the same updates and security patches), connected to the same Exchange server. MS Office Language Preferences do not list Sinhala in either case. The same email on my iPhone has correct content, as does the webmail version. Anyone got any ideas? Marc From: Tom Gewecke [mailto:tom at bluesky.org] Sent: Wednesday, 19 March 2014 7:57 AM To: Marc Durdin Cc: Jean-François Colson; Naena Guru; UnicoDe List Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote: Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering? See graphic below. The OS X version is the most correct according my limited knowledge of the script.
I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one. This can be fixed by using the Bhashitha font. Also it's possible that some of the characters should be "touching". I did not add the code for that and don't think any of my current fonts support it. From tom at bluesky.org Tue Mar 18 17:13:24 2014 From: tom at bluesky.org (Tom Gewecke) Date: Tue, 18 Mar 2014 15:13:24 -0700 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> Message-ID: <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org> On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote: > I suspect it was a fishing expedition to illustrate how awkward it is to type on Unicode keyboard layouts versus his system. > Interesting question perhaps. Is it more awkward to type 14 strokes as k a a r y y a a l a v a l a or to type 9 as ??? ? ???? ???? ?? ? ? ? ? From lang.support at gmail.com Tue Mar 18 18:42:51 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Wed, 19 Mar 2014 10:42:51 +1100 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org> References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org> Message-ID: Different individuals, groups and communities can bring their own expectations to input layout designs. Design is a balance between capabilities and limitations of the input framework versus the expectations of the user community around how their language should work.
I work with multiple operating systems and even more input frameworks. I have my preferred input frameworks. But ultimately it is a question of knowing your tools. For instance, if you compile a keyboard layout from the command line with MSKLC you can chain deadkeys, build against custom locales in Vista and Win7, or build against unsupported language codes in Win8+. Andrew On 19/03/2014 9:13 AM, "Tom Gewecke" wrote: > > On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote: > > I suspect it was a fishing expedition to illustrate how awkward it is to > type on Unicode keyboard layouts versus his system. > > > Interesting question perhaps. Is it more awkward to type 14 strokes as k > a a r y y a a l a v a l a or to type 9 as ? ? ? ??? ??? ? ? ? ? ? > > From richard.wordingham at ntlworld.com Tue Mar 18 20:00:57 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 19 Mar 2014 01:00:57 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com> <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com> Message-ID: <20140319010057.42f5b5ff@JRWUBU2> On Mon, 17 Mar 2014 18:18:50 -0500 Naena Guru wrote: (in topic 'Romanized Singhala got great reception in Sri Lanka') > Typing is a nightmare. > When you backspace it destroys multiple keystrokes. I suspect this is a widespread and unsolved problem. If one positions the cursor after a character entered by multiple characters in a previous program, there doesn't seem to be a way of undoing the previous typing. A Latin-1 analogy is entering e-acute by typing 'e and then backspacing later. That will usually simply delete the e-acute rather than leaving the dead key apostrophe. In some ways it may be an insoluble problem rather than merely difficult.
For example, when using KMFL to type the Tai Tham script, I have two ways of typing the combination (historically corresponding to the single character U+199C NEW TAI LUE LETTER HIGH LA and sometimes listed as a letter in its own right): 1) !} 2) s!] I use '!' because I can't get altGr to work in KMFL. It works as a dead key. Key sequence (1) views the Tai Tham sequence as a single character; key sequence (2) views it as the sequence of Unicode characters. The keystrokes are based on the Thai Kesmanee keyboard. The mnemonic for the sequences with '!' is that the single key stroke ']' results in U+1A43 TAI THAM LETTER LA. The single, shifted key stroke '}' results in a comma, as in the Thai keyboard. If I position the cursor after the character sequence, what should I get after typing and then the character '}'? Should I get: (a) (by assuming input sequence 1); (b) (by assuming input sequence 2); or (c) (what I actually get)? > Search and > replace is not possible, at least the way do it with English. I suspect the problem you have is that editing tools expect the user to think of a combination of base character and combining mark as a single character. I don't know how to counter this expectation. For LibreOffice, I do search and replace by choosing the 'regular expression' option, as this does allow the user to work with characters rather than legacy grapheme clusters (UAX #29: Unicode Text Segmentation). Richard. From naa.ganesan at gmail.com Wed Mar 19 03:04:01 2014 From: naa.ganesan at gmail.com (N.
Ganesan) Date: Wed, 19 Mar 2014 01:04:01 -0700 Subject: Urdu Nastaliq script Message-ID: http://tribune.com.pk/story/683067/inventing-revolution-the-man-who-gave-urdu-its-wings/ Inventing revolution: The man who gave Urdu its wings By Khalid Rahman Published: March 15, 2014 *KARACHI: * *Ahmed Mirza Jamil changed the way all Urdu newspapers and books would be published anywhere in the world; and he did it back in 1981 with his Noori Nastaliq script that gave the Midas touch to desktop publishing.* The present-day Urdu publishing owes its elegant contours to the calligraphic skills of this great wizard of calligraphy. Before being used in the composing software, InPage, the Noori Nastaliq was created as a digital typeface (font) in 1981 when master-calligrapher Ahmed Mirza Jamil and Monotype Imaging (then called Monotype Corp) collaborated on a joint venture. Earlier, Urdu newspapers, books and magazines needed manual calligraphers, who were replaced by computer machines in Pakistan, India, UK and other countries. The government of Pakistan recognised Ahmad Mirza Jamil?s singular achievement in 1982 by designating Noori Nastaliq as an ?Invention of National Importance? and awarded him with the medal of distinction, Tamgha-e-Imtiaz. In recognition of his achievement, the University of Karachi also awarded him the degree of Doctor of Letters, Honoris Causa. Narrating the history of his achievement in his book, ?Revolution in Urdu Composing?, he wrote: ?In future, Urdu authors will be able to compose their books like the authors of the languages of Roman script. Now, the day a manuscript is ready is the day the publication is ready for printing. There is no waiting for calligraphers to give their time grudgingly, no apprehension of mistakes creeping in, nor any complaints about the calligraphers or operators not being familiar with the language. 
?Soon our future generations will be asking incredulously whether it was really true that there was a time when newspapers were painstakingly manually calligraphed all through the night to be printed on high speed machines in the morning. Were we really so primitive that our national language had to limp along holding on to the crutches of the calligraphers that made the completion of books an exercise ranging from months to years depending upon their volume.? Noted Urdu litterateur Ahmed Nadeem Qasmi paid tribute to Ahmed Mirza Jamil during his lifetime. He said, ?The revolution brought about by Noori Nastaliq in the field of Urdu publishing sends out many positive signals. It has at last settled the long-standing dispute about Urdu typewriter?s keys that had raged from the time Pakistan was born. The future generations will surely be indebted to him for this revolution. Dr Ahmed Mirza Jamil passed away unsung on February 17, 2014. May his soul be blessed. *Published in The Express Tribune, March 15th, 2014.* -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 18 20:33:39 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Mar 2014 19:33:39 -0600 Subject: Editing Sinhala and Similar Scripts Message-ID: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Richard Wordingham wrote: >> Typing is a nightmare. > >> When you backspace it destroys multiple keystrokes. > > I suspect this is a widespread and unsolved problem. There are two types of people: 1. those who fully expect Backspace to erase a single keystroke, and feel it is a fatal flaw if it erases an entire combination, and 2. those who fully expect Backspace to erase an entire combination, and feel it is a fatal flaw if it erases just a single keystroke. Unfortunately, both types exist in significant numbers. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ? 
From doug at ewellic.org Tue Mar 18 20:50:48 2014 From: doug at ewellic.org (Doug Ewell) Date: Tue, 18 Mar 2014 19:50:48 -0600 Subject: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org> References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net> <5328BA69.9050502@colson.eu> <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org> Message-ID: Tom Gewecke wrote: >> It seems there?s still a big difference in the second syllable. > > Are you referring to "ryy" (0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA)? > > That is the correct encoding I think. But most fonts don't display it > quite right. On Windows 8.1, still using BabelPad, the "ryy" comes out just as Naena had it. Furthermore, it even comes out right on Windows Phone. See attached images. So there it is. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ? -------------- next part -------------- A non-text attachment was scrubbed... Name: sinhala-babelpad-8.jpg Type: image/jpeg Size: 45785 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sinhala-babelpad-phone.jpg Type: image/jpeg Size: 13984 bytes Desc: not available URL: From daniel.buenzli at erratique.ch Wed Mar 19 08:17:00 2014 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Wed, 19 Mar 2014 14:17:00 +0100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Message-ID: <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch> Le mercredi, 19 mars 2014 ? 02:33, Doug Ewell a ?crit : > There are two types of people: > > 1. those who fully expect Backspace to erase a single keystroke, and > feel it is a fatal flaw if it erases an entire combination, and > > 2. 
those who fully expect Backspace to erase an entire combination, and > feel it is a fatal flaw if it erases just a single keystroke. > > Unfortunately, both types exist in significant numbers. Isn't it possible to classify membership in group 1 or 2 according to script? E.g. I suspect most French-speaking people, when backspacing an ?, would like to erase the whole combination; for ? it seems even more obvious since usually it's introduced with a single keystroke. Best, Daniel From lang.support at gmail.com Wed Mar 19 08:19:55 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Thu, 20 Mar 2014 00:19:55 +1100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Message-ID: LOL, that's why, if the input framework allows it, it's easier to support both approaches to backspace or at least an option to choose one or the other. ; ) Andrew On 19/03/2014 11:37 PM, "Doug Ewell" wrote: > Richard Wordingham wrote: > > Typing is a nightmare. >>> >> >> When you backspace it destroys multiple keystrokes. >>> >> >> I suspect this is a widespread and unsolved problem. >> > > There are two types of people: > > 1. those who fully expect Backspace to erase a single keystroke, and feel > it is a fatal flaw if it erases an entire combination, and > > 2. those who fully expect Backspace to erase an entire combination, and > feel it is a fatal flaw if it erases just a single keystroke. > > Unfortunately, both types exist in significant numbers. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From A.Schappo at lboro.ac.uk Wed Mar 19 09:26:04 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Wed, 19 Mar 2014 14:26:04 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Message-ID: On 19 Mar 2014, at 01:33, Doug Ewell wrote: > Richard Wordingham wrote: > >>> Typing is a nightmare. >> >>> When you backspace it destroys multiple keystrokes. >> >> I suspect this is a widespread and unsolved problem. > > There are two types of people: > > 1. those who fully expect Backspace to erase a single keystroke, and feel it is a fatal flaw if it erases an entire combination, and > > 2. those who fully expect Backspace to erase an entire combination, and feel it is a fatal flaw if it erases just a single keystroke. > > Unfortunately, both types exist in significant numbers. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > WRT Latin. I have just tested with OSX TextEdit and the precomposed character é U+00E9 Backspace erases é Control+Backspace erases ´ leaving one with e I had not realised this was possible until I experimented with combinations of Backspace + alt/command/ctrl/shift André From petercon at microsoft.com Wed Mar 19 09:57:35 2014 From: petercon at microsoft.com (Peter Constable) Date: Wed, 19 Mar 2014 14:57:35 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Message-ID: From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell >>> When you backspace it destroys multiple keystrokes. > There are two types of people: > > 1. those who fully expect Backspace to erase a single keystroke It is nonsensical to talk about erasing a _keystroke_. That would be comparable to erasing a mouse click, or erasing a tap on a touch-sensitive device.
These are user actions that may result in any number of machine states. Unless you can manage to build a time machine, at the time when the erasing is happening, there is no longer any record of what process might have been operating that responded to the user action or of what machine state was the result. All that is available to act on at that point are characters. Peter From emuller at adobe.com Wed Mar 19 10:08:01 2014 From: emuller at adobe.com (Eric Muller) Date: Wed, 19 Mar 2014 08:08:01 -0700 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> Message-ID: <5329B2D1.5030603@adobe.com> On 3/19/2014 7:57 AM, Peter Constable wrote: > It is nonsensical to talk about erasing a _keystroke_. "undo", "revert" the effect of a keystroke. The concept is meaningful. Eric. From doug at ewellic.org Wed Mar 19 11:38:19 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 19 Mar 2014 09:38:19 -0700 Subject: Editing Sinhala and Similar Scripts Message-ID: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net> Andre Schappo wrote: > WRT Latin. I have just tested with OSX TextEdit and the precomposed > character ? U+00E9 > > Backspace erases ? > Control+Backspace erases ? leaving one with e > > I had not realised this was possible until I experimented with > combinations of Backspace + alt/command/ctrl/shift That's the sort of feature I would just love. I also love the Alt+Tab and Windows+Tab features to switch between windows in Windows. I am led to believe that "normal users" (cf. nerds) hate this kind of hidden feature, and either never use it or become annoyed when they invoke it accidentally by hitting the magic key combination. 
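The two behaviours Andre describes for the precomposed character U+00E9 can be imitated with canonical decomposition. The following Python sketch is only an illustration under that assumption; it is not a description of how TextEdit implements either key, and both function names are hypothetical.

```python
import unicodedata


def delete_whole_character(text: str) -> str:
    """Plain Backspace: remove the last code point, accent and all."""
    return text[:-1]


def delete_last_mark(text: str) -> str:
    """Remove only the final combining mark of the last character, if it has one.

    The last code point is decomposed (NFD); when the decomposition ends in a
    combining mark, that mark is dropped and the remainder is recomposed (NFC).
    Otherwise this falls back to removing the whole code point.
    """
    if not text:
        return text
    decomposed = unicodedata.normalize("NFD", text[-1])
    if len(decomposed) > 1 and unicodedata.combining(decomposed[-1]):
        return text[:-1] + unicodedata.normalize("NFC", decomposed[:-1])
    return text[:-1]


word = "caf\u00e9"  # ends in the precomposed e-acute, U+00E9
print(delete_whole_character(word))
print(delete_last_mark(word))
```

The second function only peels marks that Unicode canonical decomposition exposes, so it says nothing about what a dead key or an input method originally did to produce the character.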
-- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From doug at ewellic.org Wed Mar 19 11:39:13 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 19 Mar 2014 09:39:13 -0700 Subject: Editing Sinhala and Similar Scripts Message-ID: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Peter Constable wrote: >> There are two types of people: >> >> 1. those who fully expect Backspace to erase a single keystroke > > It is nonsensical to talk about erasing a _keystroke_. But that's what they expect. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From ken.whistler at sap.com Wed Mar 19 12:28:07 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 19 Mar 2014 17:28:07 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Message-ID: And I think you need to distinguish between *proximate* behavior in an editor and editing behavior in general. Once a user enters editing mode, the expectation that we (the software community writing text editors) have built, in interaction with users, is that within reason, something that you have *just* done in editing, can be easily undone. And that is what "backspace" now means to many users. Do do do do ... undo undo undo undo ... That should get you back to where you were before the do do do do. It is really annoying, particularly to efficient typists, when a sequence of 4 keystrokes is *not* exactly undone by a sequence of 4 backspace strokes. When that occurs, the flow of text composition is suddenly interrupted by forcing the user out of "compose" mode and into a completely different "monitor and check what the state of the display is" mode that can be very annoying. But that is what I am referring to as the proximate behavior. 
An editing implementation can and should collect a reasonable undo buffer, which *does* know about complicated states, including significant operations like selection deletion, which are the most common types of operations that composers really, really, really want to be able to undo. But in cases like that, the backspace key is only a partial aspect of the undo, and I suspect that most people are not all that annoyed when they have to shift out of "compose" mode to accomplish more significant undo operations. But what Peter was pointing out is that in the *generic* case for editing, such as first cursor down at some random location in already existing text, there is no existing history of how that text was created. And thus there are no "keystrokes" to be undone by hitting a backspace at that point. Yet a backspace command has to do *something* reasonable, and my own assessment is that it shouldn't be too different from what a backspace key does during active text entry. So that is the real conundrum here. Getting all of the commands of a text editor to work efficiently the way users expect is itself an art form -- even for relatively simple scripts. So it really should not be too surprising that people have rather intense arguments about how such operations *should* work for abugidas. (Particularly because such operations very often are not occurring in monolingual/monoscriptal contexts, and expectations carry over from one language/script to another.) --Ken > Peter Constable wrote: > > >> There are two types of people: > >> > >> 1. those who fully expect Backspace to erase a single keystroke > > > > It is nonsensical to talk about erasing a _keystroke_. > > But that's what they expect.
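The two cases distinguished in the message above can be sketched roughly in Python. This is only an illustration under stated assumptions: ComposeBuffer is a hypothetical class, not taken from any real editor or input framework, and the caret is assumed to sit at the end of the text. While keystroke history exists, Backspace restores the state before the last keystroke (the "proximate" case); with no history, it can act only on characters, and here it simply drops one code point (the "generic" case).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComposeBuffer:
    """Hypothetical edit buffer; the caret is assumed to be at the end."""

    text: str = ""
    # Snapshots of the text taken just before each recorded keystroke.
    history: List[str] = field(default_factory=list)

    def key(self, new_text: str) -> None:
        """Record one keystroke whose effect is to turn the text into new_text."""
        self.history.append(self.text)
        self.text = new_text

    def backspace(self) -> None:
        if self.history:
            # Proximate case: undo exactly what the last keystroke did.
            self.text = self.history.pop()
        else:
            # Generic case: no record of keystrokes, so act on characters.
            self.text = self.text[:-1]


# Active typing: "e", then a keystroke that composes it into U+00E9.
session = ComposeBuffer()
session.key("e")
session.key("\u00e9")
session.backspace()
print(repr(session.text))

# Existing text opened without any history.
loaded = ComposeBuffer(text="caf\u00e9")
loaded.backspace()
print(repr(loaded.text))
```

Whether the fallback should remove a code point, an NFC character, or a whole cluster is exactly the open question in this thread; the sketch picks one option only to keep the example short.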
From doug at ewellic.org Wed Mar 19 13:07:05 2014 From: doug at ewellic.org (Doug Ewell) Date: Wed, 19 Mar 2014 11:07:05 -0700 Subject: Editing Sinhala and Similar Scripts Message-ID: <20140319110705.665a7a7059d7ee80bb4d670165c8327d.59f7242c57.wbe@email03.secureserver.net> "Whistler, Ken" wrote: > But what Peter was pointing out is that in the *generic* case > for editing, such as first cursor down at some random > location in already existing text, there is no existing history > of how that text was created. And thus there are no "keystrokes" > to be undone by hitting a backspace at that point. Well, of course you and Peter are right, and stated it well as always. Probably a better way for me to say it would be that, for any visual combination of letters or marks that are decomposable in some way, such as an acute accent over an 'e' or a conjunct cluster in an Indic script, there are at least some users who expect Backspace to erase one element of the cluster (the "last," whatever that means) and some who expect it to erase the entire cluster. Each type of user will be frustrated if the other behavior occurs. > So that is the real conundrum here. Getting all of the commands > of a text editor to work efficiently the way users expect is > itself an art form -- even for relatively simple scripts. So it > really should not be too surprising that people have rather > intense arguments about how such operations *should* work > for abugidas. (Particularly because such operations very > often are not occurring in monolingual/monoscriptal > contexts, and expectations carry over from one language/script > to another.) Daniel Bünzli pointed out that French-speaking users would consider 'é' a unitary letter, and would expect Backspace to erase the whole thing, even if "under the hood" it might be encoded in NFD as <0065 0301>. It's not at all clear that a Sinhala user would expect Backspace to delete a cluster of three or four "letters" (Naena certainly didn't like that).
But the two scenarios are quite similar as far as software and encoding are concerned. So maybe a French keyboard could have one Backspace behavior built in, and a Sinhala keyboard could have something different, something that may or may not be possible under various input architectures. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From richard.wordingham at ntlworld.com Wed Mar 19 15:29:09 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 19 Mar 2014 20:29:09 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch> References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell> <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch> Message-ID: <20140319202909.6b7572cc@JRWUBU2> On Wed, 19 Mar 2014 14:17:00 +0100 Daniel Bünzli wrote: > On Wednesday, 19 March 2014 at 02:33, Doug Ewell wrote: > > There are two types of people: > > > > 1. those who fully expect Backspace to erase a single keystroke, > > and feel it is a fatal flaw if it erases an entire combination, and > > > > 2. those who fully expect Backspace to erase an entire combination, > > and feel it is a fatal flaw if it erases just a single keystroke. > > > > Unfortunately, both types exist in significant numbers. And I belong to a third group - I expect it to delete a Unicode character. > Isn't it possible to classify membership of group 1 or 2 according to > script? E.g. I suspect most French-speaking people, when backspacing > an ?, would like to erase the whole combination; for ? it seems even > more obvious since usually it's introduced with a single keystroke. It's not as simple as script. For an English speaker who enters it on a keyboard, it's normally entered with multiple keystrokes, most typically via a dead key. Now, if I type it in using an out of order sequence such as 'e, it is quite reasonable for it to be stored as a single composed character and deleted by backspace.
On the other hand, if I type it in using an XSAMPA-based keyboard sequence such as e_H, I expect the backspace to delete just the accent, just as I am used to for the sequence O_H which yields 2 characters, open o with acute (ɔ́). The diacritic here would not be arbitrary - I would be using it to indicate a specific tone. (It came as a nasty shock to find my e-mail client, Claws on Ubuntu, takes out the entire cluster. For Thai legacy grapheme clusters, it just takes out the last character entered.) At the moment I have made life more difficult for myself by devising a keyboard that generates NFC if the key strokes are in the right order. As a reasonable guide, backspace should not take out more than one NFC character, and I would defend this even for Cyrillic-script tone marking in Serbian. Now, there's supposed to be an interface definition for using incremental keyboard typing as in Keyman, where keyboards can be arranged so that one sees what's been typed in already. Where is it? It is rather important for an application to know when it can normalise input characters. For example, LibreOffice helpfully swaps round a Thai tone mark with a following vowel mark below, with the slightly bizarre consequence that the sequence ko kai, mai ek, sara u, backspace yields . Traditionally, the sequence yields a beep and just - the input handler rejects the SARA U because it does not accord with the character order prescribed by WTT (wing thuk thi). Richard. From petercon at microsoft.com Wed Mar 19 22:43:05 2014 From: petercon at microsoft.com (Peter Constable) Date: Thu, 20 Mar 2014 03:43:05 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Message-ID: If you click into the existing text in this email and backspace, what keystroke will you expect to be "erased"?
Your system has no way of knowing what keystroke might have been involved in creating the text. What it _can_ make sense to talk about is to say that a user expects execution of a particular key sequence, such as pressing a Backspace key, to have a particular editing effect on the content of text. "Erasing a keystroke" and "keystrokes resulting in edits" are different things. One makes sense, the other does not. It may seem like I'm being pedantic, but I think the distinction is important. Our failure is in framing our thinking from years of experience (and perhaps some behaviours originally influenced by typewriter and teletype technologies) in which a keyboard has a bunch of keys that add characters, and variations on that that even include a lot of logic to get input keying sequences that can generate tens of thousands of different characters; but then one or two keys (delete, backspace) that can only operate in very dumb ways. (We've also always assumed that any logic in keying behaviours can be conditioned only by the input sequences, but not by any existing content, but that steps beyond my earlier point.) These constraints in how we think limit possibilities. Peter -----Original Message----- From: Doug Ewell [mailto:doug at ewellic.org] Sent: March 19, 2014 9:39 AM To: Peter Constable; unicode at unicode.org Subject: RE: Editing Sinhala and Similar Scripts Peter Constable wrote: >> There are two types of people: >> >> 1. those who fully expect Backspace to erase a single keystroke > > It is nonsensical to talk about erasing a _keystroke_. But that's what they expect. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell From jlturriff at centurylink.net Wed Mar 19 23:17:24 2014 From: jlturriff at centurylink.net (J.
Leslie Turriff) Date: Wed, 19 Mar 2014 21:17:24 -0700 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Message-ID: <201403192117.24313.jlturriff@centurylink.net> Perhaps it might be useful to be able to distinguish between an "editing mode" and a "composition mode": editing mode would be active when a document is first loaded into the editor, when the editor has no keystroke history to consult, and in this mode the backspace key would merely remove text "glyph by glyph", so to speak, as happens with ASCII text; composition mode would be active when keystrokes have been recorded in a buffer, so that backspace could be used to "unstroke" the original strokes; the "unstroke" operations would mimic the order in which the originals were entered, even if the editor had optimized the composition. Leslie On Wednesday 19 March 2014 20:43:05 Peter Constable wrote: > If you click into the existing text in this email and backspace, what > keystroke will you expect to be "erased"? Your system has no way of knowing > what keystroke might have been involved in creating the text. > > What it _can_ make sense to talk about is to say that a user expects > execution of a particular key sequence, such as pressing a Backspace key, > to have a particular editing effect on the content of text. "Erasing a > keystroke" and "keystrokes resulting in edits" are different things. One > makes sense, the other does not. > > It may seem like I'm being pedantic, but I think the distinction is > important.
Our failure is in framing our thinking from years of experience > (and perhaps some behaviours originally influenced by typewriter and > teletype technologies) in which a keyboard has a bunch of keys that add > characters, and variations on that that even include a lot of logic to get > input keying sequences that can generate tens of thousands of different > characters; but then one or two keys (delete, backspace) that can only > operate in very dumb ways. (We've also always assumed that any logic in > keying behaviours can be conditioned only by the input sequences, but not > by any existing content, but that steps beyond my earlier point.) These > constraints in how we think limit possibilities From lang.support at gmail.com Wed Mar 19 23:21:59 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Thu, 20 Mar 2014 15:21:59 +1100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Message-ID: There is also a distinction between editing an existing document that you opened as distinct from writing a document, going back to a certain point in document and editing that section within the same editing session. In the first case there is no history, in the second case there may be history to work with. Andrew On 20 March 2014 14:43, Peter Constable wrote: > If you click into the existing text in this email and backspace, what > keystroke will you expect to be "erased"? Your system has no way of knowing > what keystroke might have been involved in creating the text. > > What it _can_ make sense to talk about is to say that a user expects > execution of a particular key sequence, such as pressing a Backspace key, > to have a particular editing effect on the content of text. "Erasing a > keystroke" and "keystrokes resulting in edits" are different things. One > makes sense, the other does not.
> > It may seem like I'm being pedantic, but I think the distinction is > important. Our failure is in framing our thinking from years of experience > (and perhaps some behaviours originally influenced by typewriter and > teletype technologies) in which a keyboard has a bunch of keys that add > characters, and variations on that that even include a lot of logic to get > input keying sequences that can generate tens of thousands of different > character; but then one or two keys (delete, backspace) that can only > operate in very dumb ways. (We've also always assumed that any logic in > keying behaviours can be conditioned only by the input sequences, but not > by any existing content, but that steps beyond my earlier point.) These > constraints in how we think limit possibilities > > > Peter > > > -----Original Message----- > From: Doug Ewell [mailto:doug at ewellic.org] > Sent: March 19, 2014 9:39 AM > To: Peter Constable; unicode at unicode.org > Subject: RE: Editing Sinhala and Similar Scripts > > Peter Constable wrote: > > >> There are two types of people: > >> > >> 1. those who fully expect Backspace to erase a single keystroke > > > > It is nonsensical to talk about erasing a _keystroke_. > > But that's what they expect. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunningham at slv.vic.gov.au lang.support at gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lang.support at gmail.com Wed Mar 19 23:25:34 2014 From: lang.support at gmail.com (Andrew Cunningham) Date: Thu, 20 Mar 2014 15:25:34 +1100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <201403192117.24313.jlturriff@centurylink.net> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <201403192117.24313.jlturriff@centurylink.net> Message-ID: On 20 March 2014 15:17, J. Leslie Turriff wrote: > Perhaps it might be useful to be able to distinguish between an > "editing > mode" and a "composition mode": editing mode would be active when a > document > is first loaded into the editor, when the editor has no keystroke history > to > consult, and in this mode the backspace key would merely remove text > "glyph > by glyph", so to speak, as happens with ASCII text; composition mode would > be active when keystrokes have been recorded in a buffer, so that backspace > could be used to "unstroke" the original strokes; the "unstroke" operations > would mimic the order in which the originals were entered, even if the > editor > had optomized the composition. > > > Although that requires an input framework and application that utilise that buffer in various ways during "composition mode". It is possible, and in the past I have written a manual and run training on advanced editing for Dinka language translators on how to utilise such features. But not many applications support such features. Andrew -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunningham at slv.vic.gov.au lang.support at gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Wed Mar 19 23:59:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 20 Mar 2014 05:59:49 +0100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> Message-ID: The Backspace key has never been considered an "Undo" key. Some OSes or keyboards provide an Undo key or an equivalent shortcut like CTRL+Z (but even in this case the editor may want to undo multiple successive insertions in one operation). In the time of typewriters, backspace meant going back one full cluster (in order to be able to retype it completely, e.g. with a blanking Tipp-Ex). Its effect was to go backward to the start of the cluster. On typewriters and modern computer keyboards that have dead keys, the backspace key ignored the dead key and went backward to the previous cluster. So dead keys are not counted. With keyboards using a compose-key method, there is NO character output in the edit buffer as long as the compose sequence is not complete; a single string containing the result of the composition is then inserted. Users will not see anything inserted in the edited text before that, so there's nothing you can delete with Backspace. Users also should not have to care whether the composed sequence was encoded in NFC or NFD form (with precomposed characters, or with a decomposed base character followed by diacritics). So they expect just one cluster. If you really want to delete an accent on top of a Latin letter, that is certainly not what the Backspace key will usually perform. You would need another key such as ALT+Backspace to *transform* the previous cluster before the cursor into a shorter one. But in this case it is sometimes not really possible to predict which diacritic will be deleted if there are multiple ones and they are unordered (i.e.
these combining diacritics have **distinct** and **non-zero** combining classes): maybe all these diacritics with distinct non-zero combining classes should be deleted in a single operation, or otherwise just the last **ordered** diacritic if there's still one. (Note: here the term "diacritic" is meant "broadly" and it may be any combining mark, or a joiner control like CGJ, ZWJ, or ZWNJ, and sometimes a modifier letter that may participate in the same cluster, such as apostrophe or middle dot: the Catalan letter L with middle dot may be viewed in the editor as the letter L containing a combined diacritic, so ALT+BACKSPACE could replace the L with middle dot by the letter L alone, even if it's not canonically decomposable, as long as the editor knows that it is operating within a Catalan locale.) In all cases, the action being performed by Backspace or ALT+Backspace is completely independent of the underlying Unicode encoding and should also be independent of the normalization form (except in an advanced technical editor mode such as "visible controls", where every encoded character is rendered separately with a special form to make it visible). In my opinion the standard edit mode (working in visual WYSIWYG mode) should not depend on the encoding, and Backspace should not create in the edited text new oddities that were not really inserted and made visible immediately when they were first entered. Indic diacritics are entered separately from the base letter and they are combined progressively. They are also ordered; for this reason Backspace can remove them in a predictable order, one by one.
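As a rough illustration of the points above about normalization forms and combining classes, here is a small Python sketch using the standard unicodedata module. The helper names are hypothetical, and delete_cluster rests on a simplifying assumption (a cluster is one base character plus any trailing marks); it does not implement the grapheme cluster rules of UAX #29.

```python
import unicodedata


def delete_code_point(text: str) -> str:
    """Backspace that removes exactly one code point."""
    return text[:-1]


def delete_cluster(text: str) -> str:
    """Backspace that removes trailing marks plus the base before them.

    Simplifying assumption: a cluster is one base character followed by
    characters whose general category starts with "M" (marks).
    """
    i = len(text)
    while i > 0 and unicodedata.category(text[i - 1]).startswith("M"):
        i -= 1
    return text[: max(i - 1, 0)]


nfc = unicodedata.normalize("NFC", "e\u0301")  # one code point, U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")   # two code points, U+0065 U+0301

# Deleting by code point gives different results for the two forms ...
print(repr(delete_code_point(nfc)), repr(delete_code_point(nfd)))
# ... while deleting by cluster does not depend on the form.
print(repr(delete_cluster(nfc)), repr(delete_cluster(nfd)))

# Marks with distinct non-zero combining classes may be reordered by
# normalization, so "the last diacritic typed" is not a stable notion.
typed = "e\u0301\u0323"  # acute (class 230) typed before dot below (class 220)
print([unicodedata.combining(c) for c in typed])
print(unicodedata.normalize("NFD", typed) == "e\u0323\u0301")
```

The last two lines show why a Backspace that removes "the final mark" of a normalized string may not remove the mark that was typed last.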
What holds for Indic diacritics could also be said about Hebrew and Arabic diacritics entered separately (even if sometimes they could be unordered: Backspace will still delete all diacritics that are in the same unordered group, even if it keeps the base letter). But for Latin/Greek/Cyrillic keyboards that use dead keys for entering unordered diacritics (and that are not even made visible in the document before you have typed the base letter), it makes no sense for Backspace to choose between these diacritics. Backspace will then delete the full cluster up to the base letter. 2014-03-20 5:21 GMT+01:00 Andrew Cunningham : > There is also a distinction between editing an existing document that you > opened as distinct from writing a document, going back to a certain point > in document and editing that section within the same editing session. > > In the first case there is no history, in the second case there may be > history to work with. > > Andrew > > > On 20 March 2014 14:43, Peter Constable wrote: > >> If you click into the existing text in this email and backspace, what >> keystroke will you expect to be "erased"? Your system has no way of knowing >> what keystroke might have been involved in creating the text. >> >> What it _can_ make sense to talk about is to say that a user expects >> execution of a particular key sequence, such as pressing a Backspace key, >> to have a particular editing effect on the content of text. "Erasing a >> keystroke" and "keystrokes resulting in edits" are different things. One >> makes sense, the other does not. >> >> It may seem like I'm being pedantic, but I think the distinction is >> important.
Our failure is in framing our thinking from years of experience >> (and perhaps some behaviours originally influenced by typewriter and >> teletype technologies) in which a keyboard has a bunch of keys that add >> characters, and variations on that that even include a lot of logic to get >> input keying sequences that can generate tens of thousands of different >> character; but then one or two keys (delete, backspace) that can only >> operate in very dumb ways. (We've also always assumed that any logic in >> keying behaviours can be conditioned only by the input sequences, but not >> by any existing content, but that steps beyond my earlier point.) These >> constraints in how we think limit possibilities >> >> >> Peter >> >> >> -----Original Message----- >> From: Doug Ewell [mailto:doug at ewellic.org] >> Sent: March 19, 2014 9:39 AM >> To: Peter Constable; unicode at unicode.org >> Subject: RE: Editing Sinhala and Similar Scripts >> >> Peter Constable wrote: >> >> >> There are two types of people: >> >> >> >> 1. those who fully expect Backspace to erase a single keystroke >> > >> > It is nonsensical to talk about erasing a _keystroke_. >> >> But that's what they expect. 
>> >> -- >> Doug Ewell | Thornton, CO, USA >> http://ewellic.org | @DougEwell >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > > > -- > Andrew Cunningham > Project Manager, Research and Development > (Social and Digital Inclusion) > Public Libraries and Community Engagement > State Library of Victoria > 328 Swanston Street > Melbourne VIC 3000 > Australia > > Ph: +61-3-8664-7430 > Mobile: 0459 806 589 > Email: acunningham at slv.vic.gov.au > lang.support at gmail.com > > http://www.openroad.net.au/ > http://www.mylanguage.gov.au/ > http://www.slv.vic.gov.au/ > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Thu Mar 20 02:31:41 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 20 Mar 2014 00:31:41 -0700 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <201403192117.24313.jlturriff@centurylink.net> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <201403192117.24313.jlturriff@centurylink.net> Message-ID: <532A995D.60102@ix.netcom.com> On 3/19/2014 9:17 PM, J. 
Leslie Turriff wrote: > Perhaps it might be useful to be able to distinguish between an "editing > mode" and a "composition mode": editing mode would be active when a document > is first loaded into the editor, when the editor has no keystroke history to > consult, and in this mode the backspace key would merely remove text "glyph > by glyph", so to speak, as happens with ASCII text; composition mode would > be active when keystrokes have been recorded in a buffer, so that backspace > could be used to "unstroke" the original strokes; the "unstroke" operations > would mimic the order in which the originals were entered, even if the editor > had optomized the composition. It's more complicated than that. Many editors don't (always) support "micro" undo. At some point, keystrokes (or their result) are coalesced and an undo will delete entire words or phrases, perhaps entire bullet items on a slide. If done right, this will feel natural. If I've made edits to my document in three places, inserting the same word, then it feels natural to "undo" these as whole words (and not slavishly by keystroke - including all the false starts and backspace keys). At the current caret position, one would expect the undo to be less aggressive and act more like a backspace. But in that case the user would (roughly) remember the keystrokes that just happened, so inverting the sequence feels more natural. That same memory is why backspacing by composition step (keystroke) is appealing - you intuitively know how many wrong keys were pressed. But many user interfaces do not support that. Composing SMS with the T9 interface will let you erase characters from the composed string, but will not revert to earlier word-guesses, so you can't cycle back, except by erasing until the beginning and then starting over. 
For that interface, the upside is that sometimes breaking a word apart by freezing the composition of the leading part, erasing parts that don't fit and then composing the remainder is the most efficient way to get around some limitations of the composition method. Whatever the details, the design of an ideal user interface should not drive, or worse, dictate the character encoding - nor should the reverse be true. A./ From A.Schappo at lboro.ac.uk Thu Mar 20 07:24:38 2014 From: A.Schappo at lboro.ac.uk (Andre Schappo) Date: Thu, 20 Mar 2014 12:24:38 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net> References: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net> Message-ID: <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk> WRT Hangul Syllables & OSX TextEdit Take a Hangul Syllable such as ? backspace erases the whole syllable control+backspace erases one jamo at a time from the syllable André Schappo On 19 Mar 2014, at 16:38, Doug Ewell wrote: > Andre Schappo wrote: > >> WRT Latin. I have just tested with OSX TextEdit and the precomposed >> character é U+00E9 >> >> Backspace erases é >> Control+Backspace erases ´ leaving one with e >> >> I had not realised this was possible until I experimented with >> combinations of Backspace + alt/command/ctrl/shift > > That's the sort of feature I would just love. I also love the Alt+Tab > and Windows+Tab features to switch between windows in Windows. I am led > to believe that "normal users" (cf. nerds) hate this kind of hidden > feature, and either never use it or become annoyed when they invoke it > accidentally by hitting the magic key combination.
> > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > From shizhao at gmail.com Thu Mar 20 08:36:56 2014 From: shizhao at gmail.com (shi zhao) Date: Thu, 20 Mar 2014 21:36:56 +0800 Subject: two Hanzi Message-ID: Please add two Hanzi (up ?+ down ?) and (up ? + down ?) see http://www.term.org.cn/CN/abstract/abstract9314.shtml# Included in: * Zhonghua Zihai??????, 1994: 1770. * Lu Gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219. (up ?+ down ?) = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 ) (up ? + down ?) = coronium = newtonium (see http://yedict.com/zslistbs.asp?word=%C6%F87 ) My blog: http://shizhao.org twitter: https://twitter.com/shizhao [[zh:User:Shizhao]] From mpsuzuki at hiroshima-u.ac.jp Thu Mar 20 08:50:58 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Thu, 20 Mar 2014 22:50:58 +0900 Subject: ["Unicode"] two Hanzi In-Reply-To: References: Message-ID: <532AF242.60703@hiroshima-u.ac.jp> If they are characters officially standardized for the elements by the PRC government, the China NB will submit them to ISO/IEC 10646 via the Urgently Needed Characters process. Are they official? Regards, mpsuzuki On 03/20/2014 10:36 PM, shi zhao wrote: > Please add two Hanzi (up ?+ down ?) and (up ? + down ?) > > see http://www.term.org.cn/CN/abstract/abstract9314.shtml# > > Included in: > * Zhonghua Zihai??????, 1994: 1770. > * Lu Gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219. > > > (up ?+ down ?) = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 ) > (up ? + down ?)
= coronium = newtonium (see > http://yedict.com/zslistbs.asp?word=%C6%F87 ) > > > > My blog: http://shizhao.org > twitter: https://twitter.com/shizhao > > [[zh:User:Shizhao]] > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From shizhao at gmail.com Thu Mar 20 09:02:50 2014 From: shizhao at gmail.com (shi zhao) Date: Thu, 20 Mar 2014 22:02:50 +0800 Subject: ["Unicode"] two Hanzi In-Reply-To: <532AF242.60703@hiroshima-u.ac.jp> References: <532AF242.60703@hiroshima-u.ac.jp> Message-ID: there is former chinese translation of newtonium and nebulium see https://en.wikipedia.org/wiki/Coronium https://en.wikipedia.org/wiki/Nebulium Chinese wikipedia: http://zh.wikipedia.org/ My blog: http://shizhao.org twitter: https://twitter.com/shizhao [[zh:User:Shizhao]] 2014-03-20 21:50 GMT+08:00 suzuki toshiya : > If they are officially standardized characters for the > elements by PRC government, China NB will submit them > to ISO/IEC 10646 via Urgently Needed Characters process. > They are official? > > Regards, > mpsuzuki > > > On 03/20/2014 10:36 PM, shi zhao wrote: >> >> plese add two Hanzi (up ?+ down ?) and (up ? + down ?) >> >> see http://www.term.org.cn/CN/abstract/abstract9314.shtml# >> >> include in : >> * Zhonghua Zihai??????, 1994: 1770. >> * Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219. >> >> >> (up ?+ down ?) = nebulium (see >> http://yedict.com/zslistbs.asp?word=%C6%F84 ) >> (up ? + down ?) 
= coronium = newtonium (see >> http://yedict.com/zslistbs.asp?word=%C6%F87 ) >> >> >> >> My blog: http://shizhao.org >> twitter: https://twitter.com/shizhao >> >> [[zh:User:Shizhao]] >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > From jknappen at web.de Thu Mar 20 09:12:53 2014 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Thu, 20 Mar 2014 15:12:53 +0100 Subject: Aw: Re: ["Unicode"] two Hanzi In-Reply-To: <532AF242.60703@hiroshima-u.ac.jp> References: , <532AF242.60703@hiroshima-u.ac.jp> Message-ID: An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu Mar 20 09:59:00 2014 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 20 Mar 2014 14:59:00 +0000 Subject: ["Unicode"] two Hanzi In-Reply-To: References: <532AF242.60703@hiroshima-u.ac.jp> Message-ID: On 20 March 2014 14:12, "J?rg Knappen" wrote: > > Who writes a proposal? I wish that there was a mechanism for encoding CJK characters that allowed individuals to simply submit characters with appropriate evidence to Unicode, and after review they could be added to the next version Unicode, but the reality is that you need to go through a long and bureaucratic process involving the Ideographic Rapporteur Group (IRG), with the result that it may take ten years to get a CJK character encoded. Even the Unicode Consortium seems powerless to overcome IRG bureaucracy, as the sorry tale below illustrates. 
In 2012 I wrote a proposal to encode 226 Han characters, including two fish characters previously requested by Shi Zhao on this list , which I submitted to the Unicode Technical Committee (UTC): The UTC accepted this document, and included the suggested characters in the Unicode submission to the IRG for inclusion in the CJK-F extension: This was discussed at the IRG meeting in Hanoi in November 2012 (http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRG39.htm), but the Unicode submission for CJK-F was entirely rejected by IRG just because the submission was a couple of days late. The UTC later submitted a proposal to encode 19 of the original characters (including Shi Zhao's two fish characters) as urgently needed characters: But this was rejected by IRG in November last year as they considered that these characters were not urgent enough, so now we will have to wait another four or five years before they can be considered for CJK-G. Good luck getting the characters for newtonium and nebulium encoded any sooner! Andrew From mpsuzuki at hiroshima-u.ac.jp Thu Mar 20 11:20:10 2014 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Fri, 21 Mar 2014 01:20:10 +0900 Subject: ["Unicode"] two Hanzi In-Reply-To: References: <532AF242.60703@hiroshima-u.ac.jp> Message-ID: <532B153A.1080100@hiroshima-u.ac.jp> Hi, I have no objection against the impression of the slowness, but please don't say IRG as bureaucratic. IRG members are pushing themselves to their limits for reviewing process of the thousands of the submitted characters. Although IRG could not response "here you are" immediately to the voice "give me", but IRG is not saying "go away". In my personal impression, the UNC submissions from the experts are slightly difficult for other IRG members to evaluate about their urgency. Taking Chinese UNC submission, the urgency is justified by the update of the normative Hanzi table. 
Until the standardization of the characters in PRC's UNC, the governmental procurements have the difficulty to request the feature to interchange the normative characters. Apparently, it is not only the domestic problem in China, but also the problems for the industries trading around Chinese market. If I submit some characters sampled from a dictionary as UNC, how I could make the delegates sympathized as "they are also urgently needed as other governmental requests"? I don't have good idea. Regards, mpsuzuki On 03/20/2014 11:59 PM, Andrew West wrote: > On 20 March 2014 14:12, "J?rg Knappen" wrote: >> >> Who writes a proposal? > > I wish that there was a mechanism for encoding CJK characters that > allowed individuals to simply submit characters with appropriate > evidence to Unicode, and after review they could be added to the next > version Unicode, but the reality is that you need to go through a long > and bureaucratic process involving the Ideographic Rapporteur Group > (IRG), with the result that it may take ten years to get a CJK > character encoded. Even the Unicode Consortium seems powerless to > overcome IRG bureaucracy, as the sorry tale below illustrates. > > In 2012 I wrote a proposal to encode 226 Han characters, including two > fish characters previously requested by Shi Zhao on this list > , > which I submitted to the Unicode Technical Committee (UTC): > > > > The UTC accepted this document, and included the suggested characters > in the Unicode submission to the IRG for inclusion in the CJK-F > extension: > > > > This was discussed at the IRG meeting in Hanoi in November 2012 > (http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRG39.htm), but the > Unicode submission for CJK-F was entirely rejected by IRG just because > the submission was a couple of days late. 
> > The UTC later submitted a proposal to encode 19 of the original > characters (including Shi Zhao's two fish characters) as urgently > needed characters: > > > > But this was rejected by IRG in November last year as they considered > that these characters were not urgent enough, so now we will have to > wait another four or five years before they can be considered for > CJK-G. > > Good luck getting the characters for newtonium and nebulium encoded any sooner! > > Andrew > From rscook at wenlin.com Thu Mar 20 11:43:31 2014 From: rscook at wenlin.com (Richard COOK) Date: Thu, 20 Mar 2014 09:43:31 -0700 Subject: two Hanzi In-Reply-To: References: Message-ID: On Mar 20, 2014, at 6:36 AM, shi zhao wrote: > plese add two Hanzi (up ?+ down ?) and (up ? + down ?) > > see http://www.term.org.cn/CN/abstract/abstract9314.shtml# > > include in : > * Zhonghua Zihai??????, 1994: 1770. > * Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219. > > > (up ?+ down ?) = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 ) > (up ? + down ?) = coronium = newtonium (see > http://yedict.com/zslistbs.asp?word=%C6%F87 ) Interesting, yedict.com lists a few characters as "?unicode??", some repeatedly. -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen shot 2014-03-20 at 9.20.54 AM.png Type: image/png Size: 47807 bytes Desc: not available URL: -------------- next part -------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: Screen shot 2014-03-20 at 9.21.10 AM.png Type: image/png Size: 18639 bytes Desc: not available URL: -------------- next part -------------- ??? =?!?? ??? =? ??? !?? # [U+2C1CF] Ext E ??? !?? One of these is in Ext E (from V-Source), but the other three seem not to be encoded. None is tracked in TR45. I'm adding them to the CDL database ... they could be tracked in TR45 if someone does a proposal to document them ... this would ensure that IRG at least looks at them. 
Note that the structure may differ ... ??X ??X ??X might all refer to the same abstract character ... -Richard > > > My blog: http://shizhao.org > twitter: https://twitter.com/shizhao > > [[zh:User:Shizhao]] > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From kojiishi at gluesoft.co.jp Thu Mar 20 12:44:56 2014 From: kojiishi at gluesoft.co.jp (Koji Ishii) Date: Thu, 20 Mar 2014 17:44:56 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk> References: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net> <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk> Message-ID: <1A2FCF7B-7FE5-4A41-BB43-315DFB9E81E1@gluesoft.co.jp> The Japanese Windows IME does something similar. After converted text has been committed, Backspace erases the last converted character, while Ctrl+Backspace pressed right after a commit undoes the commit and puts the text back into the converted (i.e., undetermined or editing) state. /koji On Mar 20, 2014, at 9:24 PM, Andre Schappo wrote: WRT Hangul Syllables & OSX TextEdit Take a Hangul Syllable such as ? backspace erases the whole syllable control+backspace erases one jamo at a time from the syllable André Schappo On 19 Mar 2014, at 16:38, Doug Ewell wrote: > Andre Schappo wrote: > >> WRT Latin. I have just tested with OSX TextEdit and the precomposed >> character é U+00E9 >> >> Backspace erases é >> Control+Backspace erases ? leaving one with e >> >> I had not realised this was possible until I experimented with >> combinations of Backspace + alt/command/ctrl/shift > > That's the sort of feature I would just love. I also love the Alt+Tab > and Windows+Tab features to switch between windows in Windows. I am led > to believe that "normal users" (cf. 
nerds) hate this kind of hidden > feature, and either never use it or become annoyed when they invoke it > accidentally by hitting the magic key combination. > > -- > Doug Ewell | Thornton, CO, USA > http://ewellic.org | @DougEwell > _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From andrewcwest at gmail.com Fri Mar 21 05:13:11 2014 From: andrewcwest at gmail.com (Andrew West) Date: Fri, 21 Mar 2014 10:13:11 +0000 Subject: two Hanzi In-Reply-To: References: Message-ID: On 20 March 2014 16:43, Richard COOK wrote: > > Interesting, yedict.com lists a few characters as "?unicode??", some repeatedly. > > ??? =?!?? > ??? =? > ??? !?? # [U+2C1CF] Ext E > ??? !?? > > One of these is in Ext E (from V-Source), but the other three seem not to be encoded. The "Zhonghua Zihai" ?????? dictionary includes thousands of characters not yet encoded in Unicode, and just under the ? radical yedict.com lists 24 not-in-Unicode characters from "Zhonghua Zihai", of which 18 are not encoded or included in CJK-E: ??? ??? = U+520F ? ??? ???? ??? = CJK-E U+2C1CF ??? ??? ??? = CJK-E U+2C1D0 ??? ??? ??? ??? ??? = CJK-E U+2C1D1 ???? ??? ???? ??? ???? ??? ??? = CJK-E U+2C1D2 ????? != U+20103 ?? (?????) ??? = CJK-E U+2C1D3 ???????? != U+2010B ?? (????????) ??? > Note that the structure may differ ... > > ??X > ??X > ??X > > might all refer to the same abstract character ... ... but nevertheless would not be unified according to Annex S. Andrew From velterop at gmail.com Fri Mar 21 06:14:50 2014 From: velterop at gmail.com (Jan Velterop) Date: Fri, 21 Mar 2014 11:14:50 +0000 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol Message-ID: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? 
denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic. A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/) I am seeking comments and support for this proposal. Jan Velterop From jknappen at web.de Fri Mar 21 09:33:07 2014 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Fri, 21 Mar 2014 15:33:07 +0100 Subject: Aw: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From velterop at gmail.com Fri Mar 21 10:22:37 2014 From: velterop at gmail.com (Jan Velterop) Date: Fri, 21 Mar 2014 15:22:37 +0000 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> Message-ID: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> But are the chances nil? It would be a nice complement to the series of ?, ?, ?, etcetera and perform a similar function. 
A symbol for Creative Commons, presumably a double c in a circle, would probably indicate that the document in question is covered by one of the CC licences, but it wouldn't be clear which one, which may be an impediment to having a symbol. Similarly, copyleft is also a licensing scheme, and as such is not quite as unambiguous as ?, ?, and ? are. Also, neither a cc nor a copyleft symbol follows the same 'single encircled letter' convention. For the encircled 'a' symbol for open access it is proposed to use this definition: "The symbol for 'open access', if applied to documents and images, indicates their free availability, on the internet or otherwise, permitting any users to read, download, copy, distribute, (re)print, search, or link to the full texts of such documents, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself or to printing materials and facilities. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited." Jan Velterop On 21 Mar 2014, at 14:33, Jörg Knappen wrote: > Even when this symbol really catches on (what I doubt because it is too close to the @ sign in the first place) chance are low that it will be encoded in UNicode. Precedents like the Creative Commons sign or the Copyleft sign have been discussed on this mailing list (search the archives for the relevant threads) but were never encoded in UNicode. > > When the symbol does not catch on, why should it be encoded in UNicode? > > --Jörg Knappen > > Gesendet: Freitag, 21. März 2014 um 12:14 Uhr > Von: "Jan Velterop" > An: unicode at unicode.org > Betreff: New symbol to denote true open access (e.g. 
to scholarly literature), analogous to the copyright symbol > May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic. > > A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html > > The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/) > > I am seeking comments and support for this proposal. > > Jan Velterop > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From slevin at signpuddle.net Fri Mar 21 10:55:01 2014 From: slevin at signpuddle.net (Stephen E Slevinski Jr) Date: Fri, 21 Mar 2014 10:55:01 -0500 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> Message-ID: <532C60D5.9040306@signpuddle.net> On 3/21/14, 10:22 AM, Jan Velterop wrote: > But are the chances nil? Unicode won't encode new symbols without demonstrated use. 
A recent exception was a currency symbol, but it had institutional support. If your new symbol gains widespread use, there is a chance. If you cannot demonstrate anyone using the symbol, the chance is nil. Regards, -Steve From asmusf at ix.netcom.com Fri Mar 21 11:06:59 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 21 Mar 2014 09:06:59 -0700 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> Message-ID: <532C63A3.7080007@ix.netcom.com> On 3/21/2014 8:22 AM, Jan Velterop wrote: > But are the chances nil? Essentially you are trying to create a symbol for "this material is placed in the public domain". If you get that symbol adopted by authorities similar to those that created ©, then you would see it encoded in due time. If not, it would have to become massively adopted to become a "de-facto" convention first, but, without an encoded character, that is really unlikely. So, if you are serious about this idea, the route is to get the convention formally adopted first. A./ > It would be a nice complement to the series of ?, ?, ?, etcetera and > perform a similar function. A symbol for Creative Commons, presumably > a double c in a circle, would probably indicate the document in > question is covered by one of the CC licences, but it wouldn't be > clear by which one, which may be an impediment for having a symbol. > Similarly, copyleft is also a licensing scheme, and as such is not > quite as unambiguous as ?, ?, and ? are. Also, neither a cc or a > copyleft symbol is in the same 'single encircled letter' convention. 
> > For the encircled 'a' symbol for open access it is proposed to use > this definition: > > "The symbol for 'open access', if applied to documents and images, > indicates their free availability, on the internet or otherwise, > permitting any users to read, download, copy, distribute, > (re)print, search, or link to the full texts of such documents, > crawl them for indexing, pass them as data to software, or use > them for any other lawful purpose, without financial, legal, or > technical barriers other than those inseparable from gaining > access to the internet itself or to printing materials and > facilities. The only constraint on reproduction and distribution, > and the only role for copyright in this domain, should be to give > authors control over the integrity of their work and the right to > be properly acknowledged and cited. > > > Jan Velterop > > On 21 Mar 2014, at 14:33, J?rg Knappen > wrote: > >> Even when this symbol really catches on (what I doubt because it is >> too close to the @ sign in the first place) chance are low that it >> will be encoded in UNicode. Precedents like the Creative Commons sign >> or the Copyleft sign have been discussed on this mailing list (search >> the archives for the relevant threads) but were never encoded in UNicode. >> When the symbol does not catch on, why should it be encoded in UNicode? >> --J?rg Knappen >> *Gesendet:* Freitag, 21. M?rz 2014 um 12:14 Uhr >> *Von:* "Jan Velterop" > >> *An:* unicode at unicode.org >> *Betreff:* New symbol to denote true open access (e.g. to scholarly >> literature), analogous to the copyright symbol >> May I propose a new Unicode symbol to denote true open access, for >> instance applied to scholarly literature, in a similar way that ? and >> ? denote copyright and registered trademarks respectively? 
The >> proposed symbol is an encircled lower case letter a, in particular in >> a font where the a has a 'tail', as in a font like Arial, for >> instance, and not as in a font like Century Gothic. >> >> A sketch of what I have in mind is here: >> http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html >> >> The intended use would be for documents and images that have been >> published with so-called BOAI-compliant open access >> (http://www.budapestopenaccessinitiative.org/read), meaning that all >> reuse is permitted, with the only permissible condition that the >> author(s) should be acknowledged (CC_BY licence: >> http://creativecommons.org/licenses/by/4.0/). This condition would >> not be mandatory, and also public domain, CC-0 licences would be >> denoted by the proposed symbol >> (http://creativecommons.org/publicdomain/zero/1.0/) >> >> I am seeking comments and support for this proposal. >> >> Jan Velterop >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From velterop at gmail.com Fri Mar 21 12:17:17 2014 From: velterop at gmail.com (Jan Velterop) Date: Fri, 21 Mar 2014 17:17:17 +0000 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <532C63A3.7080007@ix.netcom.com> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com> <532C63A3.7080007@ix.netcom.com> Message-ID: <3CBFA1D1-872C-44A0-85D0-9BB593F54014@gmail.com> Apparently it is already in Unicode, as ⓐ (U+24D0), according to anonymous feedback. No further need for a formal proposal. 
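The code point mentioned above can be checked against the Unicode character database. This is a minimal sketch using Python's standard unicodedata module; the variable name `names` is illustrative and not from the thread.

```python
import unicodedata

# Look up the official character names for the copyright sign, the
# registered sign, and the code point reported in the message (U+24D0).
code_points = ["\u00A9", "\u00AE", "\u24D0"]
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in code_points}

for code_point, name in names.items():
    print(code_point, name)
```

The lookup only shows that an encircled small letter a is already encoded. Whether it acquires an "open access" meaning is a matter of convention, which the database does not record.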
Jan Velterop On 21 Mar 2014, at 16:06, Asmus Freytag wrote: > On 3/21/2014 8:22 AM, Jan Velterop wrote: >> But are the chances nil? > > Essentially you are trying to create a symbol for "this material is placed in the public domain". If you get that symbol adopted by similar authorities as those that created ?, then you would see it encoded in due time. If not, it would have to become massively adopted to become a "de-facto" convention first, but, without an encoded character, that is really unlikely. So, if you are serious about his idea, the rout is to get the convention formally adopted first. > > A./ >> It would be a nice complement to the series of ?, ?, ?, etcetera and perform a similar function. A symbol for Creative Commons, presumably a double c in a circle, would probably indicate the document in question is covered by one of the CC licences, but it wouldn't be clear by which one, which may be an impediment for having a symbol. Similarly, copyleft is also a licensing scheme, and as such is not quite as unambiguous as ?, ?, and ? are. Also, neither a cc or a copyleft symbol is in the same 'single encircled letter' convention. >> >> For the encircled 'a' symbol for open access it is proposed to use this definition: >> >> "The symbol for 'open access', if applied to documents and images, indicates their free availability, on the internet or otherwise, permitting any users to read, download, copy, distribute, (re)print, search, or link to the full texts of such documents, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself or to printing materials and facilities. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. 
>> >> Jan Velterop >> >> On 21 Mar 2014, at 14:33, J?rg Knappen wrote: >> >>> Even when this symbol really catches on (what I doubt because it is too close to the @ sign in the first place) chance are low that it will be encoded in UNicode. Precedents like the Creative Commons sign or the Copyleft sign have been discussed on this mailing list (search the archives for the relevant threads) but were never encoded in UNicode. >>> >>> When the symbol does not catch on, why should it be encoded in UNicode? >>> >>> --J?rg Knappen >>> >>> Gesendet: Freitag, 21. M?rz 2014 um 12:14 Uhr >>> Von: "Jan Velterop" >>> An: unicode at unicode.org >>> Betreff: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol >>> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic. >>> >>> A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html >>> >>> The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/) >>> >>> I am seeking comments and support for this proposal. 
>>> >>> Jan Velterop >>> _______________________________________________ >>> Unicode mailing list >>> Unicode at unicode.org >>> http://unicode.org/mailman/listinfo/unicode >> >> >> >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jf at colson.eu Fri Mar 21 15:42:40 2014 From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=) Date: Fri, 21 Mar 2014 21:42:40 +0100 Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> Message-ID: <532CA440.6090300@colson.eu> Le 21/03/14 12:14, Jan Velterop a ?crit : > May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic. > > A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html Could ? do the job ? From budelberger.richard at wanadoo.fr Fri Mar 21 16:49:39 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Fri, 21 Mar 2014 22:49:39 +0100 (CET) Subject: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol In-Reply-To: <532CA440.6090300@colson.eu> References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com> <532CA440.6090300@colson.eu> Message-ID: <1708827360.38258.1395438579815.JavaMail.www@wwinf1p12> > Message du 21/03/14 21:56 > De : "Jean-Fran?ois Colson" > A : unicode at unicode.org > Copie ? 
: > Objet : Re: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol > > Le 21/03/14 12:14, Jan Velterop a écrit : >> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that © and ® denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic. > > A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html > Could ? do the job ? No, or perhaps in the meantime? From verdy_p at wanadoo.fr Sat Mar 22 03:41:40 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 22 Mar 2014 09:41:40 +0100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140322000439.4af309a5@JRWUBU2> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> Message-ID: 2014-03-22 1:04 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Thu, 20 Mar 2014 05:59:49 +0100 > Philippe Verdy wrote: > Not all Indic diacritics have combining class 0, and Hebrew diacritics > have non-zero combining classes. > Did I say anything else? You have probably misread me. I wrote "distinct and non-zero". You forgot the term "AND", which is important because it gives the condition under which combining characters may be reordered during normalization, so that their relative encoding order is unpredictable (independently of the fact that they may be precomposed). 
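The "distinct and non-zero" condition can be illustrated with a short sketch in Python, using only the standard unicodedata module. The choice of C with COMBINING CEDILLA and COMBINING ACUTE ACCENT follows the example discussed in this thread. The four spellings listed are an assumption about which encodings are meant, and the variable names are illustrative.

```python
import unicodedata

CEDILLA = "\u0327"  # COMBINING CEDILLA
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT

# Canonical combining classes of the two marks. They are distinct and
# non-zero, so normalization is allowed to reorder them.
classes = {unicodedata.name(m): unicodedata.combining(m) for m in (CEDILLA, ACUTE)}

# Four spellings of the same text (assumed): both mark orders, plus two
# partly precomposed forms.
encodings = [
    "C" + CEDILLA + ACUTE,
    "C" + ACUTE + CEDILLA,
    "\u00C7" + ACUTE,    # LATIN CAPITAL LETTER C WITH CEDILLA + acute
    "\u0106" + CEDILLA,  # LATIN CAPITAL LETTER C WITH ACUTE + cedilla
]

# Canonically equivalent strings have the same NFD form.
nfd_forms = {unicodedata.normalize("NFD", s) for s in encodings}

print(classes)
print([f"U+{ord(ch):04X}" for form in nfd_forms for ch in form])
```

All four spellings collapse to a single NFD sequence, so the stored order of the two marks says nothing reliable about the order in which the user typed them.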
So if you enter or , you get in the editor's backing store some encoding form (which may be precomposed or not, or with diacritics not necessarily in normalized order, and all these 4 possible encodings are canonically equivalent). Then, if you press Backspace, the effect should not depend on whether you just entered these keystrokes or loaded the text and clicked after the sequence before pressing Backspace. How can you predict which character to remove? That is why it should delete BOTH the CEDILLA and the ACUTE here: they have distinct and non-zero combining classes, and so are unordered. The rationale holds as well for Hebrew points (most of them have distinct non-zero combining classes when they are used in sequences). But it won't apply to "diacritics" (combining characters or joiner controls like CGJ, ZWJ and ZWNJ, and possibly even some other format controls) that have combining class 0, because their encoding order is significant, so you know where to stop the effect of Backspace. I see absolutely no reason why Backspace would arbitrarily delete only the last encoded character when users cannot even count them and may not have input them separately, or could expect them to have been typed in a different order. So yes, entering: , or , or , or should all result in keeping only the letter C in the backing store. And with an IME supporting a Compose key this will also be true: , or , or , or Canonical equivalence should be respected in visual editing modes. Deleting only the "last" encoded diacritic should only be done in specific non-visual editing modes (with "visible controls"), and it is not expected that most users will like this editing mode. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From chris.fynn at gmail.com Sat Mar 22 09:34:06 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Sat, 22 Mar 2014 20:34:06 +0600 Subject: Details, please (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net> References: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net> Message-ID: On 18/03/2014, Doug Ewell wrote: > I think what some of us would like to see are detailed examples, citing > specific characters and combinations, rather than general rhetoric, to > support claims like this: Yes From jf at colson.eu Sat Mar 22 10:54:53 2014 From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=) Date: Sat, 22 Mar 2014 16:54:53 +0100 Subject: Details, please (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: References: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net> Message-ID: <532DB24D.50803@colson.eu> Le 22/03/14 15:34, Christopher Fynn a ?crit : > On 18/03/2014, Doug Ewell wrote: >> I think what some of us would like to see are detailed examples, citing >> specific characters and combinations, rather than general rhetoric, to >> support claims like this: > Yes +1 From richard.wordingham at ntlworld.com Sat Mar 22 14:50:56 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 22 Mar 2014 19:50:56 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> Message-ID: <20140322195056.3e6d7c61@JRWUBU2> On Sat, 22 Mar 2014 09:41:40 +0100 Philippe Verdy wrote: > 2014-03-22 1:04 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > So if you enter or , you get > in the editor's backing store some encoding form (which my be > precombined or not, or with 
diacritics not necessarily in the > normalized form, and all these 4 possible encodings are canonically > equivalent): they if you press Backspace, the effect should also not > depend on whever you just entered these keystroke or if you loaded > the text and clicked after the sequence before pressing backspace: > How can you predict which character to remove ? If I entered those three characters in NFD order, I would expect to remove the ACUTE. I would be annoyed to find the string reduced to just C, and am annoyed to find it completely deleted. I do not find consistently poor service to be better than frequently poor service. > The relationale would be true as well for Hebrew points (most of them > use distinct non-zero compbining classes when they are used in > sequences). > But it won't apply to "diacritics" (combining characters or joiner > controls like CGJ, ZWK and ZWNJ, and possibly even some oher format > controls) that have combining class 0 because their encoding order is > significant to you know where to stop the effect of Backspace. Your approach recommends input methods that separate combining marks of different combining classes by CGJ for easier editing! > I see absolutely no reason why Backspace would arbitrarily delete > only the last encoded character when users canno even count them and > may not have input them separately. or could expect them to have be > typed in a different order. > > So yes, entering: > , or > , or > , or > > should all result in keeping only the letter C in the backing store. > And with a IME supporint Compose key this will also be true; > , or > , or > , or > Your input methods suggest that there is something unitary about the result - which makes sense if their output is U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE. Would you make the same arguments if 'C' were replaced with 'S'? There is no character LATIN CAPITAL LETTER S WITH CEDILLA AND ACUTE.
It will be distinctly unpleasant and unnatural with an input method that allows separate input of all three characters - C, COMBINING CEDILLA and COMBINING ACUTE - one by one. Your suggestion that typing THAI CHARACTER RO RUA, THAI CHARACTER SARA UU, THAI CHARACTER MAI THO, BACKSPACE should result in just THAI CHARACTER RO RUA is unlikely to be welcome to Thais. I believe our sharply opposing opinions arise because of different views of the clusters. You are seeing characters that are composed of multiple elements. I am seeing groups of characters that, in general, happen not to be arranged in a line of constant direction. > Canonical equivalence should be respected in visual editing modes. > Deleting only the "last" encoding diacritic should only be done in > specific non-visual editing modes (with "visible controls") and it is > not expected that most users will like this editing mode. For users who know what characters should be there, it makes a lot of sense to enter a non-visual editing mode - ideally of limited scope - when editing a previously typed cluster. Richard. From verdy_p at wanadoo.fr Sat Mar 22 17:37:49 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 22 Mar 2014 23:37:49 +0100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140322195056.3e6d7c61@JRWUBU2> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> <20140322195056.3e6d7c61@JRWUBU2> Message-ID: 2014-03-22 20:50 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > > But it won't apply to "diacritics" (combining characters or joiner > > controls like CGJ, ZWK and ZWNJ, and possibly even some oher format > > controls) that have combining class 0 because their encoding order is > > significant to you know where to stop the effect of Backspace. 
> > Your approach recommends input methods that separate combining > marks of different combining classes by CGJ for easier editing! > NO. I certainly do not recommend it ! This is a false assertion. > I see absolutely no reason why Backspace would arbitrarily delete > > only the last encoded character when users canno even count them and > > may not have input them separately. or could expect them to have be > > typed in a different order. > > > > So yes, entering: > > , or > > , or > > , or > > > > should all result in keeping only the letter C in the backing store. > > > And with a IME supporint Compose key this will also be true; > > > , or > > , or > > , or > > > > Your input methods suggest that there is something unitary about the > result - which makes sense if their output is U+1E08 LATIN CAPITAL > LETTER C WITH CEDILLA AND ACUTE. Would you make the same arguments if > 'C' were replaced with 'S'? There is no character LATIN CAPITAL > LETTER S WITH CEDILLA AND ACUTE. I have NOT said that there existed such character (look at the separating commas). This is a false interpretation. > > It will be distinctly unpleasant and unnatural with an input method > that allows separate input of all three characters - C, > COMBINING CEDILLA and COMBINING ACUTE - one by one. Your suggestion > that typing THAI CHARACTER RO RUA, THAI CHARACTER SARA UU, THAI > CHARACTER MAI THO, BACKSPACE should result in just THAI CHARACTER RO RUA > is unlikely to be welcome to Thais. > > I believe our sharply opposing opinions arise because of different > views of the clusters. You are seeing characters that are composed of > multiple elements. I am seeing groups of characters that, in general, > happen not to be arranged in a line of constant direction. This is a pragmatic consideration, that canonical equivalence should also be respected even when editing texts. 
The same key should produce canonically equivalent results when applied at the same logical position in texts that are canonically equivalent. > Canonical equivalence should be respected in visual editing modes. > > Deleting only the "last" encoding diacritic should only be done in > > specific non-visual editing modes (with "visible controls") and it is > > not expected that most users will like this editing mode. > > For users who know what characters should be there, it makes a lot of > sense to enter a non-visual editing mode - ideally of limited scope > - when editing a previously typed cluster. > As long as the IME (or keyboard driver) has not transmitted the characters to the edited document, it may record the sequence of keystrokes used. But clicking anywhere in the document, or pressing any cursor movement key, will reset the IME to its initial state. If an advanced IME is used to allow editing the content of a cluster before the cursor position, it will require a specific dialog to decompose the characters and render the cluster in the IME as a sequence of characters rendered in isolation (in "view controls mode"). Most text editors do not support such a separate IME panel, and in fact users do not like seeing these IME popups appearing on top of the edited text. They want to be able to input text directly in the WYSIWYG window. The IME panel is an advanced edit mode which requires specific support in the application (and an integration similar to the panels used by spell checkers). IME popups also cause severe difficulties for accessibility, due to the separation of the previewed text and the edited text in the panel, also because it is difficult to navigate in these popups with the keyboard, and also because the popup obscures the rest of the text (complicating rereading).
And on small screens below 5 inches (like smartphones), it is really difficult to fit the IME panel and make it easy to use with fingers, while still allowing a complete sentence to be read, without greatly reducing the size of touchable buttons and the font sizes, and making the text very difficult to read. That's why so many people over the age of 40 really hate composing any text on smartphones and prefer larger tablets: their smartphone is used only to view short texts. This is a problem of vision (presbyopia) and of finger size: the screen is too small to fit an IME editor except a TS9 one with 12 keys, used only to compose very short messages such as SMS or Facebook status. And for this usage, people do not care much about composing advanced diacritics; they will compose only the basic letters and will even drop correct punctuation except space, and they won't care about capitalisation if the spell checker of the TS9 editor does not guess it.

From richard.wordingham at ntlworld.com Sat Mar 22 19:16:44 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 23 Mar 2014 00:16:44 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To:
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> <20140322195056.3e6d7c61@JRWUBU2>
Message-ID: <20140323001644.5dc25bd6@JRWUBU2>

On Sat, 22 Mar 2014 23:37:49 +0100
Philippe Verdy wrote:

> 2014-03-22 20:50 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
> > > But it won't apply to "diacritics" (combining characters or joiner
> > > controls like CGJ, ZWK and ZWNJ, and possibly even some oher
> > > format
> > > controls) that have combining class 0 because their encoding
> > > order is significant to you know where to stop the effect of
> > > Backspace.
> > > > Your approach recommends input methods that separate combining > > marks of different combining classes by CGJ for easier editing! > > > > NO. I certainly do not recommend it ! This is a false assertion. If one takes your approach to handling input, then one needs CGJ to ease the correction of diacritics. I am not saying that you recommend the use of CGJ. > > I see absolutely no reason why Backspace would arbitrarily delete > > > only the last encoded character when users canno even count them > > > and may not have input them separately. or could expect them to > > > have be typed in a different order. > > > > > > So yes, entering: > > > , or > > > , or > > > , or > > > > > > should all result in keeping only the letter C in the backing > > > store. > > > > > And with a IME supporint Compose key this will also be true; > > > > > , or > > > , or > > > , or > > > > > > > Your input methods suggest that there is something unitary about the > > result - which makes sense if their output is U+1E08 LATIN CAPITAL > > LETTER C WITH CEDILLA AND ACUTE. Would you make the same arguments > > if 'C' were replaced with 'S'? There is no character LATIN CAPITAL > > LETTER S WITH CEDILLA AND ACUTE. > > I have NOT said that there existed such character (look at the > separating commas). I looked at the names. Dead keys are effectively modifiers applied beforehand rather than simultaneously, so there is no more reason for the dead key sequences to generate more than one character than there is for an ordinary key to generate multiple characters. The use of 'COMPOSE' indicates that one is not simply entering a sequence of characters. 'COMPOSE, C, CEDILLA, ACUTE' should mean an input process different to simply 'C, COMBINING CEDILLA, COMBINING ACUTE'. > This is a pragmatic consideration, that canonical equivalence should > also be respected even when editing texts. 
The same key should produce > canonically equivalent text when editing at the same logical position > texts that are canonincally equivalent. That raises an interesting question. Which positions in the string (ccc = 0, 103, 107) are logically the same positions as which positions in the canonically equivalent string ? Are you saying that some positions are not 'logical'? I for one would prefer to be able to access any position within the string. It is a shame there has been so little uptake of the SIL Graphite split cursor approach, which attempted to address the issue of editing clusters. As to pragmatics, we are discussing editing with feedback. If we have full feedback, we do not need canonical equivalence to be respected. > If > an advanced IME is used to allow editing the content of a cluster > before the cursor position, it will require a specific dialog to > decompose the characters and render in the IME the cluster as a > sequence of characters rendered isolately in "view controls mode"). It is not a good idea to tamper with the normalisation in the first place. The sequence of characters used may say quite a bit about how the user thinks of the cluster. Pragmatically, normalisation may also degrade rendering - recall the efforts Microsoft went to to discourage the normalisation of Korean text! > Most text editors do not support such separate IME panel and in fact > users do not like seeing these IME popups appearing on top of the > edited text. They want to be able to inpute text diretly in the > WYSIWIG window. The IME panel is an advanced edit mode which requires > specific support in the application (and an integration similar to > the panels used by spell checkers). A separate IME panel is not the only approach. 
Another approach is to use a modified font in the region of the cluster so that it displays clusters suitably, and then renders the whole region in the WYSIWYG region according to the usual rules except that it applies the font modification in the relevant region. Richard. From verdy_p at wanadoo.fr Sat Mar 22 21:32:06 2014 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 23 Mar 2014 03:32:06 +0100 Subject: Editing Sinhala and Similar Scripts In-Reply-To: <20140323001644.5dc25bd6@JRWUBU2> References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> <20140322195056.3e6d7c61@JRWUBU2> <20140323001644.5dc25bd6@JRWUBU2> Message-ID: 2014-03-23 1:16 GMT+01:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sat, 22 Mar 2014 23:37:49 +0100 > Philippe Verdy wrote: > > > 2014-03-22 20:50 GMT+01:00 Richard Wordingham < > > richard.wordingham at ntlworld.com>: > > > > > > But it won't apply to "diacritics" (combining characters or joiner > > > > controls like CGJ, ZWK and ZWNJ, and possibly even some oher > > > > format > > > > controls) that have combining class 0 because their encoding > > > > order is significant to you know where to stop the effect of > > > > Backspace. > > > > > > Your approach recommends input methods that separate combining > > > marks of different combining classes by CGJ for easier editing! > > > > > > > NO. I certainly do not recommend it ! This is a false assertion. > > If one takes your approach to handling input, then one needs CGJ to ease > the correction of diacritics. I am not saying that you recommend the > use of CGJ. > > > > I see absolutely no reason why Backspace would arbitrarily delete > > > > only the last encoded character when users canno even count them > > > > and may not have input them separately. or could expect them to > > > > have be typed in a different order. 
> > > > > > > > So yes, entering: > > > > , or > > > > , or > > > > , or > > > > > > > > should all result in keeping only the letter C in the backing > > > > store. > > > > > > > And with a IME supporint Compose key this will also be true; > > > > > > > , or > > > > , or > > > > , or > > > > > > > > > > Your input methods suggest that there is something unitary about the > > > result - which makes sense if their output is U+1E08 LATIN CAPITAL > > > LETTER C WITH CEDILLA AND ACUTE. Would you make the same arguments > > > if 'C' were replaced with 'S'? There is no character LATIN CAPITAL > > > LETTER S WITH CEDILLA AND ACUTE. > > > > I have NOT said that there existed such character (look at the > > separating commas). > > I looked at the names. Dead keys are effectively modifiers applied > beforehand rather than simultaneously, so there is no more reason for > the dead key sequences to generate more than one character than there > is for an ordinary key to generate multiple characters. > > The use of 'COMPOSE' indicates that one is not simply entering a > sequence of characters. 'COMPOSE, C, CEDILLA, ACUTE' should mean > an input process different to simply 'C, COMBINING CEDILLA, COMBINING > ACUTE'. > Here again you reinterpret what I did not say. When U used DEADKEY or COMPOSE, I was evidently refering to keystrokes, not characters. So I did not imply any encoding of characters (I was clear enough to say that these sequences of keystrokes was allowed to generate any canonically equivalent encoding), so instrad I described the input (on keyboard or IME) and the expected output (an encoded text that should be canonically equivalent). I have NOWHERE intended to force the use of CGJ (you seem to imply that these keys will generate separate combining diacritics/joiners, one or two, for each key... 
This is wrong: the IME or keyboard driver handles the state of keystrokes. Whether you use a COMPOSE key or a DEAD KEY does not matter, and so it won't feed the encoded text with streams of characters as long as the state is not complete enough. In fact this input with a compose key does not work:
COMPOSE, C, CEDILLA, ACUTE
simply because the composed sequence is already terminated after the cedilla modifier key. So when you type the acute modifier key it will not be associated. That's another reason why dead keys work: the state is not complete as long as you have not *finally* input the base letter. But let's suppose that the driver must generate something; then for the ACUTE key it would need to output the combining character, possibly with a preceding CGJ if the intent is to have the acute accent ordered relative to the cedilla (this is very unusual). In most usages, by far, diacritics never need any preceding CGJ to preserve their relative ordering: it is almost never the case for diacritics that have distinct non-zero combining classes. The rare cases occur however in classical pointed Hebrew. For this reason the keyboard driver will likely include a separate key mapping for the CGJ, either
- as a base key entered after the diacritic dead key, to force the output of CGJ+diacritic characters; or
- as a sequence with COMPOSE+diacritic key, without any key for the intermediate base letter, to produce the same output.
In the first case (driver with dead keys), you need a single keyboard mapping for the CGJ working as a dead key. In the second case (driver with compose key), you use the COMPOSE key mapping only, but you still need to map positions for the second base key (in the 3-key compose sequence) meant to represent diacritics. The effect of Backspace entered just after it would be to delete the CGJ and the diacritic simultaneously. It does not need to depend on the input state of the driver or the IME.
In all cases, nothing in the keyboard mapping or IME will generate a CGJ character in isolation; it will always be followed by something. But what would happen if you typed the compose sequence generating CGJ with COMPOSE where you forgot to press the initial base letter, or typed COMPOSE after the base letter? C, COMPOSE, ACUTE you get the characters you cannot type another CEDILLA after it without pressing COMPOSE again before it, to get . The result is clearly abusing the use of CGJ when the output should just be canonically equivalent to (i.e. without any CGJ at all) Your system would be even less meaningful: it would break in most renderers and spell checkers. It would break in IDNA domain names. It would not match in plain text search unless the searches are tuned so that their collators discard the CGJs to look for fuzzy matches (fuzzy matches would also look for strings that are compatibility equivalent under NFKD, or could search at collation level 2, or at collation level 1 ignoring all diacritics and CGJ wherever they are). So compose keys cause more confusion to native users than dead keys, which are smarter as they can record more internal states and also allow arbitrary order of input for unordered diacritics (like acute plus cedilla: you can press their dead keys in any order, the IME or driver handles the case and generates them, preferably in canonical order with growing combining classes; the driver or IME also generates them in an input state where it also knows the base letter to output, it can precombine the diacritics and so it will output C WITH CEDILLA, followed by COMBINING ACUTE, as expected, and still without needing any CGJ).
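The role CGJ plays in this exchange can be checked with a short Python sketch, assuming the standard `unicodedata` module. The variable names are illustrative only. It shows that two marks with distinct non-zero combining classes are reordered by normalization, while an intervening U+034F COMBINING GRAPHEME JOINER (combining class 0) freezes the typed order.

```python
import unicodedata

ACUTE = "\u0301"    # COMBINING ACUTE ACCENT, combining class 230
CEDILLA = "\u0327"  # COMBINING CEDILLA, combining class 202
CGJ = "\u034f"      # COMBINING GRAPHEME JOINER, combining class 0

plain = "C" + ACUTE + CEDILLA           # marks typed in non-canonical order
guarded = "C" + ACUTE + CGJ + CEDILLA   # same marks separated by CGJ

# Without CGJ, NFD sorts the marks by combining class: cedilla before acute.
nfd_plain = unicodedata.normalize("NFD", plain)
# With CGJ, the class-0 character blocks reordering, so nothing moves.
nfd_guarded = unicodedata.normalize("NFD", guarded)

for label, text in (("plain", nfd_plain), ("guarded", nfd_guarded)):
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

A consequence is that the guarded string is not canonically equivalent to the plain one, which is why the two sides of the thread disagree about whether inserting CGJ is a reasonable workaround.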
From richard.wordingham at ntlworld.com Sun Mar 23 06:51:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 23 Mar 2014 11:51:09 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To:
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> <20140322195056.3e6d7c61@JRWUBU2> <20140323001644.5dc25bd6@JRWUBU2>
Message-ID: <20140323115109.54179d85@JRWUBU2>

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy wrote:

> 2014-03-23 1:16 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:

>> The use of 'COMPOSE' indicates that one is not simply entering a
>> sequence of characters. 'COMPOSE, C, CEDILLA, ACUTE' should mean
>> an input process different to simply 'C, COMBINING CEDILLA,
>> COMBINING ACUTE'.

> Here again you reinterpret what I did not say. When U used DEADKEY or
> COMPOSE, I was evidently refering to keystrokes, not characters.

That is how I understood it.

> So I
> did not imply any encoding of characters (I was clear enough to say
> that these sequences of keystrokes was allowed to generate any
> canonically equivalent encoding), so instrad I described the input
> (on keyboard or IME)

That is how I understood it.

> and the expected output (an encoded text that
> should be canonically equivalent).

I think you mean that you have only specified the generated character output up to equivalence. An actual implementation would have to choose one specific sequence, though there might conceivably be a mechanism to select this sequence.

> I have NOWHERE intended to force the use of CGJ (you seem to imply
> that these keys will generate separate combining diacritics/joiners,
> one or two, for each key...

The input method and the editing of backing store are generally done by separate processes. For IPA and Tai Tham input I have written my own input methods.
If I frequently had to use a process editing backing store as you recommend, I would be strongly tempted to write a variant that protected marks with non-zero combining class by inserting CGJ. > This is wrong, the IME or keyboard driver handles the state of > keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does > not matter, and so it won't feed the encoded text with streams of > characters as long as the state is not complete enough: This is certainly not true of Keyman for Linux (KMFL), and I don't believe it is true of Tavultesoft Keyman for Windows either. This does require that the input method have a way of cancelling previously provided input. Now, if you use a method with a COMPOSE key or a DEAD key, you are generally unlikely to get tentative entries. However, one could write an input method that simulated a dead key but actually generated an output for it so as to imitate a typewriter differently. > In fact this input with a compose key does not work: > COMPOSE, C, CEDILLA, ACUTE > simply because the composed sequence is areaddy terminated after the > cedilla modifier key. So when you would type the acute modifier key it > would not be associated. I would not be at all surprised to find that someone has it working. > That's another reson why dead keys are > working: the state is not complete as long as you have not *finally* > input the base letter. But let's suppose that the driver must > generate something, then for the ACUTE key it would need to output > the combining character, possibly with a preceding CGJ if the intent > is to have the acute accent ordered relatively with the cedilla (this > is very unusual). Another method would be to generate, one character at a time, the sequence . The NFD decomposition of U+1E08 is . The use of CGJ would apply to 'COMPOSE, C, ACUTE, CEDILLA', for which I would again expect to see the output U+1E08. 
> The effect of Backspace entered just after it would delete > simulatenously CGJ and the diacritic characters. It does not need to > depend on the input state of the driver or the IME. In all cases, > nothing in the keyboard mapping or IME will generate a CGJ character > isolately, ir will be always followed by something. If backspace is not modified by the input method - and Marc Durdin has suggested that the input method should sometimes modify it - its effect will depend on the process controlling the backing store, which in general will work with multiple input methods, even during the course of a single editing session. You might not write an input method that generates a single CGJ, but I do. Do you insist on a soft hyphen when writing 'Llangollen' so that it will collate after 'Llanberis' in Welsh? (I typed the place names in English; the names are spelt the same way in English and Welsh in hardcopy, though of course the letter counts differ.) > But what would happen if you would type the compose sequence > generating CGJ with COMPOSE where you forget to press the initial > base letter, or type COMPOSE after the base letter ? > C, COMPOSE, ACUTE > you get the characters you cannot type > another CEDILLA after it without pressing COMPOSE again before it, to > get . > The result is clearly abusing the use of CGJ when the input output > should just be canonically equivalent to > (i.e. without any CGJ at all) Lower case specimen? c???? (this was in NFD as I edited it) Actually, I would prefer to avoid the first, unnecessary CGJ. Lower case specimen: c??? (the was in NFD as I edited it) > Your system would be even less meaningful, it would break in most > renderers Some, not all. It renders fine in Firefox, though one can of course set up input forms so that not even Thai renders properly. > and spell checkers. Most of the stuff I currently write with two combining marks of non-zero ccc already fails with spell checkers. > It would break in IDNA domain names. 
No, it wouldn't. If you consult Table B.1 in http://tools.ietf.org/html/rfc3454#appendix-B 'Stringprep', you will see that CGJ is stripped out. For example, the URL http://www.c???? .com, using the first specimen above, successfully reached http://www.?.com/ when I used Firefox. > would not match in plain text search unless they are tuned so that > ther collators discard the CGJs to look for fuzzy matches (fuzzy > matches would also look for strings that are compatibility equivalent > under NFKD, or could search at collation levels 2, or at collation > level 1 ignoring all diacritics and CGJ wherever they are). Collation Level 3 searches would work for what I type. Level 2 can have a problem with diacritics frozen in the wrong order. > So compose keys cause more confusion to native users than dead keys > that are smarter as they can record more internal states and also > allow arbitrary order of input for unordered diacritics (like acute > plus cedilla : you can press their dead key in any order, the IME or > driver handles the case and generates them, preferably in canonical > order with growing combining classes; the drive or IME alos generates > them in an input state where it also knows the base letter to ouput, > it can precombine the diacritics and so it will output C WITH > CEDILLA, followed by COMBINING ACUTE, as expected, and still without > needing any CGJ). A better easy solution is for backspace just to delete the previous character, so the user will often find what he wants. There is then no need for the extra CGJ. Commands to step into a cluster would be helpful, but are more difficult. One thing that bothers me is that no-one has come forward with the conventions that an application must follow to work with Tavultesoft Keyman and its derivatives and imitations. Richard. 
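The point about Stringprep can be illustrated with Python's standard `stringprep` module, which exposes the RFC 3454 tables. This is a minimal sketch of the "map to nothing" step only, not full Nameprep or IDNA processing; the function name `map_to_nothing` and the specimen string are assumptions chosen for illustration, since the specimen in the message above did not survive archiving.

```python
import stringprep
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def map_to_nothing(label: str) -> str:
    """Drop every code point listed in RFC 3454 Table B.1."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))


# Hypothetical specimen: c + acute + CGJ + cedilla.
specimen = "c\u0301" + CGJ + "\u0327"

mapped = map_to_nothing(specimen)                  # CGJ is removed here
normalized = unicodedata.normalize("NFKC", mapped)  # then the marks compose

print([f"U+{ord(ch):04X}" for ch in mapped])
print([f"U+{ord(ch):04X}" for ch in normalized])
```

Once the CGJ has been mapped away, normalization is free to reorder and compose the marks, so the label ends up the same as if the CGJ had never been typed.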
From richard.wordingham at ntlworld.com Sun Mar 23 08:07:27 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 23 Mar 2014 13:07:27 +0000 Subject: Editing Sinhala and Similar Scripts In-Reply-To: References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net> <20140322000439.4af309a5@JRWUBU2> <20140322195056.3e6d7c61@JRWUBU2> <20140323001644.5dc25bd6@JRWUBU2> Message-ID: <20140323130727.32346f80@JRWUBU2> On Sun, 23 Mar 2014 03:32:06 +0100 Philippe Verdy wrote: > 2014-03-23 1:16 GMT+01:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: >> The use of 'COMPOSE' indicates that one is not simply entering a >> sequence of characters. 'COMPOSE, C, CEDILLA, ACUTE' should mean >> an input process different to simply 'C, COMBINING CEDILLA, >> COMBINING ACUTE'. > Here again you reinterpret what I did not say. When U used DEADKEY or > COMPOSE, I was evidently refering to keystrokes, not characters. That is how I understood it. > So I > did not imply any encoding of characters (I was clear enough to say > that these sequences of keystrokes was allowed to generate any > canonically equivalent encoding), so instrad I described the input > (on keyboard or IME) That is how I understood it. > and the expected output (an encoded text that > should be canonically equivalent). I think you mean that you have only specified the generated character output up to equivalence. An actual implementation would have to chose one specific sequence, though there might conceivably be a mechanism to select this sequence. > I have NOWHERE intended to force the use of CGJ (you seem to imply > that these keys will generate separate combining diacritics/joiners, > one or two, for each key... The input method and the editing of backing store are generally done by separate processes. For IPA and Tai Tham input I have written my own input methods. 
If I frequently had to use a process editing backing store as you recommend, I would be strongly tempted to write a variant that protected marks with non-zero combining class by inserting CGJ. > This is wrong, the IME or keyboard driver handles the state of > keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does > not matter, and so it won't feed the encoded text with streams of > characters as long as the state is not complete enough: This is certainly not true of Keyman for Linux (KMFL), and I don't believe it is true of Tavultesoft Keyman for Windows either. This does require that the input method have a way of cancelling previously provided input. Now, if you use a method with a COMPOSE key or a DEAD key, you are generally unlikely to get tentative entries. However, one could write an input method that simulated a dead key but actually generated an output for it so as to imitate a typewriter differently. > In fact this input with a compose key does not work: > COMPOSE, C, CEDILLA, ACUTE > simply because the composed sequence is areaddy terminated after the > cedilla modifier key. So when you would type the acute modifier key it > would not be associated. I would not be at all surprised to find that someone has it working. > That's another reson why dead keys are > working: the state is not complete as long as you have not *finally* > input the base letter. But let's suppose that the driver must > generate something, then for the ACUTE key it would need to output > the combining character, possibly with a preceding CGJ if the intent > is to have the acute accent ordered relatively with the cedilla (this > is very unusual). Another method would be to generate, one character at a time, the sequence . The NFD decomposition of U+1E08 is . The use of CGJ would apply to 'COMPOSE, C, ACUTE, CEDILLA', for which I would again expect to see the output U+1E08. 
> The effect of Backspace entered just after it would delete > simulatenously CGJ and the diacritic characters. It does not need to > depend on the input state of the driver or the IME. In all cases, > nothing in the keyboard mapping or IME will generate a CGJ character > isolately, ir will be always followed by something. If backspace is not modified by the input method - and Marc Durdin has suggested that the input method should sometimes modify it - its effect will depend on the process controlling the backing store, which in general will work with multiple input methods, even during the course of a single editing session. You might not write an input method that generates a single CGJ, but I do. Do you insist on a soft hyphen when writing 'Llangollen' so that it will collate after 'Llanberis' in Welsh? (I typed the place names in English; the names are spelt the same way in English and Welsh in hardcopy, though of course the letter counts differ.) > But what would happen if you would type the compose sequence > generating CGJ with COMPOSE where you forget to press the initial > base letter, or type COMPOSE after the base letter ? > C, COMPOSE, ACUTE > you get the characters you cannot type > another CEDILLA after it without pressing COMPOSE again before it, to > get . > The result is clearly abusing the use of CGJ when the input output > should just be canonically equivalent to > (i.e. without any CGJ at all) Lower case specimen? c???? (this was in NFD as I edited it) Actually, I would prefer to avoid the first, unnecessary CGJ. Lower case specimen: c??? (the was in NFD as I edited it) > Your system would be even less meaningful, it would break in most > renderers Some, not all. It renders fine in Firefox, though one can of course set up input forms so that not even Thai renders properly. > and spell checkers. Most of the stuff I currently write with two combining marks of non-zero ccc already fails with spell checkers. > It would break in IDNA domain names. 
No, it wouldn't. If you consult Table B.1 in http://tools.ietf.org/html/rfc3454#appendix-B 'Stringprep', you will see that CGJ is stripped out. For example, the URL http://www.c???? .com, using the first specimen above, successfully reached http://www.?.com/ when I used Firefox.

> would not match in plain text search unless they are tuned so that their collators discard the CGJs to look for fuzzy matches (fuzzy matches would also look for strings that are compatibility equivalent under NFKD, or could search at collation level 2, or at collation level 1 ignoring all diacritics and CGJ wherever they are).

Collation Level 3 searches would work for what I type. Level 2 can have a problem with diacritics frozen in the wrong order.

> So compose keys cause more confusion to native users than dead keys, which are smarter as they can record more internal states and also allow arbitrary order of input for unordered diacritics (like acute plus cedilla: you can press their dead keys in any order, the IME or driver handles the case and generates them, preferably in canonical order with growing combining classes; the driver or IME also generates them in an input state where it also knows the base letter to output, it can precombine the diacritics and so it will output C WITH CEDILLA, followed by COMBINING ACUTE, as expected, and still without needing any CGJ).

A better, easy solution is for backspace just to delete the previous character, so the user will often find what he wants. There is then no need for the extra CGJ. Commands to step into a cluster would be helpful, but are more difficult.

One thing that bothers me is that no-one has come forward with the conventions that an application must follow to work with Tavultesoft Keyman and its derivatives and imitations.

Richard.
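
The Table B.1 point in the message above can be illustrated with a minimal sketch. It assumes Python 3 and its standard `stringprep` module, which exposes the RFC 3454 tables. The label below is hypothetical (the original specimen's characters were lost in the archive), `strip_b1` is an illustrative helper, and only the "map to nothing" and NFKC steps are shown, not the full Nameprep profile.

```python
import stringprep
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def strip_b1(label):
    """Drop every character listed in RFC 3454 Table B.1 ("mapped to nothing")."""
    return "".join(ch for ch in label if not stringprep.in_table_b1(ch))


# Hypothetical label: c, CGJ, cedilla, CGJ, acute.
label = "c" + CGJ + "\u0327" + CGJ + "\u0301"

stripped = strip_b1(label)
# Stringprep profiles such as Nameprep then normalize with NFKC.
prepared = unicodedata.normalize("NFKC", stripped)

print(stringprep.in_table_b1(CGJ))
print([f"U+{ord(ch):04X}" for ch in prepared])
```

Under these assumptions the CGJs disappear before normalization, so the label ends up as the single precomposed letter, consistent with the report that the browser reached the host name without them.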
From marc at keyman.com Sun Mar 23 17:46:49 2014
From: marc at keyman.com (Marc Durdin)
Date: Sun, 23 Mar 2014 22:46:49 +0000
Subject: Editing Sinhala and Similar Scripts
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>

All the Keyman products -- on Windows, web, iOS and Android, as well as KMFL, which is a port of Keyman -- work on the principle of modifying the text buffer directly. There is no intermediate compose buffer. For Indic and western scripts this works pretty well; the compose buffer which is a feature of IMEs does not fit these scripts cleanly in my experience. It is often hard to know when a text entry is 'complete' for committing the compose buffer, and one effect is that the compose buffer tends to get very long, which makes accidental cancellation of input a common and frustrating issue.

The most obvious backspace intelligence I've seen in use is around handling NFC vs NFD text. It is confusing to the end user if backspace sometimes deletes a whole character + diacritic, and sometimes just the diacritic mark. For example, Vietnamese text has suffered from this issue with the varying composition schemes we've seen enforced by limited input methods.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham
Sent: Monday, 24 March 2014 12:07 AM
To: unicode at unicode.org
Subject: Re: Editing Sinhala and Similar Scripts

On Sun, 23 Mar 2014 03:32:06 +0100 Philippe Verdy wrote:

> This is wrong, the IME or keyboard driver handles the state of keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does not matter, and so it won't feed the encoded text with streams of characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't believe it is true of Tavultesoft Keyman for Windows either. This does require that the input method have a way of cancelling previously provided input.
Now, if you use a method with a COMPOSE key or a DEAD key, you are generally unlikely to get tentative entries. However, one could write an input method that simulated a dead key but actually generated an output for it, so as to imitate a typewriter differently.

> The effect of Backspace entered just after it would be to delete the CGJ and the diacritic characters simultaneously. It does not need to depend on the input state of the driver or the IME. In all cases, nothing in the keyboard mapping or IME will generate a CGJ character in isolation; it will always be followed by something.

If backspace is not modified by the input method - and Marc Durdin has suggested that the input method should sometimes modify it - its effect will depend on the process controlling the backing store, which in general will work with multiple input methods, even during the course of a single editing session.

You might not write an input method that generates a single CGJ, but I do. Do you insist on a soft hyphen when writing 'Llangollen' so that it will collate after 'Llanberis' in Welsh? (I typed the place names in English; the names are spelt the same way in English and Welsh in hardcopy, though of course the letter counts differ.)

From duerst at it.aoyama.ac.jp Mon Mar 24 04:54:22 2014
From: duerst at it.aoyama.ac.jp ("Martin J. Dürst")
Date: Mon, 24 Mar 2014 18:54:22 +0900
Subject: Editing Sinhala and Similar Scripts
In-Reply-To:
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
Message-ID: <533000CE.1050204@it.aoyama.ac.jp>

On 2014/03/20 02:28, Whistler, Ken wrote:

> It is really annoying, particularly to efficient typists, when a sequence of 4 keystrokes is *not* exactly undone by a sequence of 4 backspace strokes.
> When that occurs, the flow of text composition is suddenly interrupted by forcing the user out of "compose" mode and into a completely different "monitor and check what the state of the display is" mode that can be very annoying.

It is certainly very annoying to a typist who is used to one backspace stroke removing one original keystroke. But not all typists are used to this. If I e.g. type Japanese, then depending on the syllable, there are one or more keystrokes for each Kana character, and because I'm entering Kana and only my fingers type Romaji, removing one Kana per backspace stroke isn't necessarily less natural than a more straightforward correspondence.

Regards, Martin.

From richard.wordingham at ntlworld.com Mon Mar 24 17:06:02 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 24 Mar 2014 22:06:02 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>
Message-ID: <20140324220602.6b8d11f1@JRWUBU2>

On Sun, 23 Mar 2014 22:46:49 +0000 Marc Durdin wrote:

> All the Keyman products -- on Windows, web, iOS and Android, as well as KMFL, which is a port of Keyman, work on the principle of modifying the text buffer directly.

I had been going to remark that they couldn't do that directly, but further research showed that Text Services Framework and GTK+ both allow it to be done in fact as opposed to merely in principle. (Does this mean that the Keyman substitution rules follow the principle of canonical equivalence?)

> The most obvious backspace intelligence I've seen in use is around handling NFC vs NFD text. It is confusing to the end user if backspace sometimes deletes a whole character + diacritic, and sometimes just the diacritic mark.
> For example, Vietnamese text has suffered from this issue with the varying composition schemes we've seen enforced by limited input methods.

However, with Keyman and KMFL there are fallbacks for when the text buffer is not accessible, e.g. when using X and presumably when using a Windows program that does not use the Text Services Framework. Version 1.07 of the interface between ibus and KMFL shows that a backspace character generated by the input method (as opposed to simply passed on from the keyboard) is intended to delete exactly one character. It seems to me that at least where fallbacks are used, the backing store that KMFL wishes to delete must be in the state in which KMFL placed it - intervening normalisation will corrupt the input. Is there an explicit statement of this anywhere?

When using X, it is possible to tell a backspace generated by the 'Input Method' from one simply generated by the keyboard; the keycode is 0 in the former case but not the latter.

Richard.

From marc at keyman.com Mon Mar 24 17:37:59 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 24 Mar 2014 22:37:59 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140324220602.6b8d11f1@JRWUBU2>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local> <20140324220602.6b8d11f1@JRWUBU2>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD5046F@federation.tavultesoft.local>

Richard Wordingham wrote:
>
> On Sun, 23 Mar 2014 22:46:49 +0000
> Marc Durdin wrote:
>
> > All the Keyman products -- on Windows, web, iOS and Android, as well as KMFL, which is a port of Keyman, work on the principle of modifying the text buffer directly.
>
> I had been going to remark that they couldn't do that directly, but further research showed that Text Services Framework and GTK+ both allow it to be done in fact as opposed to merely in principle. (Does this mean that the Keyman substitution rules follow the principle of canonical equivalence?)
Currently, KMFL and Keyman do no normalization of text, treating it as a raw Unicode character stream -- catering for this is up to the input method.

Keyman works directly with the text store in the web, iOS and Android versions, and where possible on Windows -- which in practice means the very few applications that have enough support for Text Services Framework, including, for example, MS Word, SIL FLEx, and the RichEdit control. Otherwise it works in a fallback mode where it retains the last sequence of characters typed in a cache until an intervening event causes it to flush its cache. Not perfect, but covers the vast majority of cases without issue (as in, less than one support case per month...)

> It seems to me that at least where fallbacks are used, the backing store that KMFL wishes to delete must be in the state in which KMFL placed it - intervening normalisation will corrupt the input. Is there an explicit statement of this anywhere?

Yes, this is true for both Keyman and KMFL when working in fallback modes. In practice, it's rarely a problem.

Marc

From wjgo_10009 at btinternet.com Thu Mar 27 03:13:52 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 27 Mar 2014 08:13:52 +0000 (GMT)
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
Message-ID: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>

Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

Please consider my use of U+E001 in the following thread.

https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books

Essentially, can that effect be achieved without using a Private Use Area character?

William Overington

27 March 2014

From jkorpela at cs.tut.fi Thu Mar 27 03:42:20 2014
From: jkorpela at cs.tut.fi (Jukka K.
Korpela)
Date: Thu, 27 Mar 2014 10:42:20 +0200
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <5333E46C.702@cs.tut.fi>

2014-03-27 10:13, William_J_G Overington wrote:

> Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

It depends, among other things, on what you mean by "space". There's U+00A0 NO-BREAK SPACE, which surely isn't the same as U+0020 SPACE, but might be called a space. Programs can do different things to different characters.

> Please consider my use of U+E001 in the following thread.
>
> https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books

As far as I can see, the question is about indenting text in e-books. What I do in my e-books is a simple CSS setting, margin-left (or padding-left) with a suitable value. There are many other ways too.

Or you could even use a sequence of U+00A0 characters at the start of a line. There is no exact definition of what should happen, but in practice, HTML user agents, including e-book readers, treat U+00A0 as yet another graphic character, which just happens to have an empty glyph. Well, they may also seem to be honoring the non-breaking property, but this might be just incidental (they generally don't break before or after graphic characters except whitespace characters, and U+00A0 is by HTML definition not whitespace).

There are also other characters that can be called "spaces", such as U+2002 EN SPACE. But they have properties similar to the properties of U+0020 SPACE, so we can expect some programs to handle them the same way as SPACE, in some respect.

Sorry for this vagueness, but it reflects the vagueness of the question.
Yucca

From KalvesmakiJ at doaks.org Thu Mar 27 08:10:30 2014
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Thu, 27 Mar 2014 13:10:30 +0000
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <3281b40c-bf32-4bd2-b83f-0b46a811474d@unicode.org>
Message-ID:

William, try the U+2000..U+200A glyphs under General Punctuation--I think that's what you're looking for to manage precise widths of blank space. And many (most?) software routines do not treat these as part of the class of spacing characters (\s in regular expressions).

Best wishes,

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
1703 32nd St. NW
Washington, DC 20007
(202) 339-6435

On 3/27/14 4:13 AM, "William_J_G Overington" wrote:

>Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
>
>Please consider my use of U+E001 in the following thread.
>
>https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books
>
>Essentially, can that effect be achieved without using a Private Use Area character?
>
>William Overington
>
>27 March 2014

From sittipon at x10studio.com Thu Mar 27 03:14:33 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Thu, 27 Mar 2014 15:14:33 +0700
Subject: Pali in Thai Script
Message-ID:

Hi,

I am a volunteer programmer working for the Tipitaka Studies Foundation in Thailand. We are working on a new project about Pali in Thai script, with special emphasis on pronunciation. Pali here is written using everyday Thai characters with a couple of extra symbols, so most people read it out using their normal Thai sounds for all consonants (e.g. ? is read as "kha"
and not "ga"), which makes Pali as spoken in Thailand differ from that of people not trained in Thailand.

To ease this situation, we have created an orthography font (slightly modified from an existing Thai font) and use it internally. I have to admit that, currently, we are changing the glyphs from time to time, but we look forward to establishing the studies nationwide in the near future once everything is in place. I was wondering what the Unicode community's opinion is on these new characters.

Normal KO KAI, and KO KAI with black dot to make KO KAI non-aspirated:
https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png

Thai consonants with black dot for non-aspirated and white dot for aspirated:
https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png

These are all the characters we need besides the normal Thai characters. Is it possible for us to submit these new characters to Unicode once everything is in place? If so, should we encode the black dot and white dot as separate new symbols, or simply treat KO KAI with black dot as a new character? We are open to suggestions.

Thanks a lot everyone!

Sittipon

From jkorpela at cs.tut.fi Thu Mar 27 10:04:13 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Thu, 27 Mar 2014 17:04:13 +0200
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To:
References:
Message-ID: <53343DED.5040304@cs.tut.fi>

2014-03-27 15:10, Kalvesmaki, Joel wrote:

> William, try the U+2000..U+200A glyphs under General Punctuation--I think that's what you're looking for to manage precise widths of blank space.

That range contains some "fixed-width spaces", yes. Being "fixed-width" is rather relative here, though, and many fonts do not contain these characters. Rendering software could of course display them by just leaving suitable spacing, but that's not common.

The "fixed-width spaces"
are mostly just legacy characters, a holdover from old typography. They may have their uses, though, in contexts where they work and other spacing methods don't (for example, I recently noticed that they seem to be the only way to create a little spacing between an inline equation and a normal character in MS Word). But for the purposes of indenting text lines, I don't think they are useful. In almost all cases, there are better tools for indentation.

> And many (most?) software routines do not treat these as part of the class of spacing characters (\s in regular expressions).

Well, most regexp implementations are very Ascii-oriented: notations like \s, \w, \d, etc. match Ascii characters only.

Yucca

From addison at lab126.com Thu Mar 27 10:07:16 2014
From: addison at lab126.com (Phillips, Addison)
Date: Thu, 27 Mar 2014 15:07:16 +0000
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB517DF043C@ex10-mbx-36009.ant.amazon.com>

The thread on serif.com discusses formatting of poetry in a Kindle book. The problem is that the author would like to indent two lines. You don't want to do that by using a character that "looks like a space" yet isn't seen by the software to be a space. This would break features like dictionary lookup on the first word on each of those lines.

The actual solution is to style the text as indented. There are some guidelines on the KDP site.

http://www.amazon.com/gp/feature.html?docId=1000729511
https://kdp.amazon.com/help?topicId=A17W8UM0MMSQX6#para

One way to achieve the desired goal is to use the 'margin' and 'text-align' CSS styles.

Addison

Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.
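
How `\s` treats these characters depends on the regular-expression engine, so it is worth testing rather than assuming. As one data point, here is a minimal sketch for Python 3's standard `re` module, where `\s` on a `str` pattern matches Unicode whitespace unless the `re.ASCII` flag is given. The helper name `is_regex_space` and the choice of sample characters are illustrative only; other engines may well behave differently.

```python
import re
import unicodedata

# Sample characters from this thread: SPACE, NO-BREAK SPACE, EN SPACE,
# HAIR SPACE, ZERO WIDTH SPACE and the Private Use character U+E001.
candidates = ["\u0020", "\u00a0", "\u2002", "\u200a", "\u200b", "\ue001"]


def is_regex_space(ch, flags=0):
    """True if the single character ch is matched by \\s under the given flags."""
    return re.fullmatch(r"\s", ch, flags) is not None


for ch in candidates:
    name = unicodedata.name(ch, "<no name>")
    print(
        f"U+{ord(ch):04X} {name}: "
        f"unicode={is_regex_space(ch)} ascii={is_regex_space(ch, re.ASCII)}"
    )
```

In this engine the no-break and fixed-width spaces count as whitespace by default, while the zero-width space and the Private Use character do not, which matches the concern that a PUA "space" is invisible to software that looks for word boundaries.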
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington
> Sent: Thursday, March 27, 2014 1:14 AM
> To: unicode at unicode.org
> Cc: wjgo_10009 at btinternet.com
> Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
>
> Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
>
> Please consider my use of U+E001 in the following thread.
>
> https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books
>
> Essentially, can that effect be achieved without using a Private Use Area character?
>
> William Overington
>
> 27 March 2014

From KalvesmakiJ at doaks.org Thu Mar 27 10:37:12 2014
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Thu, 27 Mar 2014 15:37:12 +0000
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <8fa01889-fb05-499e-a34b-57bb975769fc@unicode.org>
Message-ID:

Points taken. I just note for the record that in academic publishing and scholarly editions these spacing characters are actively used, particularly in InDesign files and in diplomatic editions rendered in XML. The legacy lives.

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
1703 32nd St. NW
Washington, DC 20007
(202) 339-6435

>
>The "fixed-width spaces" are mostly just legacy characters, holdover from old typography. They may have their uses, though, in contexts where they work and other spacing methods don't (for example, I recently noticed that they seem to be the only way to create a little spacing between an inline equation and normal character in MS Word).
From budelberger.richard at wanadoo.fr Thu Mar 27 12:38:05 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 18:38:05 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To:
References:
Message-ID: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>

> Message of 27/03/14 15:43
> From: Sittipon Simasanti
> To: unicode at unicode.org
> Subject: Pali in Thai Script
>
> Hi,

Beuar,

> I am a volunteer programmer working for the Tipitaka Studies Foundation in Thailand. We are working on a new project about Pali in Thai script, with special emphasis on pronunciation. Pali here is written using everyday Thai characters with a couple of extra symbols, so most people read it out using their normal Thai sounds for all consonants (e.g. ? is read as "kha" and not "ga"), which makes Pali as spoken in Thailand differ from that of people not trained in Thailand.
>
> To ease this situation, we have created an orthography font (slightly modified from an existing Thai font) and use it internally. I have to admit that, currently, we are changing the glyphs from time to time, but we look forward to establishing the studies nationwide in the near future once everything is in place. I was wondering what the Unicode community's opinion is on these new characters.
>
> Normal KO KAI, and KO KAI with black dot to make KO KAI non-aspirated:
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png
>
> Thai consonants with black dot for non-aspirated and white dot for aspirated:
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png
>
> These are all the characters we need besides the normal Thai characters. Is it possible for us to submit these new characters to Unicode once everything is in place? If so, should we encode the black dot and white dot as separate new symbols, or simply treat KO KAI with black dot as a new character?
>
> We are open to suggestions.

Very interesting!
We already have "Garshuni", that is, basically, Arabic written in Syriac script (cf. http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other languages, such as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin (cf. http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque), Ancient Greek (cf. http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque), and even a kind of "reverse-Garshuni", that is Syriac in Modern Greek script (cf. http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque)! That's what George Kiraz called "garshunography" (cf. http://en.wikipedia.org/wiki/Garshuni).

And now, Pali. Not Thai in Pali script, but Pali in Thai script? Do you know how many languages are concerned by this "Paligarshunography"? Since how many centuries?

> Thanks a lot everyone!
>
> Sittipon

From budelberger.richard at wanadoo.fr Thu Mar 27 12:58:13 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 18:58:13 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To:
References:
Message-ID: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11>

> Message of 27/03/14 15:43
> From: Sittipon Simasanti
> To: unicode at unicode.org
> Subject: Pali in Thai Script
>
> Hi,

Beuar,

> I am a volunteer programmer working for the Tipitaka Studies Foundation in Thailand. We are working on a new project about Pali in Thai script, with special emphasis on pronunciation. Pali here is written using everyday Thai characters with a couple of extra symbols, so most people read it out using their normal Thai sounds for all consonants (e.g. ? is read as "kha" and not "ga"), which makes Pali as spoken in Thailand differ from that of people not trained in Thailand.
>
> To ease this situation, we have created an orthography font (slightly modified from an existing Thai font) and use it internally.
> I have to admit that, currently, we are changing the glyphs from time to time, but we look forward to establishing the studies nationwide in the near future once everything is in place. I was wondering what the Unicode community's opinion is on these new characters.
>
> Normal KO KAI, and KO KAI with black dot to make KO KAI non-aspirated:
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png
>
> Thai consonants with black dot for non-aspirated and white dot for aspirated:
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png
>
> These are all the characters we need besides the normal Thai characters. Is it possible for us to submit these new characters to Unicode once everything is in place? If so, should we encode the black dot and white dot as separate new symbols, or simply treat KO KAI with black dot as a new character?
>
> We are open to suggestions.

I'm afraid to say that since PHO SAMPHAO with white dot (for aspirated) (https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l. 5 c. 2) may be (badly) drawn with U+0E20 THAI CHARACTER PHO SAMPHAO and U+0325 COMBINING RING BELOW, "ภ̥", you have to use your Internal Font?

> Thanks a lot everyone!
>
> Sittipon

From richard.wordingham at ntlworld.com Thu Mar 27 13:36:18 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Mar 2014 18:36:18 +0000
Subject: Pali in Thai Script
In-Reply-To:
References:
Message-ID: <20140327183618.0bfd3a2b@JRWUBU2>

On Thu, 27 Mar 2014 15:14:33 +0700 Sittipon Simasanti wrote:

> Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated.
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png
>
> Thai consonants with Black dot for non-aspirated and White dot for aspirated.
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png

Those descriptions confused me - the black dot means 'voiced and not aspirated', and the white dot means 'voiced and aspirated'.
> These are all the characters we need beside the normal Thai characters. Is it possible for us to submit/add these new characters to unicode once everything is in place? If it is possible, should we separate them into a new symbol for black dot and white dot, or simply call KO KAI with black dot as a new character? We are open to suggestions.

If your scheme has sufficient success, each combination of base letter and diacritic may well be encoded as a separate letter because the position of the diacritic is not obvious. I presume we're looking at no more than about 12 new characters - DO CHADA WITH BLACK DOT is an obvious competitor to THO NANGMONTHO WITH BLACK DOT.

I'm disappointed you found that simply adding a black dot for the voiced consonants didn't work. If it had worked, then we might have argued that this was just a font variation.

Richard.

From richard.wordingham at ntlworld.com Thu Mar 27 14:12:39 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Mar 2014 19:12:39 +0000
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <20140327191239.4cd56d46@JRWUBU2>

On Thu, 27 Mar 2014 18:38:05 +0100 (CET) Richard BUDELBERGER wrote:

> And now, Pali. Not Thai in Pali script, but Pali in Thai script?

There is no Pali script as such, though sometimes Pali is written in a neighbour's script rather than one's own. What's more surprising is that Pali wasn't regularly written in the Thai script until Rama IV ordered the change. Instead, the Buddhist script in his domains was the Khom script (a variety of the Khmer script, with several unencoded characters for Thai) in the south and the Tai Tham script in the north.

Richard.
From chris.fynn at gmail.com Thu Mar 27 14:50:38 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 01:50:38 +0600
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID:

On 27/03/2014, Richard BUDELBERGER wrote:

> And now, Pali. Not Thai in Pali script, but Pali in Thai script?

There is no standard script for Pāḷi. It is often written in Devanagari, Sinhala, Myanmar, Thai, Lao, Khmer, Latin, and several other scripts.

I do think there is quite a need for a utility to convert Pāḷi written in any one of these scripts to any of the others.

- Chris

From ed.trager at gmail.com Thu Mar 27 15:08:29 2014
From: ed.trager at gmail.com (Ed Trager)
Date: Thu, 27 Mar 2014 16:08:29 -0400
Subject: Pali in Thai Script
In-Reply-To:
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID:

Hi, Chris,

Besides the scripts you mention, there is also Tai Tham as Richard mentioned.

In theory, writing a utility to convert Pali written in any of those scripts to any one of the other scripts should not be too difficult but ... :

* Modern phonetically-based Lao lacks some of the traditional letters that are still preserved in Thai and other scripts.

* At least as far as Tai Tham goes, it seems that Tai Tham spelling is not consistent with Central Thai spelling when it comes to Sanskrit and Pali-derived words ... I don't really know much about this -- just my own limited observations. Probably somebody else here like Richard Wordingham or Martin Hosken knows a lot more about this than I do ...

... so maybe in reality it is not so simple to do?

On Thu, Mar 27, 2014 at 3:50 PM, Christopher Fynn wrote:

> On 27/03/2014, Richard BUDELBERGER wrote:
>
> > And now, Pali. Not Thai in Pali script, but Pali in Thai script?
>
> There is no standard script for Pāḷi. It is often written in Devanagari, Sinhala, Myanmar, Thai, Lao, Khmer, Latin, and several other scripts.
>
> I do think there is quite a need for a utility to convert Pāḷi written in any one of these scripts to any of the others.
>
> - Chris

From chris.fynn at gmail.com Thu Mar 27 16:48:05 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 03:48:05 +0600
Subject: Pali in Thai Script
In-Reply-To:
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID:

On 28/03/2014, Ed Trager wrote:

> Hi, Chris,
> Besides the scripts you mention, there is also Tai Tham as Richard mentioned

Several un-encoded Mon and Shan scripts too - as well as other Indic scripts.

> In theory, writing a utility to convert Pali written in any of those scripts to any one of the other scripts should not be too difficult but ...
> * Modern phonetically-based Lao lacks some of the traditional letters that are still preserved in Thai and other scripts.

Are there old Lao characters (once) used for writing Pāḷi?

Even if there is not a 1 to 1 correspondence - as long as there is consistency in the way Pāḷi is written in each script - and you know you are dealing with Pāḷi and not another language written in that script, it should be possible.

> * At least as far as Tai Tham goes, it seems that Tai Tham spelling is not consistent with Central Thai spelling when it comes to Sanskrit and Pali-derived words ... I don't really know much about this -- just my own limited observations. Probably somebody else here like Richard Wordingham or Martin Hosken knows a lot more about this than I do ...
A problem might arise if scribal errors have crept in over the centuries and some of these misspellings have become accepted in one script or another. I think there is work going on to make a very carefully edited critical edition of the Pāḷi Canon - it would be useful to be able to convert and print this out in the scripts used in the different countries where Theravāda Buddhism is popular. > ... so maybe in reality it is not so simple to do? > > - Ed From rick at unicode.org Thu Mar 27 17:06:36 2014 From: rick at unicode.org (Rick McGowan) Date: Thu, 27 Mar 2014 15:06:36 -0700 Subject: Pali in Thai Script In-Reply-To: References: Message-ID: <5334A0EC.6050500@unicode.org> Hello, This is an interesting discussion so far... What is the current situation of Pali written in the Thai script? Is there a scholarly tradition already? Why are new symbols being used for this purpose in this project? Is it because nothing else exists at this time? Or some other reason? Has this never been done before? I'm trying to understand the particular scholarly need that will be addressed by this project, and to know why some other existing symbols are not, or cannot, be used for this purpose. It would help to get a sense of the project scope, and how it relates to previous and current Pali scholarship in Thailand. And what alternative solutions have been discussed and/or used by the project participants. 
(Also to be clear: I'm only asking these questions out of personal curiosity, not an official question on behalf of the UTC or anything like that.) Thanks, Rick On 3/27/2014 1:14 AM, Sittipon Simasanti wrote: > In order to ease this situation, we have created an orthography font (slightly modified from the existed Thai font) and used them internally. I have to admit that, currently, we are changing the glyphs from time to time. But, we are looking forward to establish the studies nationwide in the near future once everything is in place. From budelberger.richard at wanadoo.fr Thu Mar 27 17:21:17 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Thu, 27 Mar 2014 23:21:17 +0100 (CET) Subject: Pali in Thai Script Message-ID: <22776183.17522.1395958877729.JavaMail.www@wwinf1m14> > Message du 27/03/14 19:43 > De : Richard Wordingham > Copie ? : unicode at unicode.org > Objet : Re: Pali in Thai Script > > On Thu, 27 Mar 2014 15:14:33 +0700 > Sittipon Simasanti wrote: > > > Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated. > > https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png > > > > Thai consonants with Black dot for non-aspirated and White dot for > > aspirated. > > https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png > > Those descriptions confused me - the black dot means 'voiced and not > aspirated', and the white dot means 'voiced and aspirated'. https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png : I think that ???????????? means ?unaspirated? and ????????? ?aspirated?? (But yes, Sittipon Simasanti?s message is not very clear. See http://twitpic.com/dzk1o3 : do you understand it?? I, no.) (The True) Richard. Note?: Tipitaka Studies Foundation Internal Font uses U+0325 ?? combining ring below ? the ?voiceless? diacritic?: https://en.wikipedia.org/wiki/Voice_(phonetics) ? for (un)aspiration ? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From budelberger.richard at wanadoo.fr Thu Mar 27 17:31:08 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Thu, 27 Mar 2014 23:31:08 +0100 (CET) Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: <494994255.17641.1395959468087.JavaMail.www@wwinf1m14> > Message du 27/03/14 22:56 > De : Christopher Fynn > A : Ed Trager > Copie ? : Unicode List > Objet : Re: Pali in Thai Script > > On 28/03/2014, Ed Trager wrote: >> Besides the scripts you mention, there is also Tai Tham as Richard >> mentioned > > Several un-encoded Mon and Shan scripts too - as well as other Indic scripts. > >> In theory, writing a utility to convert Pali written in any of those >> scripts to any one of the other scripts should not be too difficult but ... > >> * Modern phonetically-based Lao lacks some of the traditional letters that >> are still preserved in Thai and other scripts. > > Are there old Lao characters (once) used for writing P??i? > > Even if there is not a 1 to 1 correspondence - as long as there is > consistency in the way P??i is written in each script - and you know > you are dealing with P??i and not another language written in that > script, it should be possible. What can I say with my experience from Garshuni, is that the rule is that there is no (strict) rules, and that the only consistency I saw in writing two related languages (Arabic and Syriac) is inconsistency. So, imagine with an Indic and Asiatic languages. From richard.wordingham at ntlworld.com Thu Mar 27 19:23:49 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Mar 2014 00:23:49 +0000 Subject: Pali in Thai Script In-Reply-To: <5334A0EC.6050500@unicode.org> References: <5334A0EC.6050500@unicode.org> Message-ID: <20140328002349.67a826f3@JRWUBU2> On Thu, 27 Mar 2014 15:06:36 -0700 Rick McGowan wrote: > What is the current situation of Pali written in the Thai script? 
Is > there a scholarly tradition already? There was a scholarly tradition of writing Pali in the Khom script (Bangkok, as successor to Ayutthaya) or Tai Tham script (north and northeast). For secular writing, there were versions of the Tai/Lao script, which had additional letters whose purpose was to retain the consonant distinctions made in the religious scripts. Rama IV (whether as Patriarch or later as king - I don't remember) instructed that religious writing be switched to the Thai script. He also promulgated a change in the writing system for Pali, whereby the two vowel killers, YAMAKKAN and THANTHAKHAT, were 'simplified' to a single diacritic, PHINTHU. There was thus in principle an immediate 'tradition' of writing Pali in the Thai script. Now, there is actually a problem in writing Pali in the Thai script. When a preposed vowel phonetically follows a consonant cluster in the middle of a word, does it precede or follow the first consonant of the cluster? There seems to be a lot of inconsistency, as Vinodh and I found out when trying to work out the rules so that he could transliterate his master copy of the Tipitaka into the Thai script. I was working from a Thai CD of the Tipitaka, and was quite startled by the internal inconsistency in the spelling in the Thai script. There are two other problems. The first is that the system of PHINTHU, implicit vowels, and NIKHAHIT for the anusvara is quite different from the way that Thai is actually written. For private recitation of Pali, a tradition has grown up of using MAI HANAKAT and SARA A, which are not used in traditional Pali spelling, to replace the implicit vowels, thus creating a Thai script writing system for Pali that is actually an alphabet rather than an abugida. The second problem is the 'great consonant shift' whereby old voicing contrasts were lost in much of East Asia, covering most Tai, Mon-Khmer and Chinese dialects. (The change is not complete - some areas have escaped the change.) 
Consequently, the more conservative Sinhalese and Burmese pronunciations are quite different to the Thai and Lao (and Mon and Khmer) pronunciations. The Thai and Lao pronunciations have replaced voiced stops by voiceless aspirates. > Why are new symbols being used for this purpose in this project? The idea of the new symbols is to restore the ancient pronunciation. Just as a Classical Latin pronunciation differs greatly from English legal Latin or Roman Catholic Church Latin, and is very different to how Latin loan words are pronounced in English, the modern Thai consonant sounds are very different to the ancient Pali sounds. > I'm trying to understand the particular scholarly need that will be > addressed by this project, and to know why some other existing > symbols are not, or cannot, be used for this purpose. The problem with the traditional symbols is that they are pronounced quite differently in Thai. พุทฺธ is /budd?a/ in the ancient pronunciation, but พุทธ is /pʰut(tʰa)/ in Thai pronunciation. (Thai doesn't use PHINTHU.) An analogy is that 'Caesar' is pronounced /siːzə/ in British English, but is approximated as /ka?sar/ in a Latin class in England. Apart from the possible examples of Pali and Sanskrit pronounced in the Indian way, most Thais are probably not accustomed to Thai letters being pronounced differently in different languages. Richard. From richard.wordingham at ntlworld.com Thu Mar 27 20:03:20 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Mar 2014 01:03:20 +0000 Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: <20140328010320.2fcf9c6c@JRWUBU2> On Thu, 27 Mar 2014 16:08:29 -0400 Ed Trager wrote: > Hi, Chris, > > Besides the scripts you mention, there is also Tai Tham as Richard > mentioned. > > In theory, writing a utility to convert Pali written in any of those > scripts to any one of the other scripts should not be too difficult > but ... 
: > > * Modern phonetically-based Lao lacks some of the traditional letters > that are still preserved in Thai and other scripts. > > * At least as far as Tai Tham goes, it seems that Tai Tham spelling > is not consistent with Central Thai spelling when it comes to > Sanskrit and Pali-derived words ... I don't really know much about > this -- just my own limited observations. Modern Siamese spelling is highly Sanskritised. It has also been simplified by the elimination of final 'geminate' clusters. There are also quite a few differences in the reflexes of P/S /a/ in closed syllables, and certainly the spelling of the Mae Fah Luang dictionary reflects vowel changes that Siamese spelling simply ignores. Having said that, some Tai Tham spelling has geminates where the evidence of other varieties of Pali is that there should not be geminates - what should etymologically be written is often written . I have seen remarks that the Pali of inland SE Asia is rather different from that of Sri Lanka. There are other issues, such as the merger of HIGH SA and HIGH CHA in some varieties, so that what should be the cluster actually appears to be . There is also the tendency of to be used for other labials. Richard. From sittipon at x10studio.com Thu Mar 27 20:47:17 2014 From: sittipon at x10studio.com (Sittipon Simasanti) Date: Fri, 28 Mar 2014 08:47:17 +0700 Subject: Pali in Thai Script In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: > > Very interesting ! we already have ?Garshuni?, that is, basically, Arabic written in Syriac script (cf. http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other > languages, as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin (cf. http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque), > Ancient Greek (cf. http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque)? 
and even a kind of ?reverse-Garshuni?, that is Syriac in > Modern Greek script (cf. http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque) !? That? what George Kiraz called > ?garshunography? (cf. http://en.wikipedia.org/wiki/Garshuni). > > And now, Pali. Not Thai in Pali script, but Pali in Thai script? > As far as I know, Pali doesn't have its own set of characters. It is often written using languages' characters where its users are familiar with. The important thing is its voice should be the same no matter what character set you are using. We already have Pali written using Thai script. This is not entirely a new one. Just a few changes to make Pali written in Thai sounds more like written in other languages. > Do you know how many languages are concerned by this ?Paligarshunography? ? Since ho many centuries ? I have no idea. But, should be a lot. Since our neighbors, Lao, Myanmar also have Pali written in their languages. And we also have Pali written in English alphabets in our database too. Sittipon From sittipon at x10studio.com Thu Mar 27 21:07:52 2014 From: sittipon at x10studio.com (Sittipon Simasanti) Date: Fri, 28 Mar 2014 09:07:52 +0700 Subject: Pali in Thai Script In-Reply-To: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11> References: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11> Message-ID: <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com> > > I?m afraid to say that since PHO SAMPHAO with White dot (for aspirated) ? https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l. 5 c. 2 ? may be (badly) drawn with > U+0E20 THAI CHARACTER PHO SAMPHAO and U+0325 COMBINING RING BELOW, ? ?? ?, you have to use your Internal Font? > Arr, thanks. We have considered to put them below and above as well. But, white dot above the consonants just look too much like SARA AM (U+0E33) and if we put them below black dot will look like PINTHU (U+0E3A). Both of them have already their functions in Thai language. 
So, it might be confusing rather than helping Pali in Thai script. If possible we would like to keep them in the same place. That's why we put them there. Sittipon. From mark at kli.org Thu Mar 27 21:28:42 2014 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 27 Mar 2014 22:28:42 -0400 Subject: Pali in Thai Script In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: <5334DE5A.9030405@kli.org> On 03/27/2014 01:38 PM, Richard BUDELBERGER wrote: > Very interesting ! we already have ?Garshuni?, that is, basically, Arabic written in Syriac script (cf. http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other > languages, as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin (cf. http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque), > Ancient Greek (cf. http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque)? and even a kind of ?reverse-Garshuni?, that is Syriac in > Modern Greek script (cf. http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque) !? That? what George Kiraz called > ?garshunography? (cf. http://en.wikipedia.org/wiki/Garshuni). > > And now, Pali. Not Thai in Pali script, but Pali in Thai script? > It's not at all uncommon. Consider Yiddish, which is essentially German written in Hebrew script. Or various Judeo-Arabics written in Hebrew, and the Talmud, which is Aramaic written in Hebrew letters (in pretty much every printing and MS I've heard of). 
~mark From budelberger.richard at wanadoo.fr Thu Mar 27 21:33:40 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Fri, 28 Mar 2014 03:33:40 +0100 (CET) Subject: Pali in Thai Script In-Reply-To: <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com> References: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11> <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com> Message-ID: <92226187.139.1395974020618.JavaMail.www@wwinf1m14> > Message du 28/03/14 03:08 > De : Sittipon Simasanti > A : Richard BUDELBERGER > Copie ? : unicode at unicode.org > Objet : Re: Pali in Thai Script > > > I?m afraid to say that since PHO SAMPHAO with White dot (for aspirated) ? https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l. 5 c. 2 ? may be (badly) drawn with > > U+0E20 THAI CHARACTER PHO SAMPHAO and U+0325 COMBINING RING BELOW, ? ?? ?, you have to use your Internal Font? > > > > Arr, thanks. We have considered to put them below and above as well. But, white dot above the consonants just look too much like SARA AM (U+0E33) and if we put them below > black dot will look like PINTHU (U+0E3A). Both of them have already their functions in Thai language. So, it might be confusing rather than helping Pali in Thai script. > > If possible we would like to keep them in the same place. That's why we put them there. The tip is to say that dots are not above or below the characters, but inside them. From sittipon at x10studio.com Thu Mar 27 21:34:51 2014 From: sittipon at x10studio.com (Sittipon Simasanti) Date: Fri, 28 Mar 2014 09:34:51 +0700 Subject: Pali in Thai Script In-Reply-To: <20140327183618.0bfd3a2b@JRWUBU2> References: <20140327183618.0bfd3a2b@JRWUBU2> Message-ID: > > Those descriptions confused me - the black dot means 'voiced and not > aspirated', and the white dot means 'voiced and aspirated'. > Sorry for that, you are right, the picture doesn't represent the entire table so it might be confusing. 
Here's the entire one: https://dl.dropboxusercontent.com/u/824603/unicode/glyph3.png Only red and green columns have extra black and white dots. The rest of them are normal Thai characters. > > If your scheme has sufficient success, each combination of base letter > and diacritic may well be encoded as a separate letter because the > position of the diacritic is not obvious. I presume we're looking at no > more than about 12 new characters - DO CHADA WITH BLACK DOT is an > obvious competitor to THO NANGMONTHO WITH BLACK DOT. > Yes, 10 characters. > I'm disappointed you found that simply adding a black dot for the > voiced consonants didn't work. If it had worked, then we might > have argued that this was just a font variation. Please, see the previous email. Thanks! Sittipon From budelberger.richard at wanadoo.fr Thu Mar 27 21:59:29 2014 From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER) Date: Fri, 28 Mar 2014 03:59:29 +0100 (CET) Subject: Pali in Thai Script In-Reply-To: <5334DE5A.9030405@kli.org> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <5334DE5A.9030405@kli.org> Message-ID: <1226242372.188.1395975570026.JavaMail.www@wwinf1m14> > Message du 28/03/14 03:34 > De : Mark E. Shoulson > A : unicode at unicode.org > Objet : Re: Pali in Thai Script > > It's not at all uncommon. Consider Yiddish, which is essentially German > written in Hebrew script. Or various Judeo-Arabics written in Hebrew, > and the Talmud, which is Aramaic written in Hebrew letters (in pretty > much every printing and MS I've heard of). (What you call ??Hebrew letters ? are Aramaic letters of the alphabet adopted by Hebrew in Vth?c. BC.) Or Byelorussian written in Latin script in a Polish way? (More than Ukrainian.) From mark at kli.org Thu Mar 27 22:15:50 2014 From: mark at kli.org (Mark E. 
Shoulson) Date: Thu, 27 Mar 2014 23:15:50 -0400 Subject: Pali in Thai Script In-Reply-To: <1226242372.188.1395975570026.JavaMail.www@wwinf1m14> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <5334DE5A.9030405@kli.org> <1226242372.188.1395975570026.JavaMail.www@wwinf1m14> Message-ID: <5334E966.4030700@kli.org> On 03/27/2014 10:59 PM, Richard BUDELBERGER wrote: >> Message du 28/03/14 03:34 >> De : Mark E. Shoulson >> A : unicode at unicode.org >> Objet : Re: Pali in Thai Script >> >> It's not at all uncommon. Consider Yiddish, which is essentially German >> written in Hebrew script. Or various Judeo-Arabics written in Hebrew, >> and the Talmud, which is Aramaic written in Hebrew letters (in pretty >> much every printing and MS I've heard of). > (What you call ? Hebrew letters ? are Aramaic letters of the alphabet adopted by Hebrew in Vth c. BC.) Of course. And the Samaritans still write both Hebrew and Aramaic as well using truly _Hebrew_ characters (ktav ivri, though of course developed by them through history), not the Aramaic-derived ones. But Aramaic is more associated with various Syriac alphabets. Still, I was reading Aramaic for a long time before I even knew there were Syriac alphabets that people wrote Aramaic in, and I still can't particularly read those. I think I've seen colloquial Arabic in Hebrew letters (aimed at teaching Hebrew-speakers, to be sure; maybe mostly to avoid having to teach a new alphabet). Someone once sent me a proposal for writing Esperanto in Hebrew letters (yes, Aramaic, of course: square Hebrew, ktav ashuri. What Unicode calls "HEBREW"), to what purpose I don't know (it was more or less the same as Yiddish writing). Sanskrit is also often seen in various scripts, I believe. I don't think it's unusual to find one language written in a script generally associated with another, especially if the first language doesn't have a well-established script for itself (not all the above are examples of that). 
~mark From theppitak at gmail.com Thu Mar 27 22:49:13 2014 From: theppitak at gmail.com (Theppitak Karoonboonyanan) Date: Fri, 28 Mar 2014 10:49:13 +0700 Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn wrote: > On 28/03/2014, Ed Trager wrote: > >> * Modern phonetically-based Lao lacks some of the traditional letters that >> are still preserved in Thai and other scripts. > > Are there old Lao characters (once) used for writing P??i? Historically no. But there was once an attempt to devise such characters by Lao Royal Institute before being dismissed by the communist revolution later. The writing principle was to use PHINTU in the same manner as Thai script, and the missing characters were borrowed from Tham script. See some sample text here: http://ic.pics.livejournal.com/saixelamphao/16569530/7323/7323_original.jpg ( Source: http://saixelamphao.livejournal.com/1326.html ) The upper part is written in Tham script, and the lower part is in the extended Lao script. The writing system was in use during 1932-1948. And some North-Eastern Thai scholars are trying to revive it at present. 
The full character chart, demonstrated by a font created by a Thai scholar (Facebook login is needed, sorry): http://www.facebook.com/photo.php?fbid=10201049297248857 Regards, -- Theppitak Karoonboonyanan http://linux.thai.net/~thep/ From chris.fynn at gmail.com Fri Mar 28 02:09:10 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Fri, 28 Mar 2014 13:09:10 +0600 Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: On 28/03/2014, Theppitak Karoonboonyanan wrote: > The full character chart, demonstrated by a font created by a Thai > scholar (Facebook login is needed, sorry): > > http://www.facebook.com/photo.php?fbid=10201049297248857 Even after logging into Facebook I only get the message: "This content is currently unavailable" "The page you requested cannot be displayed at the moment. It may be temporarily unavailable, the link you clicked on may have expired, or you may not have permission to view this page." - C From chris.fynn at gmail.com Fri Mar 28 02:29:30 2014 From: chris.fynn at gmail.com (Christopher Fynn) Date: Fri, 28 Mar 2014 13:29:30 +0600 Subject: Pali in Thai Script In-Reply-To: <5334E966.4030700@kli.org> References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <5334DE5A.9030405@kli.org> <1226242372.188.1395975570026.JavaMail.www@wwinf1m14> <5334E966.4030700@kli.org> Message-ID: Here the case is a little different as there is no particular script associated with Pāḷi. People in different Buddhist countries just use their own script for writing Pāḷi. A conversion utility, or a simple way of letting users choose the script in which Pāḷi is displayed, would be useful so that there would be no need to type the same texts in each script. Sanskrit is strongly associated with the Devanāgarī script - but it is sometimes written in nearly all of the widely used scripts of India and some others such as Tibetan and Latin. 
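[Editorial sketch] The conversion utility suggested above could start out roughly as follows. This is a minimal sketch, assuming the simple case in which Pali in Devanagari and Pali in Thai script (implicit-vowel convention with PHINTHU and NIKHAHIT) correspond character by character. The function name `deva_to_thai`, the `before_cluster` switch, and the tables are illustrative assumptions, not an existing tool. Independent vowels, digits, and punctuation are not handled and pass through unchanged. The one non-trivial step is that Thai writes SARA E and SARA O before the consonant; whether they go before a whole consonant cluster varies between editions, so that choice is left as a parameter.

```python
# Hypothetical Devanagari -> Thai mapping for Pali (illustrative subset only).
_DEVA_CONS = [
    0x0915, 0x0916, 0x0917, 0x0918, 0x0919,  # ka kha ga gha nga
    0x091A, 0x091B, 0x091C, 0x091D, 0x091E,  # ca cha ja jha nya
    0x091F, 0x0920, 0x0921, 0x0922, 0x0923,  # tta ttha dda ddha nna
    0x0924, 0x0925, 0x0926, 0x0927, 0x0928,  # ta tha da dha na
    0x092A, 0x092B, 0x092C, 0x092D, 0x092E,  # pa pha ba bha ma
    0x092F, 0x0930, 0x0932, 0x0935, 0x0938, 0x0939, 0x0933,  # ya ra la va sa ha lla
]
_THAI_CONS = [
    0x0E01, 0x0E02, 0x0E04, 0x0E06, 0x0E07,
    0x0E08, 0x0E09, 0x0E0A, 0x0E0C, 0x0E0D,
    0x0E0F, 0x0E10, 0x0E11, 0x0E12, 0x0E13,
    0x0E15, 0x0E16, 0x0E17, 0x0E18, 0x0E19,
    0x0E1B, 0x0E1C, 0x0E1E, 0x0E20, 0x0E21,
    0x0E22, 0x0E23, 0x0E25, 0x0E27, 0x0E2A, 0x0E2B, 0x0E2C,
]
assert len(_DEVA_CONS) == len(_THAI_CONS)

THAI_CONSONANTS = {chr(c) for c in _THAI_CONS}
PHINTHU = "\u0e3a"  # Thai vowel killer used in Pali spelling

# Characters that map one-to-one and stay in the same position.
SIMPLE = {chr(d): chr(t) for d, t in zip(_DEVA_CONS, _THAI_CONS)}
SIMPLE.update({
    "\u093e": "\u0e32",  # vowel sign AA -> SARA AA
    "\u093f": "\u0e34",  # vowel sign I  -> SARA I
    "\u0940": "\u0e35",  # vowel sign II -> SARA II
    "\u0941": "\u0e38",  # vowel sign U  -> SARA U
    "\u0942": "\u0e39",  # vowel sign UU -> SARA UU
    "\u094d": PHINTHU,   # VIRAMA        -> PHINTHU
    "\u0902": "\u0e4d",  # ANUSVARA      -> NIKHAHIT
})

# Vowels that follow the consonant in Devanagari but precede it in Thai.
PREPOSED = {
    "\u0947": "\u0e40",  # vowel sign E -> SARA E
    "\u094b": "\u0e42",  # vowel sign O -> SARA O
}


def deva_to_thai(text, before_cluster=True):
    """Convert Pali in Devanagari to Thai script, character by character.

    before_cluster=True puts a preposed vowel before the whole
    consonant+PHINTHU+consonant cluster; False puts it directly before
    the last consonant. Which is right depends on the edition (assumption).
    """
    out = []
    for ch in text:
        if ch in PREPOSED:
            pos = len(out) - 1
            if pos >= 0 and out[pos] in THAI_CONSONANTS:
                if before_cluster:
                    # Walk back over "consonant + PHINTHU" pairs.
                    while (pos >= 2 and out[pos - 1] == PHINTHU
                           and out[pos - 2] in THAI_CONSONANTS):
                        pos -= 2
                out.insert(pos, PREPOSED[ch])
            else:
                # No consonant to attach to: keep the vowel in place.
                out.append(PREPOSED[ch])
        else:
            # Unknown characters pass through unchanged.
            out.append(SIMPLE.get(ch, ch))
    return "".join(out)
```

With these tables, `deva_to_thai("बुद्ध")` gives พุทฺธ, and "द्वे" comes out as เทฺว or ทฺเว depending on `before_cluster`. A real utility would also need the reverse direction and per-script handling of the spelling inconsistencies discussed in this thread.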
- C From richard.wordingham at ntlworld.com Fri Mar 28 04:15:47 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Mar 2014 09:15:47 +0000 Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: <20140328091547.64b97a4f@JRWUBU2> On Fri, 28 Mar 2014 10:49:13 +0700 Theppitak Karoonboonyanan wrote: > On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn > wrote: > > On 28/03/2014, Ed Trager wrote: > > > >> * Modern phonetically-based Lao lacks some of the traditional > >> letters that are still preserved in Thai and other scripts. > > > > Are there old Lao characters (once) used for writing P??i? > > Historically no. But there was once an attempt to devise such > characters by Lao Royal Institute before being dismissed by the > communist revolution later. The writing principle was to use PHINTU > in the same manner as Thai script, and the missing characters were > borrowed from Tham script. An older form of the Lao script is called the Thai Noi script. That script has many of the characters needed. It has the characters, to give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA, DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to their rarity, as with the lack of Vocalic L. Richard. From duerst at it.aoyama.ac.jp Fri Mar 28 05:40:37 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 28 Mar 2014 19:40:37 +0900 Subject: Fwd: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala got great reception in Sri Lanka) In-Reply-To: <53266CBF.6060209@it.aoyama.ac.jp> References: <53266CBF.6060209@it.aoyama.ac.jp> Message-ID: <533551A5.6070800@it.aoyama.ac.jp> I got informed today by your IT Dept. that the mail below never went out. Resent herewith. Martin. -------- Original Message -------- Subject: Updated Japanese Legacy Standard? 
(was: Re: Romanized Singhala got great reception in Sri Lanka) Date: Mon, 17 Mar 2014 12:32:15 +0900 From: "Martin J. D?rst" On 2014/03/16 14:36, Philippe Verdy wrote: > You may still want to promote it at some government or education > institution, in order to promote it as a national standard, except that > there's little change it will ever happen when all countries in ISO have > stopoed working on standardization of new 8-bit encodings (only a few ones > are maintained; but these are the most complex ones used in China and Japan. > > Well in fact only Japan now seens to be actively updating its legacy JIS > standard; but only with the focus of converging it to use the UCS and solve > ambiguities or solve some technical problems (e.g. with emojis used by > mobile phone operators). Even China stopped updating its national standard > by publishing a final mapping table to/from the full UCS (including for > characters still not encoded in the UCS): this simplified the work because > only one standard needs to be maintained instead of 2. I'm not aware of any activity in Japan regarding the update of legacy character encodings. Can you tell me what you mean by "actively updating"? Regards, Martin. From duerst at it.aoyama.ac.jp Fri Mar 28 05:41:55 2014 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 28 Mar 2014 19:41:55 +0900 Subject: Fwd: Re: Romanized Singhala got great reception in Sri Lanka In-Reply-To: <532689FC.7010606@it.aoyama.ac.jp> References: <532689FC.7010606@it.aoyama.ac.jp> Message-ID: <533551F3.2090409@it.aoyama.ac.jp> I got informed today by your IT Dept. that the mail below never went out. Resent herewith. Martin. -------- Original Message -------- Subject: Re: Romanized Singhala got great reception in Sri Lanka Date: Mon, 17 Mar 2014 14:37:00 +0900 From: "Martin J. 
Dürst" On 2014/03/17 13:16, Jean-François Colson wrote: > >> As for Japanese (and also for Indic) I have read the warnings in RFC >> 1815: >> http://tools.ietf.org/rfc/rfc1815.txt >> >> > > RFC 1815 Character Sets ISO-10646 and ISO-10646-J-1 July 1995 > > July 1995? Is that document up-to-date? No, it's not. Not at all. It was outdated when it was published, and expresses only the opinions of the author (who was well known for not liking, and not very well understanding, Unicode). It's labeled as "Informational", which means it is not in any way part of an IETF Standard/specification. Even April 1st RFCs are classified as "Informational". The "charset" label "ISO-10646-J-1" it defines is listed at http://www.iana.org/assignments/character-sets/character-sets.xhtml, but I don't think that there is any major conversion library that supports this. The same goes for what RFC 1815 labels as "ISO-10646", which appears as "ISO-10646-Unicode-Latin1" in the IANA registry (because simply using "ISO-10646" for this would be strongly misleading). Regards, Martin. From richard.wordingham at ntlworld.com Fri Mar 28 14:29:05 2014 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 28 Mar 2014 19:29:05 +0000 Subject: Pali in Thai Script In-Reply-To: <5334A0EC.6050500@unicode.org> References: <5334A0EC.6050500@unicode.org> Message-ID: <20140328192905.60600cb8@JRWUBU2> On Thu, 27 Mar 2014 15:06:36 -0700 Rick McGowan wrote: > I'm trying to understand the particular scholarly need that will be > addressed by this project, and to know why some other existing > symbols are not, or cannot, be used for this purpose. I didn't completely answer this question. There are existing symbols that would be adequate. I can see a font-based solution that might not violate the principle of character identity. For the five voiced consonants, one could use the encodings: /g/ (??) /?/ (??) /?/ (?) /d/ (?) /b/ (?) 
These would be unambiguous for Pali (in this convention) whatever the font used, and thus almost immediately ready for general use. (There may be problems with the rendering of U+0331 - isn't there a minority orthography that use it as a diacritic?) A special font could be used for didactic purposes to add the black and white circles to emphasise that the normal Thai pronunciation is not to be used. One could also do that with the conventional letters for Pali voiced stops, namely ?????, which to me would be a superior solution. Richard. From sittipon at x10studio.com Fri Mar 28 21:57:50 2014 From: sittipon at x10studio.com (Sittipon Simasanti) Date: Sat, 29 Mar 2014 09:57:50 +0700 Subject: Pali in Thai Script In-Reply-To: <20140328192905.60600cb8@JRWUBU2> References: <5334A0EC.6050500@unicode.org> <20140328192905.60600cb8@JRWUBU2> Message-ID: <2177D6DB-E215-4F8D-8D01-731054EFD763@x10studio.com> Thanks for pointing out. I will bring this to the team's attention today. Sittipon On Mar 29, 2557 BE, at 2:29 AM, Richard Wordingham wrote: > On Thu, 27 Mar 2014 15:06:36 -0700 > Rick McGowan wrote: > >> I'm trying to understand the particular scholarly need that will be >> addressed by this project, and to know why some other existing >> symbols are not, or cannot, be used for this purpose. > > I didn't completely answer this question. There are existing symbols > that would be adequate. > > I can see a font-based solution that might not violate the principle of > character identity. For the five voiced consonants, one could use the > encodings: > > /g/ (??) > /?/ (??) > /?/ (?) > /d/ (?) > /b/ (?) > > These would be unambiguous for Pali (in this convention) whatever the > font used, and thus almost immediately ready for general use. (There > may be problems with the rendering of U+0331 - isn't there a minority > orthography that use it as a diacritic?) 
A special font could be used > for didactic purposes to add the black and white circles to > emphasise that the normal Thai pronunciation is not to be used. > One could also do that with the conventional letters for Pali > voiced stops, namely ?????, which to me would be a superior > solution. > > Richard. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From theppitak at gmail.com Fri Mar 28 22:59:09 2014 From: theppitak at gmail.com (Theppitak Karoonboonyanan) Date: Sat, 29 Mar 2014 10:59:09 +0700 Subject: Pali in Thai Script In-Reply-To: References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> Message-ID: On Fri, Mar 28, 2014 at 2:09 PM, Christopher Fynn wrote: > On 28/03/2014, Theppitak Karoonboonyanan wrote: > >> The full character chart, demonstrated by a font created by a Thai >> scholar (Facebook login is needed, sorry): >> >> http://www.facebook.com/photo.php?fbid=10201049297248857 > > Even after logging into Facebook I only get the message: > "This content is currently unavailable" > "The page you requested cannot be displayed at the moment. It may be > temporarily unavailable, the link you clicked on may have expired, or > you may not have permission to view this page." Sorry again. It seems to be shared privately. And I don't think it's appropriate to share it here against the author's will, then. 
There is a better image here:

http://saixelamphao.livejournal.com/pics/catalog/493/8465

( Source: http://saixelamphao.livejournal.com/1620.html )

Regards,
--
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

From theppitak at gmail.com Fri Mar 28 23:10:52 2014
From: theppitak at gmail.com (Theppitak Karoonboonyanan)
Date: Sat, 29 Mar 2014 11:10:52 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140328091547.64b97a4f@JRWUBU2>
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140328091547.64b97a4f@JRWUBU2>
Message-ID:

On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham wrote:
> On Fri, 28 Mar 2014 10:49:13 +0700
> Theppitak Karoonboonyanan wrote:
>
>> On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn
>> wrote:
>> > On 28/03/2014, Ed Trager wrote:
>> >
>> >> * Modern phonetically-based Lao lacks some of the traditional
>> >> letters that are still preserved in Thai and other scripts.
>> >
>> > Are there old Lao characters (once) used for writing Pāḷi?
>>
>> Historically no. But there was once an attempt to devise such
>> characters by Lao Royal Institute before being dismissed by the
>> communist revolution later. The writing principle was to use PHINTU
>> in the same manner as Thai script, and the missing characters were
>> borrowed from Tham script.
>
> An older form of the Lao script is called the Thai Noi script. That
> script has many of the characters needed. It has the characters, to
> give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
> DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and
> Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to
> their rarity, as with the lack of Vocalic L.

I don't think so. From my studies so far, the writing system of the Tai Noi script (a.k.a. Lao Buhan) was not so different from that of the contemporary Lao script. Some characters are just obsolete.
In fact, I have been drafting a summarized proposal to encode the Tai Noi script here:

http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html

There is also a project to revive the script in North-Eastern Thailand, which may increase the need for contemporary usage on computers:

http://icmrpthailand.org/

The Tai Noi version uses a web font hack; it should be converted to Unicode instead once the script is supported:

http://icmrpthailand.org/is

Regards,
--
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

From sittipon at x10studio.com Sat Mar 29 00:17:12 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Sat, 29 Mar 2014 12:17:12 +0700
Subject: Pali in Thai Script
In-Reply-To:
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140328091547.64b97a4f@JRWUBU2>
Message-ID:

In my opinion (not my team's), a small underline like the one Richard suggested would be better for a wider Thai audience, since Tai Noi is very different from the modern Thai script we use nowadays. My aim is to make subtle changes to how we already write Pali in Thai. And if we had to change the language to cover all the pronunciations we need, I would recommend changing to the language everyone else uses to study Pali instead.

Best,
Sittipon

> On Mar 29, 2014, at 11:10 AM, Theppitak Karoonboonyanan wrote:
>
> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham
> wrote:
>> On Fri, 28 Mar 2014 10:49:13 +0700
>> Theppitak Karoonboonyanan wrote:
>>
>>> On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn
>>> wrote:
>>>> On 28/03/2014, Ed Trager wrote:
>>>>
>>>>> * Modern phonetically-based Lao lacks some of the traditional
>>>>> letters that are still preserved in Thai and other scripts.
>>>>
>>>> Are there old Lao characters (once) used for writing Pāḷi?
>>>
>>> Historically no. But there was once an attempt to devise such
>>> characters by Lao Royal Institute before being dismissed by the
>>> communist revolution later.
>>> The writing principle was to use PHINTU
>>> in the same manner as Thai script, and the missing characters were
>>> borrowed from Tham script.
>>
>> An older form of the Lao script is called the Thai Noi script. That
>> script has many of the characters needed. It has the characters, to
>> give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
>> DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and
>> Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to
>> their rarity, as with the lack of Vocalic L.
>
> I don't think so. From my studies so far, the writing system of the
> Tai Noi script (a.k.a. Lao Buhan) was not so different from that of
> the contemporary Lao script. Some characters are just obsolete.
>
> In fact, I have been drafting a summarized proposal to encode the
> Tai Noi script here:
>
> http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html
>
> There is also a project to revive the script in North-Eastern Thailand,
> which may increase the need for contemporary usage on computers:
>
> http://icmrpthailand.org/
>
> The Tai Noi version uses a web font hack; it should be converted
> to Unicode instead once the script is supported:
>
> http://icmrpthailand.org/is
>
> Regards,
> --
> Theppitak Karoonboonyanan
> http://linux.thai.net/~thep/

From asmusf at ix.netcom.com Sat Mar 29 06:01:43 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 29 Mar 2014 04:01:43 -0700
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <53343DED.5040304@cs.tut.fi>
References: <53343DED.5040304@cs.tut.fi>
Message-ID: <5336A817.1010205@ix.netcom.com>

On managing some types of spacing between elements in running text:

On 3/27/2014 8:04 AM, Jukka K.
Korpela wrote:
> 2014-03-27 15:10, Kalvesmaki, Joel wrote:
>
>> William, try the U+2000..U+200A glyphs under General Punctuation--I
>> think
>> that's what you're looking for to manage precise widths of blank space.
>
> That range contains some “fixed-width spaces”, yes. Being
> “fixed-width” is rather relative here, though, and many fonts do not
> contain these characters. Rendering software could of course display
> them by just leaving suitable spacing, but that’s not common.
>
> The “fixed-width spaces” are mostly just legacy characters, holdover
> from old typography. They may have their uses, though, in contexts
> where they work and other spacing methods don’t (for example, I
> recently noticed that they seem to be the only way to create a little
> spacing between an inline equation and normal character in MS Word).
>

They are useful when the object is to create fixed offsets between elements in running text. Unless these elements have a special nature that is widely recognized, there usually isn't any styling or markup available to create the same effect.

As noted ..

> But for the purposes of indenting text lines, I don’t think they are
> useful. In almost all cases, there are better tools for indentation.
>

.. they are usually not needed for indentation and they are also not normally used for justification -- it seems somewhat of an unsettled question whether they do or do not partake in expansion / contraction based on justification and similar adjustments to the width of the variable spaces.

It's the fact that indentation and justification do not need specific width for spaces that leads to the (incorrect) statement, oft repeated, that they are not needed in digital typography -- which is nonsense, of course, but unfortunately, by now, well-entrenched nonsense.

>> And many (most?) software routines do not treat these as part of the
>> class
>> of spacing characters (\s in regular expressions).
>
> Well, most regexp implementations are very ASCII-oriented: notations
> like \s, \w, \d, etc. match ASCII characters only.

Which is an entirely different issue.

A./

>
> Yucca

From richard.wordingham at ntlworld.com Sat Mar 29 17:35:59 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 29 Mar 2014 22:35:59 +0000
Subject: Unencoded Lao Characters
In-Reply-To:
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140328091547.64b97a4f@JRWUBU2>
Message-ID: <20140329223559.0965a007@JRWUBU2>

On Sat, 29 Mar 2014 11:10:52 +0700
Theppitak Karoonboonyanan wrote, under topic 'Pali in Thai Script':

> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham
> wrote:
> > An older form of the Lao script is called the Thai Noi script. That
> > script has many of the characters needed. It has the characters, to
> > give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
> > DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA
> > and Vocalic R. The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be
> > due to their rarity, as with the lack of Vocalic L.
>
> I don't think so. From my studies so far, Tai Noi script (aka. Lao
> Buhan) writing system was not so different from that of contemporary
> Lao script. Some characters are just obsolete.
>
> In fact, I have been drafting a summarized proposal to encode Tai Noi
> script here:
>
> http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html

That seems to be based on the analysis that the Tai Noi script is a form of the Lao script. In that case, it ought to address GHA, NYA, TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for example in the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf .
The Buddhist Institute 'additions' should also be handled. There are several fonts around that make presumptions about their encoding in Unicode. I'm not convinced that the old Tai Noi and Buddhist Institute forms of each of NYA and NNA are the same character - I suspect we may have four characters here. The two versions of NYA are particularly difficult to reconcile.

My thoughts on the subscript consonants are:

1) The Lao block already has two subscript consonants, U+0EBC LAO SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps the various forms of the latter need to be disunified. How does the latter's J-shaped glyph kern?

2) If we allow the Lao script to be split between planes, subscript forms could be accommodated in an 'Archaic Lao' block in the SMP. This would have the advantages that:

(a) In UTF-8, a subscript consonant would only take 4 bytes, whereas using a coeng in the BMP would require 6 bytes, 3 for the coeng and 3 for the consonant identity. The memory requirement is 4 bytes for both schemes in UTF-16.

(b) Distinct subscripts for the same letter can easily be encoded distinctly. For example, the Lao letters LO, DO and NO can easily be taken to have two distinct subscript forms, and in the related Thai Nithet script (?????????????), formerly used in Northern Thailand, one can argue for four forms of the cluster HO MO - the ligature HO MO (as LAO HO MO), and HO plus (i) a purely subscript MO (gc=Mn), (ii) subscript MO with an ascender (gc=Mc), and (iii) a borrowing of Tai Tham (gc=Mn if treated as a single character).

Richard.
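The byte counts in point (a) can be checked with a short Python sketch. The code points below are stand-ins chosen only for illustration, not part of any proposal: U+17D2 KHMER SIGN COENG and U+0E81 LAO LETTER KO represent a BMP coeng followed by a consonant, and U+1E000 is an arbitrary SMP code point standing in for a hypothetical 'Archaic Lao' subscript letter.

```python
# Stand-in code points (assumptions for illustration only).
BMP_COENG = "\u17D2"          # KHMER SIGN COENG, standing in for a BMP coeng
BMP_CONSONANT = "\u0E81"      # LAO LETTER KO
SMP_SUBSCRIPT = "\U0001E000"  # arbitrary SMP code point, hypothetical subscript


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Scheme 1: coeng + consonant, both in the BMP.
coeng_scheme = encoded_sizes(BMP_COENG + BMP_CONSONANT)
# Scheme 2: a single precomposed subscript character in the SMP.
smp_scheme = encoded_sizes(SMP_SUBSCRIPT)

print("coeng + consonant (BMP):", coeng_scheme)
print("single SMP character:   ", smp_scheme)
```

Under these assumptions the first pair comes out as 6 bytes in UTF-8 and 4 in UTF-16, and the second as 4 bytes in both, which matches the figures given in (a).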
From richard.wordingham at ntlworld.com Sun Mar 30 04:50:40 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 30 Mar 2014 10:50:40 +0100
Subject: Pali in Thai Script
In-Reply-To:
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <20140330105040.6eebbfdd@JRWUBU2>

On Sat, 29 Mar 2014 10:59:09 +0700
Theppitak Karoonboonyanan wrote:

> There is a better image here:
> http://saixelamphao.livejournal.com/pics/catalog/493/8465
>
> ( Source: http://saixelamphao.livejournal.com/1620.html )

Several letters in that image of the Buddhist Institute's system are in the wrong place, not to mention two correctly labelled vargas being in the wrong order. Properly labelled additions can be found at http://th.wikipedia.org/wiki/???????? . There are also additional characters for the two extra Sanskrit sibilants.

Richard.

From sittipon at x10studio.com Sun Mar 30 07:43:46 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Sun, 30 Mar 2014 19:43:46 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140330105040.6eebbfdd@JRWUBU2>
References: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11> <20140330105040.6eebbfdd@JRWUBU2>
Message-ID:

I would like to thank everyone for this lively and interesting discussion. Our project is still at a very early stage, and we are still changing the glyphs. After the last meeting, I found that I still have a lot to catch up on. But we are going to go with the Unicode Private Use Area for now. It is actually more than sufficient for what we need at the moment. I will keep you informed of any further development.

Thanks a lot, everyone!
Sittipon

On Sun, Mar 30, 2014 at 4:50 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Sat, 29 Mar 2014 10:59:09 +0700
> Theppitak Karoonboonyanan wrote:
>
> > There is a better image here:
> > http://saixelamphao.livejournal.com/pics/catalog/493/8465
> >
> > ( Source: http://saixelamphao.livejournal.com/1620.html )
>
> Several letters in that image of the Buddhist Institute's system are in
> the wrong place, not to mention two correctly labelled vargas being in
> the wrong order. Properly labelled additions can be found at
> http://th.wikipedia.org/wiki/???????? . There are also additional
> characters for the two extra Sanskrit sibilants.
>
> Richard.

--
Sittipon Simasanti
Extend Interactive Co.,Ltd.
668-6880-8490

From jkorpela at cs.tut.fi Mon Mar 31 09:05:42 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Mon, 31 Mar 2014 17:05:42 +0300
Subject: Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?
In-Reply-To: <5336A817.1010205@ix.netcom.com>
References: <53343DED.5040304@cs.tut.fi> <5336A817.1010205@ix.netcom.com>
Message-ID: <53397636.6060007@cs.tut.fi>

2014-03-29 13:01, Asmus Freytag wrote:

> On managing some types of spacing between elements in running text:
>
> On 3/27/2014 8:04 AM, Jukka K. Korpela wrote:
[…]
>> The “fixed-width spaces” are mostly just legacy characters, holdover
>> from old typography. They may have their uses, though, in contexts
>> where they work and other spacing methods don’t (for example, I
>> recently noticed that they seem to be the only way to create a little
>> spacing between an inline equation and normal character in MS Word).
>>
> They are useful when the object is to create fixed offsets between
> elements in running text.

In special cases, I would say. Normally, other tools are used. E.g., typesetting programs may have commands with, say, “thin space” in their name, but they don’t really insert THIN SPACE characters but some internal representation, and the effect (width of spacing) may be settable in the program, possibly with a default that differs from the description “a fifth of an em (or sometimes a sixth)”.

> Unless these elements have a special nature
> that is widely recognized, there usually isn't any styling or markup
> available to create the same effect.

For example, in HTML or XML, you can wrap either of the two elements in an inline element and set padding-right or padding-left on it. While this may look clumsier than using a character reference (such as &thinsp;) or THIN SPACE itself, it’s much more flexible: you can set any amount of spacing. Besides, quite often one of the elements is already an element in the markup, as in f(0), to take a typical example of a construct that really needs special spacing.

In word processors, you would typically select a character and set spacing on it in Font settings. This is clumsy, but using styles, it is reasonably manageable.

On the other hand, tuning of spacing is rather rare outside professional and ambitious typesetting. It’s really one of the things that distinguishes quality typesetting. Typesetters that do such things might be quite unaware of fixed-width spaces as characters (and might even regard it as odd to call spacing things characters).

> It's the fact that indentation and justification do not need specific
> width for spaces that leads to the (incorrect) statement, oft repeated,
> that they are not needed in digital typography -- which is nonsense, of
> course, but unfortunately, by now, well-entrenched nonsense.
I would rather say that the problem is in not understanding the importance of spacing, at a more refined level than just SPACE versus no space. When the problem has been understood, the solution is usually something other than fixed-width spaces.

Yucca

From mpsuzuki at hiroshima-u.ac.jp Mon Mar 31 19:28:26 2014
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Tue, 01 Apr 2014 09:28:26 +0900
Subject: Call for the experts of U+3013
Message-ID: <533A082A.7030808@hiroshima-u.ac.jp>

Dear all,

Today I submitted a preliminary proposal to standardize Variation Selectors for U+3013, the so-called "GETA" mark.

ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n4572.pdf

The geta mark was introduced from JIS X 0208:1990 and GB 2312-1980. When I checked original documents that include the geta mark, I found that some of the representative glyphs in these regional standards differ from the original geta mark. I investigated the theoretically possible visual shapes of the geta mark, and concluded that registry-based standardization of the geta mark is an option worth considering.

Unfortunately, officially printed material that includes the geta mark is not common (I found only a few books in the Japanese National Diet Library), so I would like to hear comments from geta experts before the official proposal.

Regards,
mpsuzuki