From chris.fynn at gmail.com  Sun Mar  2 02:45:04 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Sun, 2 Mar 2014 14:45:04 +0600
Subject: Websites in Hindi
In-Reply-To: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
Message-ID: <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>

I don't know about that particular Serif software which may have
limitations, but if a site is using Unicode UTF-8, there should be no
problem creating a website in Hindi

e.g.
http://www.bbc.co.uk/hindi/
https://hi.wikipedia.org/
http://tehelkahindi.com/
http://www.webdunia.com/


From billposer2 at gmail.com  Sun Mar  2 13:39:51 2014
From: billposer2 at gmail.com (Bill Poser)
Date: Sun, 2 Mar 2014 11:39:51 -0800
Subject: Websites in Hindi
In-Reply-To: <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
 <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
Message-ID: <CACPRsRR=t9tCe=+bRifD9eVWiZbcJ4wrq5250bCZd-AbKcghXA@mail.gmail.com>

In my experience the problem with Hindi web sites is that many of them used
encodings other than unique, frequently encodings designed for a particular
font. Some fonts did not use anything like a normal encoding. We
encountered a newspaper that used a font with 8,000-some glyphs each
representing a graphical piece of a Devanagari character or character
cluster.  I don't know to what extent the use of of such parochial fonts
and encodings persists. The sites I have seen using Unicode look fine.


On Sun, Mar 2, 2014 at 12:45 AM, Christopher Fynn <chris.fynn at gmail.com>wrote:

> I don't know about that particular Serif software which may have
> limitations, but if a site is using Unicode UTF-8, there should be no
> problem creating a website in Hindi
>
> e.g.
> http://www.bbc.co.uk/hindi/
> https://hi.wikipedia.org/
> http://tehelkahindi.com/
> http://www.webdunia.com/
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140302/9ef6683d/attachment.html>

From James_Lin at symantec.com  Mon Mar  3 12:14:22 2014
From: James_Lin at symantec.com (James Lin)
Date: Mon, 3 Mar 2014 10:14:22 -0800
Subject: Websites in Hindi
In-Reply-To: <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
 <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
Message-ID: <CF3A0794.32D06%james_lin@symantec.com>

another problem you may need to consider is the support of the glyph/fonts
on your system.  Not all fonts are supported/install by default when
installing the OS.

Warm Regards,
-James


On 3/2/14, 12:45 AM, "Christopher Fynn" <chris.fynn at gmail.com> wrote:

>I don't know about that particular Serif software which may have
>limitations, but if a site is using Unicode UTF-8, there should be no
>problem creating a website in Hindi
>
>e.g.
>http://www.bbc.co.uk/hindi/
>https://hi.wikipedia.org/
>http://tehelkahindi.com/
>http://www.webdunia.com/
>_______________________________________________
>Unicode mailing list
>Unicode at unicode.org
>http://unicode.org/mailman/listinfo/unicode


From neil at tonal.clara.co.uk  Mon Mar  3 14:21:42 2014
From: neil at tonal.clara.co.uk (Neil Harris)
Date: Mon, 03 Mar 2014 20:21:42 +0000
Subject: Websites in Hindi
In-Reply-To: <CF3A0794.32D06%james_lin@symantec.com>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
 <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
 <CF3A0794.32D06%james_lin@symantec.com>
Message-ID: <5314E456.2050509@tonal.clara.co.uk>

On 03/03/14 18:14, James Lin wrote:
> another problem you may need to consider is the support of the glyph/fonts
> on your system.  Not all fonts are supported/install by default when
> installing the OS.
>
> Warm Regards,
> -James
>
>

This is where webfonts should be extremely useful -- I believe recent 
versions of at least Firefox, and probably other modern browsers, should 
support both webfonts and text shaping for Indic scripts by default, 
whether or not the underlying platform has the correct fonts.

Neil


From petercon at microsoft.com  Mon Mar  3 16:36:32 2014
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 3 Mar 2014 22:36:32 +0000
Subject: Websites in Hindi
In-Reply-To: <5314E456.2050509@tonal.clara.co.uk>
References: <1393239647.87888.YahooMailNeo@web87805.mail.ir2.yahoo.com>
 <CAA_CYcLbUkckusv5tu4xfByc-0R3955a-QpL63aWOgXE5S4TqQ@mail.gmail.com>
 <CF3A0794.32D06%james_lin@symantec.com> <5314E456.2050509@tonal.clara.co.uk>
Message-ID: <dd6ed7165a1444ce81c25d51f145f23a@BL2PR03MB450.namprd03.prod.outlook.com>

Looking at the thread that William pointed at, the person asking for help gave no indication as to what problems he might have been encountering. Without specifics, the two obvious recommendations would be (i) encode the content using conformant UTF-8, and (ii) use conforming OpenType fonts leveraging CSS web font mechanisms.

Beyond that, that thread seemed not especially interesting.


Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Neil Harris
Sent: March 3, 2014 12:22 PM
To: James Lin; Christopher Fynn; William_J_G Overington
Cc: unicode at unicode.org
Subject: Re: Websites in Hindi

On 03/03/14 18:14, James Lin wrote:
> another problem you may need to consider is the support of the 
> glyph/fonts on your system.  Not all fonts are supported/install by 
> default when installing the OS.
>
> Warm Regards,
> -James
>
>

This is where webfonts should be extremely useful -- I believe recent versions of at least Firefox, and probably other modern browsers, should support both webfonts and text shaping for Indic scripts by default, whether or not the underlying platform has the correct fonts.

Neil

_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode


From mgunn at egt.ie  Wed Mar  5 06:10:51 2014
From: mgunn at egt.ie (Marion Gunn)
Date: Wed, 05 Mar 2014 12:10:51 +0000
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: <530F5A33.1010005@ix.netcom.com>
References: <CAH-HCWUE6c9rHEUtkuwcgVzRWGfx8zUBVvZkALJUGxa6CtbMEQ@mail.gmail.com>
 <530F5A33.1010005@ix.netcom.com>
Message-ID: <5317144B.1070201@egt.ie>

Twice as lovely when an immediately comprehensible term is used 
consistently (for example, our old faithful "multilingual"), be that 
term precise or no, rather than coin new terms without due reason, which 
could take years to become current amongst end users and yet more years 
to understand.
mg

Scr?obh 27/02/2014 15:30, Asmus Freytag:
> On 2/27/2014 2:32 AM, Shriramana Sharma wrote:
>> Given that Unicode encodes scripts and not languages, how appropriate 
>> is it to call the BMP and the SMP as the multi*lingual* planes?
>>
> Isn't it lovely how these things work?
>
> A./
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>


-- 
Marion Gunn * eGteo (Estab.1991)
27 P?irc an Fh?ithlinn, Baile an
Bh?thair, An Charraig Dhubh,
Co. ?tha Cliath, ?ire/Ireland.
* mgunn at egt.ie * eamonn at egt.ie *


From mail at robbertbroersma.nl  Thu Mar  6 14:54:18 2014
From: mail at robbertbroersma.nl (Robbert)
Date: Thu, 6 Mar 2014 21:54:18 +0100
Subject: HTTPS
Message-ID: <C18A4330-A977-4E79-9D77-FBEFB4F63B9A@robbertbroersma.nl>

Hi,

For tools that rely on the Unicode database it would be great if the databases were available over HTTPS as well:
https://www.unicode.org/Public/6.3.0/

In addition to this it would be helpful if the archive also contains SHA512 checksum files for each Unicode version to verify the integrity of databases that have already been downloaded (over HTTP), e.g.:

https://www.unicode.org/Public/6.3.0/SHA512SUMS

Mozilla already offers such checksums, although unfortunately not over HTTPS, but they can serve as an example.

http://releases.mozilla.org/pub/mozilla.org/firefox/releases/27.0/SHA512SUMS

I think this would improve the security of many libraries that directly and indirectly depend on Unicode.

Kind regards,
Robbert Broersma


From adam at nohejl.name  Sun Mar  9 07:39:20 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Sun, 9 Mar 2014 13:39:20 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
In-Reply-To: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
References: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
Message-ID: <CA7437A9-F540-4694-89DF-46802BBD5B9D@nohejl.name>

Hello again,

I would be really grateful for any reply or at least pointers to relevant information about this topic (stroke-order data in Unihan, see my previous message below).

Or is there any other appropriate place to discuss this?

Thank you,

-- 
Adam

On 2014/02/28, at 19:56, Adam Nohejl <adam at nohejl.name> wrote:
> 
> Hello,
> 
> I am comparing radical data for CJK characters from different sources, including the Unihan database. According to the Unihan documentation* the kRSUnicode radical should correspond to kRSKangXi radical, which in turn should be based on the Kang Xi dictionary.
> 
> Is there any explanation for the following discrepancies? Did I miss any other rules or reasoning behind the content of these two fields?
> 
> Examples of the discrepancies:
> 
> (1) A very common character for "most, maximum".
> U+6700	kRSKangXi	73.8
> U+6700	kRSUnicode	13.10
> 
> (2) A funny character for autumn containing the turtle component.
> U+9F9D	kRSKangXi	115.16
> U+9F9D	kRSKanWa	115.16
> U+9F9D	kRSUnicode	213.5
> 
> There are also characters that actually are not included in the Kang Xi dictionary**, but the Unihan data contain both a purported Kang Xi radical and in addition to that a _different_ Unicode radical.
> 
> (3) The simplified turtle character (commonly assigned to the traditional radical #213):
> U+4E80	kRSKangXi	213.0
> U+4E80	kRSUnicode	5.10
> 
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary decision, but unexpectedly the fields differ:
> U+66FB	kRSKangXi	72.7
> U+66FB	kRSUnicode	73.7
> 
> - - -
> 
> [*] <http://www.unicode.org/reports/tr38/tr38-8.html>: "Property: kRSUnicode // Description: (...) The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard."
> 
> [**] The two characters are missing from the '89 edition of Kang Xi (which should be the same as used for Unihan) according to search on this site: <http://ctext.org/dictionary.pl>


From leoboiko at namakajiri.net  Sun Mar  9 08:49:37 2014
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Sun, 9 Mar 2014 10:49:37 -0300
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
In-Reply-To: <CA7437A9-F540-4694-89DF-46802BBD5B9D@nohejl.name>
References: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
 <CA7437A9-F540-4694-89DF-46802BBD5B9D@nohejl.name>
Message-ID: <CAJ6uix4L23Rn9o=D0MrmaE3yPvZnyWuQV42JWU2VNFa-01BDUA@mail.gmail.com>

I don't know about the points you raise, but I wish it was easier to help
proofread Unihan data.  Back in 2012 I compared kKangXi to kIRGKangXI and
found 252 conflicts, besides the cases where a character only has one or
the other.  I even put together a simple tool to help fixing this, with
links to the relevant pages at the online Kang Xi[1].  I had no replies?

[1] http://namakajiri.net/misc/unihan_kangxi/compare_existing.html for
characters in Kang Xi, and for the others,
http://namakajiri.net/misc/unihan_kangxi/compare_nonexisting.html


2014-03-09 9:39 GMT-03:00 Adam Nohejl <adam at nohejl.name>:

> Hello again,
>
> I would be really grateful for any reply or at least pointers to relevant
> information about this topic (stroke-order data in Unihan, see my previous
> message below).
>
> Or is there any other appropriate place to discuss this?
>
> Thank you,
>
> --
> Adam
>
> On 2014/02/28, at 19:56, Adam Nohejl <adam at nohejl.name> wrote:
> >
> > Hello,
> >
> > I am comparing radical data for CJK characters from different sources,
> including the Unihan database. According to the Unihan documentation* the
> kRSUnicode radical should correspond to kRSKangXi radical, which in turn
> should be based on the Kang Xi dictionary.
> >
> > Is there any explanation for the following discrepancies? Did I miss any
> other rules or reasoning behind the content of these two fields?
> >
> > Examples of the discrepancies:
> >
> > (1) A very common character for "most, maximum".
> > U+6700        kRSKangXi       73.8
> > U+6700        kRSUnicode      13.10
> >
> > (2) A funny character for autumn containing the turtle component.
> > U+9F9D        kRSKangXi       115.16
> > U+9F9D        kRSKanWa        115.16
> > U+9F9D        kRSUnicode      213.5
> >
> > There are also characters that actually are not included in the Kang Xi
> dictionary**, but the Unihan data contain both a purported Kang Xi radical
> and in addition to that a _different_ Unicode radical.
> >
> > (3) The simplified turtle character (commonly assigned to the
> traditional radical #213):
> > U+4E80        kRSKangXi       213.0
> > U+4E80        kRSUnicode      5.10
> >
> > (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary
> decision, but unexpectedly the fields differ:
> > U+66FB        kRSKangXi       72.7
> > U+66FB        kRSUnicode      73.7
> >
> > - - -
> >
> > [*] <http://www.unicode.org/reports/tr38/tr38-8.html>: "Property:
> kRSUnicode // Description: (...) The first value is intended to reflect the
> same radical as the kRSKangXi field and the stroke count of the glyph used
> to print the character within the Unicode Standard."
> >
> > [**] The two characters are missing from the '89 edition of Kang Xi
> (which should be the same as used for Unihan) according to search on this
> site: <http://ctext.org/dictionary.pl>
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140309/0d716686/attachment.html>

From rscook at unicode.org  Mon Mar 10 10:39:19 2014
From: rscook at unicode.org (Richard COOK)
Date: Mon, 10 Mar 2014 08:39:19 -0700
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
In-Reply-To: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
References: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
Message-ID: <DB8B32C2-2D27-4E42-AEEB-E8DCABC39F2C@unicode.org>

Mr. Nohejl,

About the property data you mention below. kRSUnicode property data permits multiple/variant (space-delimited) radical/stroke values, and I think we will see important variants added in the future. Where a specific value attested in a specific Kangxi edition is missing from kRSUnicode, it would indeed be useful to add it, and perhaps to give it priority (move it to the front of the list). Likewise, if a common variant value is missing (even one not associated with Kangxi), it might be added for convenience. And if there are any outright errors, of course those should be identified and corrected (but clear errors are harder to find these days). 

Note that because kRSUnicode covers *all* Unihan CJK, even those characters not present in the original Kangxi, some of the radical/stroke values are so-called "virtual" assignments (those should be omitted from consideration, in proofing original KX data).

Several years ago we (at Wenlin.com) produced consolidated Kangxi data for our Zidian (Wenlin 4.X), taking these four properties (among other data) as input:

<http://www.unicode.org/reports/tr38/#kIRGKangXi>
<http://www.unicode.org/reports/tr38/#kKangXi>
<http://www.unicode.org/reports/tr38/#kRSKangXi> 
<http://www.unicode.org/reports/tr38/#kIRG_GSource>

The last of these may not have any obvious connection with Kangxi, until one reads the kIRG_GSource property description and sees this "sub-property" description:

"GKX Kangxi Dictionary ideographs (????) 9th edition (1958) including the addendum (????)??"

PRC researchers have done much work proofing G-Source Kangxi data, to address many aspects of the complex original text. 

The Kangxi work we did at Wenlin has several dimensions, and some of this has not yet rippled back into UCD.

We have in fact already identified many important omissions from kRSUnicode, which we plan to propose for a future data release. 

Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.

-Richard

PS: About the subject line of your message. Please note that despite the "CJK stroke order" subject line in your message, we are not talking about CJK stroke order here at all, but about Kangxi and UCS radical assignment, and residual stroke *count* data. Such data can indeed be used to "order" (collate) CJK data, but "stroke order" is a separate issue, involving the particular sequence of CJK Strokes (see The Unicode Standard, Appendix F) in the writing of a given character (stroke-order data can also be used for collation and indexing). Wenlin's CDL database (which inspired the CJK Stroke block, and also produced Appendix F) contains a comprehensive analysis of CJK Stroke order *and* Radical/Stroke data for all UCS CJK, primarily focused on PRC norms, but also including a great many variants (variants forms, variant stroke counts, and variant radical assignments).


On Feb 28, 2014, at 10:56 AM, Adam Nohejl wrote:

> 
> (1) A very common character for "most, maximum".
> ?[U+6700]	kRSKangXi	73.8
> ?[U+6700]	kRSUnicode	13.10
> 
> (2) A funny character for autumn containing the turtle component.
> ?[U+9F9D]	kRSKangXi	115.16
> ?[U+9F9D]	kRSKanWa	115.16
> ?[U+9F9D]	kRSUnicode	213.5
> 
> There are also characters that actually are not included in the Kang Xi dictionary**, but the Unihan data contain both a purported Kang Xi radical and in addition to that a _different_ Unicode radical.
> 
> (3) The simplified turtle character (commonly assigned to the traditional radical #213):
> ?[U+4E80]	kRSKangXi	213.0
> ?[U+4E80]	kRSUnicode	5.10
> 
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary decision, but unexpectedly the fields differ:
> ?[U+66FB]	kRSKangXi	72.7
> ?[U+66FB]	kRSUnicode	73.7


> Hello,
> 
> I am comparing radical data for CJK characters from different sources, including the Unihan database. According to the Unihan documentation* the kRSUnicode radical should correspond to kRSKangXi radical, which in turn should be based on the Kang Xi dictionary.
> 
> Is there any explanation for the following discrepancies? Did I miss any other rules or reasoning behind the content of these two fields?
> 
> Examples of the discrepancies:
> 
> (1) A very common character for "most, maximum".
> U+6700	kRSKangXi	73.8
> U+6700	kRSUnicode	13.10
> 
> (2) A funny character for autumn containing the turtle component.
> U+9F9D	kRSKangXi	115.16
> U+9F9D	kRSKanWa	115.16
> U+9F9D	kRSUnicode	213.5
> 
> There are also characters that actually are not included in the Kang Xi dictionary**, but the Unihan data contain both a purported Kang Xi radical and in addition to that a _different_ Unicode radical.
> 
> (3) The simplified turtle character (commonly assigned to the traditional radical #213):
> U+4E80	kRSKangXi	213.0
> U+4E80	kRSUnicode	5.10
> 
> (4) Character with the radical #72/73 at the top, i.e. IMHO an arbitrary decision, but unexpectedly the fields differ:
> U+66FB	kRSKangXi	72.7
> U+66FB	kRSUnicode	73.7
> 
> - - -
> 
> [*] <http://www.unicode.org/reports/tr38/tr38-8.html>: "Property: kRSUnicode // Description: (...) The first value is intended to reflect the same radical as the kRSKangXi field and the stroke count of the glyph used to print the character within the Unicode Standard."
> 
> [**] The two characters are missing from the '89 edition of Kang Xi (which should be the same as used for Unihan) according to search on this site: <http://ctext.org/dictionary.pl>
> 
> 
> -- 
> Adam Nohejl
> 
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From doppelbauer at gmx.net  Mon Mar 10 13:34:48 2014
From: doppelbauer at gmx.net (Markus Doppelbauer)
Date: Mon, 10 Mar 2014 19:34:48 +0100
Subject: Normalization test
Message-ID: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140310/0105103c/attachment.html>

From verdy_p at wanadoo.fr  Mon Mar 10 14:28:57 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 10 Mar 2014 20:28:57 +0100
Subject: Normalization test
In-Reply-To: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>
References: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>
Message-ID: <CAGa7JC0YgY5Vs7LXUdw6vqX8HsaWTP+zT2sh7Wa+5F_72JYxgw@mail.gmail.com>

toNFC(0061 0305 0315 0300 05AE 0062) ->

>From DerivedCombiningClass.txt<http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedCombiningClass.txt>:

  05D0..05EA    ; 0 # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV

In other words, 05EA with combining class 0 is blocking the
composition and any reordering between

  (0061 0305 0315 0300) on one side, and

  (0062) on the other side (which is also combining class 0).

So you will effectively get the composition of 0061 and 0305 (because
it is also no specifically excluded from composition in
CompositionExclusions.txt
<http://www.unicode.org/Public/UCD/latest/ucd/CompositionExclusions.txt>)
in:

  toNFC(0061 0305 0315 0300 05AE 0062),

but NOT in:

  toNFC(0061 05AE 0305 0315 0300 0062).

I think you have mixed the two separate test cases.


The first thing to check is to break sequences before every character with
combining class 0 (even if it is "combining", like here the Hebrew accent
zinor).

2014-03-10 19:34 GMT+01:00 Markus Doppelbauer <doppelbauer at gmx.net>:

> Hello,
>
> I am working on an Unicode Normalization implemenation. I have a question
> about a specific toNFC test rule.
>
>  toNFC(0061 0305 0315 0300 05AE 0062) =>
>      (0061 05AE 0305 0300 0315 0062)
> expected:
>      (0061 05AE 0305 0300 0315 0062)
>         \-------------/  =>
>      (00E0 05AE 0305      0315 0062)
>
> Why doesn't 0061 and 0300 combine to 00E0 ?
>
>  Thanks a lot
> Markus
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140310/a6731c58/attachment.html>

From verdy_p at wanadoo.fr  Mon Mar 10 14:32:00 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 10 Mar 2014 20:32:00 +0100
Subject: Normalization test
In-Reply-To: <CAGa7JC0YgY5Vs7LXUdw6vqX8HsaWTP+zT2sh7Wa+5F_72JYxgw@mail.gmail.com>
References: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>
 <CAGa7JC0YgY5Vs7LXUdw6vqX8HsaWTP+zT2sh7Wa+5F_72JYxgw@mail.gmail.com>
Message-ID: <CAGa7JC2Agoo5pAAsYQTh2jeMR6Cxupq4AmfPETFWbyFqAZRzFg@mail.gmail.com>

Sorry, I took the wrong line (because I typed 05EA instead of 05AE)

05AE          ; 228 # Mn       HEBREW ACCENT ZINOR


You're right, the combining class 228 does not block the composition.


2014-03-10 20:28 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> toNFC(0061 0305 0315 0300 05AE 0062) ->
>
> From DerivedCombiningClass.txt<http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedCombiningClass.txt>:
>
>   05D0..05EA    ; 0 # Lo  [27] HEBREW LETTER ALEF..HEBREW LETTER TAV
>
> In other words, 05EA with combining class 0 is blocking the composition and any reordering between
>
>   (0061 0305 0315 0300) on one side, and
>
>   (0062) on the other side (which is also combining class 0).
>
> So you will effectively get the composition of 0061 and 0305 (because it is also no specifically excluded from composition in CompositionExclusions.txt <http://www.unicode.org/Public/UCD/latest/ucd/CompositionExclusions.txt>) in:
>
>   toNFC(0061 0305 0315 0300 05AE 0062),
>
> but NOT in:
>
>   toNFC(0061 05AE 0305 0315 0300 0062).
>
> I think you have mixed the two separate test cases.
>
>
> The first thing to check is to break sequences before every character with
> combining class 0 (even if it is "combining", like here the Hebrew accent
> zinor).
>
> 2014-03-10 19:34 GMT+01:00 Markus Doppelbauer <doppelbauer at gmx.net>:
>
>>  Hello,
>>
>> I am working on an Unicode Normalization implemenation. I have a question
>> about a specific toNFC test rule.
>>
>>  toNFC(0061 0305 0315 0300 05AE 0062) =>
>>      (0061 05AE 0305 0300 0315 0062)
>> expected:
>>      (0061 05AE 0305 0300 0315 0062)
>>         \-------------/  =>
>>      (00E0 05AE 0305      0315 0062)
>>
>> Why doesn't 0061 and 0300 combine to 00E0 ?
>>
>>  Thanks a lot
>> Markus
>>
>>
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140310/604fc82b/attachment.html>

From markus.icu at gmail.com  Mon Mar 10 14:36:13 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 10 Mar 2014 12:36:13 -0700
Subject: Normalization test
In-Reply-To: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>
References: <trinity-c8750f41-55e1-4e52-8c3c-38d84141c965-1394476488112@3capp-gmx-bs47>
Message-ID: <CAN49p6qVO6zd1DkaSFLWzzKec-1cMG4Tbw8sKOeGgmRNiZj8gA@mail.gmail.com>

The U+0300 ( ? ) COMBINING GRAVE ACCENT is blocked by the U+0305 ( ? )
COMBINING OVERLINE which has the same ccc=230.

Could you use an existing library rather than roll your own?

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140310/9aff7d16/attachment.html>

From rscook at unicode.org  Mon Mar 10 18:44:50 2014
From: rscook at unicode.org (Richard COOK)
Date: Mon, 10 Mar 2014 16:44:50 -0700
Subject: ?MP = Multi*lingual* plane?
In-Reply-To: <B38CAF35-27DE-46BA-AE90-449D08D8FE0E@evertype.com>
References: <CAH-HCWUE6c9rHEUtkuwcgVzRWGfx8zUBVvZkALJUGxa6CtbMEQ@mail.gmail.com>
 <B38CAF35-27DE-46BA-AE90-449D08D8FE0E@evertype.com>
Message-ID: <F854E9ED-828C-4E3A-A574-CE67E16269CD@unicode.org>


On Feb 27, 2014, at 7:23 AM, Michael Everson wrote:

> On 27 Feb 2014, at 02:32, Shriramana Sharma <samjnaa at gmail.com> wrote:
> 
>> Given that Unicode encodes scripts and not languages, how appropriate is it to call the BMP and the SMP as the multi*lingual* planes?
> 
> You are more than two decades late in asking this.
> 
> It may have seemed more appropriate in an 8-bit code page world where rather small subsets limited the number of languages accessible by one or another part of ISO/IEC 8859. 
> 
> A new term like ?multiscriptal? would not have been appropriate. File this under ?We know the term ?ideograph? is a misnomer."

'When I use a word,' Humpty Dumpty said, in rather a scornful tone, 
'it means just what I choose it to mean ? neither more nor less.'

'The question is,' said Alice, 'whether you can make words mean so many different things.'

'The question is,' said Humpty Dumpty, 'which is to be master ? that's all.'

Alice was too much puzzled to say anything; so after a minute Humpty Dumpty began again.


> Michael Everson * http://www.evertype.com/
> 
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
> 


From doppelbauer at gmx.net  Tue Mar 11 10:50:35 2014
From: doppelbauer at gmx.net (Markus Doppelbauer)
Date: Tue, 11 Mar 2014 16:50:35 +0100
Subject: NFD -> NFC
Message-ID: <trinity-c35e1489-8905-4146-b96b-190e13ca12f9-1394553035636@3capp-gmx-bs47>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140311/cc0ed264/attachment.html>

From mark at macchiato.com  Tue Mar 11 11:19:06 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJU=?=)
Date: Tue, 11 Mar 2014 17:19:06 +0100
Subject: NFD -> NFC
In-Reply-To: <trinity-c35e1489-8905-4146-b96b-190e13ca12f9-1394553035636@3capp-gmx-bs47>
References: <trinity-c35e1489-8905-4146-b96b-190e13ca12f9-1394553035636@3capp-gmx-bs47>
Message-ID: <CAJ2xs_EXo6EJqEAmPCyd1myMLTAkZyQ0syqQxNf3AsFL8xSwJg@mail.gmail.com>

Not sure about your exact case, but ICU's normalization does handle those
characters.

http://unicode.org/cldr/utility/transform.jsp?a=nfc%3Bhex&b=%5Cu30B9%5Cu3099

(That tool uses ICU for NFC).


Mark <https://google.com/+MarkDavis>

*? Il meglio ? l?inimico del bene ?*


On Tue, Mar 11, 2014 at 4:50 PM, Markus Doppelbauer <doppelbauer at gmx.net>wrote:

> Hello,
>
> I have an other problem making the normalization process binary
> compatible with ICU.
>  Why does "30B9 3099" not combine to "30BA"?
>
> Steps to reproduce:
>  wget http://doppelbauer.name/katakana.txt
> uconv -f utf8 -t utf8 -x nfd <katakana.txt >ndf.txt
> uconv -f utf8 -t utf8 -x nfc <ndf.txt >nfc.txt
> diff katakana.txt nfc.txt
>
>  Expected result: "katakana.txt" == "nfc.txt"
>
> uconv v2.1  ICU 4.8.1.1
>
> Thanks a lot
> Markus
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140311/ddc2ffce/attachment.html>

From starback at stp.lingfil.uu.se  Tue Mar 11 11:35:47 2014
From: starback at stp.lingfil.uu.se (Per =?iso-8859-1?Q?Starb=E4ck?=)
Date: Tue, 11 Mar 2014 17:35:47 +0100
Subject: "(in 6429)" in allkeys.txt
Message-ID: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>

In the DUCET file allkeys.txt,
http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
there is "(in 6429)" as a comment for some characters.
I first didn't understand why, but then I realized those are control
characters that are part of ISO/EIC 6429.

Why is that pointed out explicitly in that context?

The reason I'm asking is that I was looking at the proposed new version
of this file, and was thinking about suggesting a short note in the
comments in the beginning of the file.


From markus.icu at gmail.com  Tue Mar 11 12:33:09 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 11 Mar 2014 10:33:09 -0700
Subject: NFD -> NFC
In-Reply-To: <CAJ2xs_EXo6EJqEAmPCyd1myMLTAkZyQ0syqQxNf3AsFL8xSwJg@mail.gmail.com>
References: <trinity-c35e1489-8905-4146-b96b-190e13ca12f9-1394553035636@3capp-gmx-bs47>
 <CAJ2xs_EXo6EJqEAmPCyd1myMLTAkZyQ0syqQxNf3AsFL8xSwJg@mail.gmail.com>
Message-ID: <CAN49p6rYFmCosXxaL0oXN6AtpdikuLpkOWDyVRMPMWeE-80YzQ@mail.gmail.com>

Here is the demo using ICU4C:

http://demo.icu-project.org/icu-bin/nbrowser?t=%5Cu30B9%5Cu3099&s=&uv=0

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140311/97d2cd8b/attachment.html>

From ken.whistler at sap.com  Tue Mar 11 13:57:23 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 11 Mar 2014 18:57:23 +0000
Subject: "(in 6429)" in allkeys.txt
In-Reply-To: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>

Per asked:

> In the DUCET file allkeys.txt,
> http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
> there is "(in 6429)" as a comment for some characters.
> I first didn't understand why, but then I realized those are control
> characters that are part of ISO/EIC 6429.
> 
> Why is that pointed out explicitly in that context?

1. To make it clear that those are not actually Unicode
character names, but names for control functions associated
with ISO 6429. (Note that this practices dates back a long
time now in the DUCET data files -- it predates the addition of
the ISO 6429 control function names as formal name
aliases in NameAliases.txt in the UCD.)

2. Because the same "names" which appear in the comments
in allkeys.txt for UCA also appear in comments for the CTT
in ISO 14651 (which is generated with the same tool). And
the "(in 6429)" notes were added there to forestall people
asking questions about these "names" that aren't "names".

3. And the reason they *continue* to appear in the comments
in both allkeys.txt and in the CTT for ISO 14651 is to preclude
people asking questions about why they would be removed. ;-)

> 
> The reason I'm asking is that I was looking at the proposed new version
> of this file, and was thinking about suggesting a short note in the
> comments in the beginning of the file.

My personal preference, rather than larding up the header
of a machine-generated file with more commentary, would be
a suggestion for further clarification in the text of UTS #10, if
necessary. After all, the allkeys.txt header already points to
UTS #10 for more information -- which anyone needs to understand
and use the data file, anyway.

--Ken


From starback at stp.lingfil.uu.se  Tue Mar 11 17:01:18 2014
From: starback at stp.lingfil.uu.se (Per =?iso-8859-1?Q?Starb=E4ck?=)
Date: Tue, 11 Mar 2014 23:01:18 +0100
Subject: "(in 6429)" in allkeys.txt
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 (Ken Whistler's message of "Tue, 11 Mar 2014 18:57:23 +0000")
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
Message-ID: <qzsiqocs41.fsf@numerus.lingfil.uu.se>

Ken Whistler answered my questions:

>> In the DUCET file allkeys.txt,
>> http://www.unicode.org/Public/UCA/latest/allkeys.txt ,
>> there is "(in 6429)" as a comment for some characters.
>> I first didn't understand why, but then I realized those are control
>> characters that are part of ISO/EIC 6429.
>> 
>> Why is that pointed out explicitly in that context?

Thanks for your answers! I feel enlightened.

>> The reason I'm asking is that I was looking at the proposed new version
>> of this file, and was thinking about suggesting a short note in the
>> comments in the beginning of the file.
>
> My personal preference, rather than larding up the header
> of a machine-generated file with more commentary, would be
> a suggestion for further clarification in the text of UTS #10, if
> necessary. After all, the allkeys.txt header already points to
> UTS #10 for more information -- which anyone needs to understand
> and use the data file, anyway.

I agree that a clarification in the text would be better than
a comment in allkeys.txt. But I also think just changing "(in 6429)"
to "(in ISO 6429)" would be enough.

(Strange as it might seem for list regulars not everyone immediately
makes the right association from this four-digit number. :-)

I think that would be a improvement, but I admit it's a rather small
one, and it can be hard to bother to fix small things unless it's
something you do when your fixing something nearby anyway.

This is somewhat besides the point, but since you say the file is
machine-generated I wonder about something I found in the draft version
http://www.unicode.org/Public/UCA/7.0.0/allkeys-7.0.0d5.txt
where a comment says

# Tertiary weight range:  0002..001F (30)

even though the highest used tertiary weight actually is 001E.
Isn't this comment automatically made?


From ken.whistler at sap.com  Tue Mar 11 17:34:20 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 11 Mar 2014 22:34:20 +0000
Subject: "(in 6429)" in allkeys.txt
In-Reply-To: <qzsiqocs41.fsf@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>


> I agree that a clarification in the text would be better than
> a comment in allkeys.txt. But I also think just changing "(in 6429)"
> to "(in ISO 6429)" would be enough.
> 
> (Strange as it might seem for list regulars not everyone immediately
> makes the right association from this four-digit number. :-)

Ah, I see what the interpretation problem was. Yes, that is
a straightforward kind of improvement -- easily enough done.
Look for a change the next time the file is updated. (It will not
be immediately changed, pending other review comments.)
 
> This is somewhat besides the point, but since you say the file is
> machine-generated I wonder about something I found in the draft version
> http://www.unicode.org/Public/UCA/7.0.0/allkeys-7.0.0d5.txt
> where a comment says
> 
> # Tertiary weight range:  0002..001F (30)
> 
> even though the highest used tertiary weight actually is 001E.
> Isn't this comment automatically made?

The ranges for primary and secondary weights change with every
new repertoire addition to the input, so they are always
calculated dynamically. By contrast, the tertiary weight range
is hard-coded in the generation, and never changes. If you look at:

http://www.unicode.org/reports/tr10/#Tertiary_Weight_Table

you can see all those pre-defined, fixed values. It is true that
0x001F is not actually assigned as a tertiary weight for any
particular character, but it is internally set aside as a MAX_TERTIARY
sentinel value, before the first secondary weight of 0x0020.
Note that the tertiary weight 0x0007 is not actually used in
the weighting, either (for historical reasons). At any rate,
the entire range 0x0002..0x001F is considered fixed and "used"
for tertiaries, so that is what is always displayed in the summary
printed at the top of allkeys.txt.

--Ken


From adam at nohejl.name  Wed Mar 12 04:59:35 2014
From: adam at nohejl.name (Adam Nohejl)
Date: Wed, 12 Mar 2014 10:59:35 +0100
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
In-Reply-To: <DB8B32C2-2D27-4E42-AEEB-E8DCABC39F2C@unicode.org>
References: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
 <DB8B32C2-2D27-4E42-AEEB-E8DCABC39F2C@unicode.org>
Message-ID: <A382198F-840A-44AF-8231-84A3B0C8FEDF@nohejl.name>

Mr. Cook,

Thank you for all the information (and bringin Wenlin to my attention as well).

> Where a specific value attested in a specific Kangxi edition is missing from kRSUnicode, it would indeed be useful to add it,

Great to hear that.

> Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.

OK, I will prepare a more comprehensive list. Do you mean that you would submit such a formal proposal? Or can I submit it myself somehow?

> PS: About the subject line of your message.

Yes, of course, the subject should have read "CJK radical-stroke count data". Not one of my brightest moments, I guess...


-- 
Adam Nohejl


From starback at stp.lingfil.uu.se  Wed Mar 12 07:32:15 2014
From: starback at stp.lingfil.uu.se (Per =?iso-8859-1?Q?Starb=E4ck?=)
Date: Wed, 12 Mar 2014 13:32:15 +0100
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 (Ken Whistler's message of "Tue, 11 Mar 2014 22:34:20 +0000")
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
Message-ID: <qzvbvja980.fsf_-_@numerus.lingfil.uu.se>

Ken Whistler wrote:
> Ah, I see what the interpretation problem was. Yes, that is
> a straightforward kind of improvement -- easily enough done.
> Look for a change the next time the file is updated. (It will not
> be immediately changed, pending other review comments.)

Thanks! Then I'll skip making a formal request about this.

Regarding these names in ISO 6429 again, how come these control
characters don't have Unicode names? For many uses of names, the control
characters have as much need for them as any other character.
Since it seems so straightforward it must have been suggested several
times to introduce names like

  CONTROL CHARACTER NULL
  CONTROL CHARACTER START OF HEADING
  CONTROL CHARACTER START OF TEXT

etc., so I assume there are good reasons for not doing that, but I can't
see what they are.

Since applications want names they will use other things as names when
there isn't a real name, and that leads to problems. Take Emacs where
the command describe-char currently describes U+0007 as

	  name: <control>
	  old-name: BELL

(I reported the misusage of "<control>" here as a name in 2009, but it
wasn't fixed until this year, so still not in a released version.)
The usage of "BELL" here invites confusion with U+1F514 BELL.

Emacs should do better regarding this, but still, with a proper name
all of this would have been averted.


From mark at macchiato.com  Wed Mar 12 08:11:22 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJU=?=)
Date: Wed, 12 Mar 2014 14:11:22 +0100
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <qzvbvja980.fsf_-_@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se>
Message-ID: <CAJ2xs_HLXw3K7c8Cjp6dfQPZ3z36LEQUhtD04Ryk455THFP+hg@mail.gmail.com>

They do have aliases in NameAliases.txt

0000;NULL;control

0000;NUL;abbreviation

0001;START OF HEADING;control

0001;SOH;abbreviation

0002;START OF TEXT;control

0002;STX;abbreviation

...


Mark <https://google.com/+MarkDavis>

 *? Il meglio ? l?inimico del bene ?*


On Wed, Mar 12, 2014 at 1:32 PM, Per Starb?ck <starback at stp.lingfil.uu.se>wrote:

> Ken Whistler wrote:
> > Ah, I see what the interpretation problem was. Yes, that is
> > a straightforward kind of improvement -- easily enough done.
> > Look for a change the next time the file is updated. (It will not
> > be immediately changed, pending other review comments.)
>
> Thanks! Then I'll skip making a formal request about this.
>
> Regarding these names in ISO 6429 again, how come these control
> characters don't have Unicode names? For many uses of names, the control
> characters have as much need for them as any other character.
> Since it seems so straightforward it must have been suggested several
> times to introduce names like
>
>   CONTROL CHARACTER NULL
>   CONTROL CHARACTER START OF HEADING
>   CONTROL CHARACTER START OF TEXT
>
> etc., so I assume there are good reasons for not doing that, but I can't
> see what they are.
>
> Since applications want names they will use other things as names when
> there isn't a real name, and that leads to problems. Take Emacs where
> the command describe-char currently describes U+0007 as
>
>           name: <control>
>           old-name: BELL
>
> (I reported the misusage of "<control>" here as a name in 2009, but it
> wasn't fixed until this year, so still not in a released version.)
> The usage of "BELL" here invites confusion with U+1F514 BELL.
>
> Emacs should do better regarding this, but still, with a proper name
> all of this would have been averted.
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140312/c9cf216a/attachment.html>

From eliz at gnu.org  Wed Mar 12 11:26:06 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 12 Mar 2014 18:26:06 +0200
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <qzvbvja980.fsf_-_@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se>
Message-ID: <83bnxbo02p.fsf@gnu.org>

> From: starback at stp.lingfil.uu.se (Per Starb?ck)
> Date: Wed, 12 Mar 2014 13:32:15 +0100
> Cc: "unicode at unicode.org" <unicode at unicode.org>
> 
> Regarding these names in ISO 6429 again, how come these control
> characters don't have Unicode names?

They have a non-empty "old name" field:

  0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
                               ^^^^

> Emacs should do better regarding this

As you yourself say, it already does, so I don't see the point in this
rant.


From ken.whistler at sap.com  Wed Mar 12 11:48:25 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Wed, 12 Mar 2014 16:48:25 +0000
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <83bnxbo02p.fsf@gnu.org>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>

Please be very careful here. Having a non-empty value in field 1 of
UnicodeData.txt is *not* the same has "having a Unicode name".

See:

http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

for the gory details.

The "Unicode name" is formally defined in terms of the Name property,
which itself is a combination of enumerated values extracted from
UnicodeData.txt, plus a number of rules.

For all characters whose General_Category=Cc, the formal definition
of the Name property is a null string. The string "<control>" is *never*
to be interpreted as a "Unicode name". It is a field placeholder with
legacy status. See "Interpretation of Field 1 of UnicodeData.txt" in
the section I cited above.

As far as user interfaces and other applications needing "names" for
Unicode control characters -- one of the reasons that the namespace
for Unicode characters includes all of the formal name aliases provided
in NameAliases.txt is so that applications can safely treat any formal
name alias for a control character (or the other abbreviations, etc.,
also listed in NameAliases.txt) *as if* they were Unicode names, without
running into name collisions with the actual Name property value
for Unicode characters.

The history of the name collision for the (relatively) recently encoded
U+1F514 BELL with the traditional usage for the U+0007 control function
"BELL" led the UTC to extend the namespace as noted, so we won't be
running into more such problems in the future.

If Emacs were to use "ALERT" or the abbreviation "BEL" for U+0007,
instead of "<control>", that would avoid the collision with U+1F514 BELL,
be conformant to the Unicode Standard, and presumably be helpful
to users, as well. See the entries for U+0007 in NameAliases.txt:

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation

--Ken


> > Regarding these names in ISO 6429 again, how come these control
> > characters don't have Unicode names?
> 
> They have a non-empty "old name" field:
> 
>   0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
>                                ^^^^


From eliz at gnu.org  Wed Mar 12 12:17:27 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 12 Mar 2014 19:17:27 +0200
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
Message-ID: <83a9cvnxp4.fsf@gnu.org>

> From: "Whistler, Ken" <ken.whistler at sap.com>
> Date: Wed, 12 Mar 2014 16:48:25 +0000
> Cc: "Whistler, Ken" <ken.whistler at sap.com>,
>         "unicode at unicode.org" <unicode at unicode.org>
> 
> Please be very careful here. Having a non-empty value in field 1 of
> UnicodeData.txt is *not* the same has "having a Unicode name".

You will see that I didn't refer to the Name attribute, I referred to
the old name attribute (called Unicode_1_Name in UAX#44).


From eliz at gnu.org  Wed Mar 12 12:45:07 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 12 Mar 2014 19:45:07 +0200
Subject: Names for control characters (Was: "(in 6429)" in allkeys.txt)
In-Reply-To: <CAK5xrfySqxnRirsaPkB_RY7HDggFepPy6T_tKNcd8bQA-HRxDg@mail.gmail.com>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 <83a9cvnxp4.fsf@gnu.org>
 <CAK5xrfySqxnRirsaPkB_RY7HDggFepPy6T_tKNcd8bQA-HRxDg@mail.gmail.com>
Message-ID: <838usfnwf0.fsf@gnu.org>

> Date: Wed, 12 Mar 2014 17:36:37 +0000
> From: Selvaraju Anbu Kaveeswarar <anbu at cyberservices.com>
> Cc: "Whistler, Ken" <ken.whistler at sap.com>, starback at stp.lingfil.uu.se, unicode at unicode.org
> 
> Unicode 1 names are deprecated and the new names are in their place.

Obviously, the new names are useless when they are null.  Emacs lets
users specify a character by its name, so it uses aliases for
characters whose Name property is a null string.


From mark at macchiato.com  Wed Mar 12 14:03:06 2014
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJU=?=)
Date: Wed, 12 Mar 2014 20:03:06 +0100
Subject: Beta CLDR Spec for v25 (LDML)
Message-ID: <CAJ2xs_ELMoqf6cpwGCK996O094TA+DS=5FjwEc49tBUP5QUgPQ@mail.gmail.com>

There is a beta version of the CLDR specification for version 25, with the
changes listed at:

http://www.unicode.org/reports/tr35/proposed.html#Modifications

If you have any feedback on the new sections, please submit it at
http://unicode.org/cldr/trac/newticket. If you do, please include a link to
the specific section you're commenting on. This is easy to do, since
clicking on any header puts a link to that header into your browser's
address bar.

Mark <https://google.com/+MarkDavis>

*? Il meglio ? l?inimico del bene ?*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140312/7d5c7034/attachment.html>

From starback at stp.lingfil.uu.se  Wed Mar 12 14:37:57 2014
From: starback at stp.lingfil.uu.se (Per =?iso-8859-1?Q?Starb=E4ck?=)
Date: Wed, 12 Mar 2014 20:37:57 +0100
Subject: Names for control characters
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 (Ken Whistler's message of "Wed, 12 Mar 2014 16:48:25 +0000")
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
Message-ID: <qzr467rywa.fsf@numerus.lingfil.uu.se>

Ken Whistler wrote:
> Please be very careful here. Having a non-empty value in field 1 of
> UnicodeData.txt is *not* the same has "having a Unicode name".
>
> See:
>
> http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

I know it's not a name. My question was *why* control characters don't
*have* names like

  CONTROL CHARACTER NULL
  CONTROL CHARACTER START OF HEADING
  CONTROL CHARACTER START OF TEXT
  etc.

It would be so obvious to have it like that, so I assume there is some
specific reason not to, but I still can't figure it out. For me there is
not less reason for these characters to have names than any others, so
for me it's like Linear B characters didn't have names, and I got the
answer "no problem, they have aliases, so that's OK!" This is just
strange to me. If names aren't needed, why do almost all characters have
them?

This is not about Emacs. Emacs was an example of a program that has use
for character names, and has a harder job because of this strangeness.
Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against
Emacs when I mention that this property of Unicode has led to
longstanding (small) bugs there, but I think real examples are better
than made-up ones.

> If Emacs were to use "ALERT" or the abbreviation "BEL" for U+0007, ...

Yes, programs could have their own lists of preferred aliases to use,
or have a rule such as always use the first alias, but why? Why not have
a name, so programs don't have to choose which alias to use?

(I may be coming of as having a mission about this; "it should be done
like this!!", but mostly this is just a question: "it seems obvious it
should be done like this, so what am i missing?")


From eliz at gnu.org  Wed Mar 12 15:04:45 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 12 Mar 2014 22:04:45 +0200
Subject: Names for control characters
In-Reply-To: <qzr467rywa.fsf@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 <qzr467rywa.fsf@numerus.lingfil.uu.se>
Message-ID: <837g7znpya.fsf@gnu.org>

> From: starback at stp.lingfil.uu.se (Per Starb?ck)
> Cc: Eli Zaretskii <eliz at gnu.org>, "unicode\@unicode.org" <unicode at unicode.org>
> Date: Wed, 12 Mar 2014 20:37:57 +0100
> 
> This is not about Emacs. Emacs was an example of a program that has use
> for character names, and has a harder job because of this strangeness.
> Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against
> Emacs when I mention that this property of Unicode has led to
> longstanding (small) bugs there, but I think real examples are better
> than made-up ones.

What I saw as a rant is only this part (which was the only one I
quoted):

> > Emacs should do better regarding this

"Should do better" means it still doesn't, although it's expected to.


From markus.icu at gmail.com  Wed Mar 12 15:11:13 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 12 Mar 2014 13:11:13 -0700
Subject: Names for control characters
In-Reply-To: <qzr467rywa.fsf@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 <qzr467rywa.fsf@numerus.lingfil.uu.se>
Message-ID: <CAN49p6pK7fOSUn2sDvXnKRfzoVtzEU_m9OpT54TWmk+GR-cnFA@mail.gmail.com>

On Wed, Mar 12, 2014 at 12:37 PM, Per Starb?ck
<starback at stp.lingfil.uu.se>wrote:

> My question was *why* control characters don't
> *have* names
>

That's because formally the ISO control codes do not have one fixed,
normative meaning; implementers may or may not follow ISO 6429. That is why
these don't have names in ISO 10646 and in Unicode.

http://www.unicode.org/faq/casemap_charprop.html#15

Of course, a few control codes (e.g., U+000A) are very widely used, and
have Unicode properties according to that use. (e.g., White_Space)

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140312/3c80073b/attachment.html>

From rscook at unicode.org  Wed Mar 12 15:56:07 2014
From: rscook at unicode.org (Richard COOK)
Date: Wed, 12 Mar 2014 13:56:07 -0700
Subject: CJK stroke order data: kRSUnicode v. kRSKangXi
In-Reply-To: <A382198F-840A-44AF-8231-84A3B0C8FEDF@nohejl.name>
References: <B3990F71-EAB5-4B5F-BBB4-A518F72372D8@nohejl.name>
 <DB8B32C2-2D27-4E42-AEEB-E8DCABC39F2C@unicode.org>
 <A382198F-840A-44AF-8231-84A3B0C8FEDF@nohejl.name>
Message-ID: <8CD243B4-1181-4125-A35C-8B23CCD46B93@unicode.org>

On Mar 12, 2014, at 2:59 AM, Adam Nohejl wrote:
>> 
>> Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.
> 
> OK, I will prepare a more comprehensive list. Do you mean that you would submit such a formal proposal? Or can I submit it myself somehow?

You are welcome to prepare a proposal, or just send us your list.

We have already started a proposal to augment kRSUnicode, but I'm not sure about the timeframe for completion.

The proofing of the various Kangxi properties is separate from this, but is aimed at the specific KX edition used by IRG in Extension B work.

-Richard


From ken.whistler at sap.com  Wed Mar 12 16:26:28 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Wed, 12 Mar 2014 21:26:28 +0000
Subject: Names for control characters
In-Reply-To: <qzr467rywa.fsf@numerus.lingfil.uu.se>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 <qzr467rywa.fsf@numerus.lingfil.uu.se>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F60BE@USPHLEMB12C.global.corp.sap>

Per continued:

> I know it's not a name. My question was *why* control characters don't
> *have* names like
> 
>   CONTROL CHARACTER NULL
>   CONTROL CHARACTER START OF HEADING
>   CONTROL CHARACTER START OF TEXT
>   etc.
> 
> It would be so obvious to have it like that, so I assume there is some
> specific reason not to, but I still can't figure it out. For me there is
> not less reason for these characters to have names than any others, so
> for me it's like Linear B characters didn't have names, and I got the
> answer "no problem, they have aliases, so that's OK!" This is just
> strange to me. If names aren't needed, why do almost all characters have
> them?

Ah, so this is a "Why is the sky blue?" kind of question. ;-)
And perhaps the correct response is then a Just So story...

Once upon a time, there was an ISO framework for character
encoding. Officially his name was ISO 2022 Information technology --
Character code structure and extension techniques. But we'll
think of him as the troll that lives under the bridge and just call
him "2022" for short.

Now 2022 had his favorite collection of code points that he
kept in buckets under the bridge. But he was very, very particular
about how he organized his collection. All the code points 00 to 1F
had to go in the bucket labeled "C0", and all the code points 20
to 7E had to go in the bucket labeled "G0" (or "GL" --
sometimes the troll would get confused). He had other, even
bigger code points, too, but we can save those for another story.

2022 said all the code points in the "G0" bucket could get names.
In fact they could get lots of names, if they wanted. So 2022 also
starting collecting sets of characters, where all those names were
written down. Sometimes he would "escape" to one set and admire
all those pretty names, and then he would "escape" to another set
and admire other pretty names. 2022 was a great admirer of
escaping, by the way, as well as pretty names.

But the code points in the "C0" bucket were different. 2022
insisted that those code points weren't like the ones in the
"G0" bucket, and they couldn't have names at all.
Indeed, these were very odd code points -- 2022 called
them "control functions". Sometimes when the troll took one
out of the "C0" bucket and examined it, it did one thing, but
the next time it might do something completely different.
Only 2022's friend, the troll named 6429 living under the next
bridge to the north, really understood what they might be doing from
one week to the next.

One day an aspiring young wizard named Unicode was crossing
the bridge. As an aspiring young wizard, he was rather observant.
And he noticed that there was a troll living under the bridge and
that that troll had stolen all the code points and was hoarding
them in strangely labeled buckets under the bridge. Being a wizard
and all, he knew that it was his duty to slay the troll and free all the
code points. So he set about writing down the appropriate spell in
his brand new spellbook.

Now Unicode was a very egalitarian wizard -- it just seemed right
to him that all code points should be able to have names, and it
would be better if each one had just one, unique name. That way,
none of them would get jealous of all the names some other code
point had acquired, and besides, each code point would know its name
and could come when you called it. So in the first version of
Unicode's spellbook, he wrote the spell down just that way. He
called his spell "Unicode 1.0", because, well, it was his spell,
after all, and the very first complete spell that he would be trying
to use. 00 could be called "NULL" and 01 could be called "START OF
HEADING", just like 20 could be called "SPACE" and 2D could be
called "HYPHEN-MINUS".

You may be wondering why Unicode would use such odd names
for all the code points, but then there is no accounting for the whims 
of wizards, I guess.

Well, once Unicode had finished writing down the "Unicode 1.0" spell,
he started casting it on the troll:

Shazaaaam! Ffffppfft!

To Unicode's surprise, the spell only partly worked, but then fizzled.
The troll had been badly hurt, but he was still limping around under
the bridge, and he still clung tightly to his buckets of code points.

Unicode looked around to see what the problem could be, and
noticed that there was a warlock at the other end of the bridge.
It was an infamous warlock who had taken to calling himself "10646",
and from all appearances he was *also* trying to cast a spell to
kill the troll and free all the code points. Apparently, casting the
two spells at the same time had resulted in interference in the ley lines.
That was why neither spell had fully worked, and was why the troll
2022 was still limping around with his code point buckets.

The wizard Unicode headed across the bridge to speak to the
warlock 10646:

"Look, we both want to slay that troll and free his code points.
Why don't we team up and cast synchronized spells?"

But 10646 was a suspicious warlock. He wasn't sure that *all*
of the code points could be freed safely. Who knows what mischief
they might get up to if left on their own.

"Those code points that the troll keeps in the C0 bucket are very
dangerous," said 10646. "We can't let them just be like
all the others and get ordinary names. After all, they seem to do
different things in alternate weeks, and if we give them regular
names, they might come when we call them, even if they are
doing the wrong things that week."

The wizard Unicode heaved a sigh. That seemed so silly to him.
But after all, it was important to kill the troll and save all the code
points. So he pulled out his quill and scratched lines through all
the names for the code points from the C0 bucket in his spellbook,
and decided he would call the revised spell "Unicode 1.1". It was
only a little different from his first spell -- but it is important to
keep track of these things. Spells can be dangerous things, after all.

"How does this look to you, Master Warlock?" he asked.

And 10646 nodded his cautious approval at the revision.

So then the wizard Unicode and the warlock 10646 started casting
their spells together.

Shazaamaazama! Pockety spoketi! Keeeraack!

The troll 2022 was dead! His buckets fell out of his grasp, and all
the code points were freed! But the ones that rolled out of the C0
bucket didn't have names, because Unicode had scratched out
all of their names in the Unicode 1.1 spell he cast, just so the warlock
10646 wouldn't interfere by casting a counterspell for them.

And that is why control characters don't have names.


From sdaoden at yandex.com  Thu Mar 13 04:57:15 2014
From: sdaoden at yandex.com (Steffen Nurpmeso)
Date: Thu, 13 Mar 2014 10:57:15 +0100
Subject: Names for control characters
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F60BE@USPHLEMB12C.global.corp.sap>
References: <qzzjkwwv4s.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F0138@USPHLEMB12C.global.corp.sap>
 <qzsiqocs41.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F02BE@USPHLEMB12C.global.corp.sap>
 <qzvbvja980.fsf_-_@numerus.lingfil.uu.se> <83bnxbo02p.fsf@gnu.org>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F4BCA@USPHLEMB12C.global.corp.sap>
 <qzr467rywa.fsf@numerus.lingfil.uu.se>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F60BE@USPHLEMB12C.global.corp.sap>
Message-ID: <20140313095715.N6t2V2vv1g3WuIu0nm6/oFt7@dietcurd.local>

 |So then the wizard Unicode and the warlock 10646 started casting
 |their spells together.

Fantastic reading.

 |Shazaamaazama! Pockety spoketi! Keeeraack!

History is made by winners.

--steffen


From wjgo_10009 at btinternet.com  Fri Mar 14 06:41:49 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 14 Mar 2014 11:41:49 +0000 (GMT)
Subject: Colour font, color font, colourfont, colorfont
Message-ID: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com>

Colour font, color font, colourfont, colorfont

Many documents use the American English, color font.

Yet will the International Standard use the en-gb-oed English colour font as the spelling and also spell colourize with a -ize ending?

Would it be better to use colourfont as that would be more easily searchable on the web and would, perhaps, be more precise as to meaning and reduce the possibility of ambiguity?

How would the term be expressed in other languages?

Would German use a new single word?

What would be the term in French?

How would the name of the technology localize into the languages of the world?

Is it a good idea to try to standardize the parlance and the localization of the parlance of the technology now?

William Overington

14 March 2014


From alex.plantema at xs4all.nl  Fri Mar 14 07:36:15 2014
From: alex.plantema at xs4all.nl (Alex Plantema)
Date: Fri, 14 Mar 2014 13:36:15 +0100
Subject: Colour font, color font, colourfont, colorfont
References: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com>
Message-ID: <AE7C096DE40C4DD3839A657063C1E77A@p4>

Op vrijdag 14 maart 2014 12:41 schreef William_J_G Overington:

> Colour font, color font, colourfont, colorfont
>
> Many documents use the American English, color font.
>
> Yet will the International Standard use the en-gb-oed English colour
> font as the spelling and also spell colourize with a -ize ending?
>
> Would it be better to use colourfont as that would be more easily
> searchable on the web and would, perhaps, be more precise as to
> meaning and reduce the possibility of ambiguity?
>
> How would the term be expressed in other languages?
>
> Would German use a new single word?
>
> What would be the term in French?
>
> How would the name of the technology localize into the languages of
> the world?
>
> Is it a good idea to try to standardize the parlance and the
> localization of the parlance of the technology now?

Colouri(z|s)e isn't in my dictionary; colour is already a verb as well.
German: Farbenschriftschnitt, French: Fonte de caract?res en couleurs.
Btw, font is spelled fount in British English.

Alex.


From jcb+unicode at inf.ed.ac.uk  Fri Mar 14 08:21:53 2014
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Fri, 14 Mar 2014 13:21:53 GMT
Subject: Colour font, color font, colourfont, colorfont
References: <1394797309.14927.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <AE7C096DE40C4DD3839A657063C1E77A@p4>
Message-ID: <slrnli60jh.6ue.jcb@coffee.inf.ed.ac.uk>

On 2014-03-14, Alex Plantema <alex.plantema at xs4all.nl> wrote:
> Colouri(z|s)e isn't in my dictionary; colour is already a verb as well.

Get a better dictionary. The word has been in the language more than
four hundred years. It currently has a fairly common technical meaning
of "adding colour to old monochrome photos or films". In any case, you
don't need a dictionary, because -ize is a productive formation.

> Btw, font is spelled fount in British English.

Suggest you don't propound on languages other than your own.
That used to be true in the days of metal type, although even so both
spellings have been in use through the last few centuries. Then,
"fount" was a technical term that few people would have cause to use.
With the advent of computers, the "font" spelling has completely
supplanted the "fount" spelling in everyday usage.
Within the industry, some current British letterpress printers use
"fount" for metal type and "font" for digital type, while others use
"font" for both.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From naenaguru at gmail.com  Sat Mar 15 23:12:48 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Sat, 15 Mar 2014 23:12:48 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>

I made a presentation demonstrating Dual-script Singhala at National
Science Foundation of Sri Lanka. Most of the attendees were government
employees and media representatives; a few private citizens came too.

Dual-script Singhala means romanized Singhala that can be displayed either
in the Latin script or in the Singhala script using an Orthographic Smart
Font. It is easy to input (phonetically) using a keyboard layout slightly
altered from QWERTY. The font uses Standard Ligature feature <liga> of
OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as
well as many Singhala letters. The font is supported across all OSs:
Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the
proper and complete solution on the computer for the Singhala script used
to write Singhala, Sanskrit and Pali languages. The same solution can be
applied for all Indic languages.

The government ministries, media and people welcomed it with enthusiasm and
relief that there is something practical for Singhala. The response in the
country was singularly positive, except for the person that filibustered
the Q&A session of the presentation that spoke about the hard work done on
Unicode Sinhala, clearly outside the subject matter of the presentation.

The result of the survey passed around was 100% as below (translated from
Singhala):

   1. I believe that Dual-script Singhala is convenient to me as it is
   implemented similar to English - Yes
   2. Today everyone uses Unicode Sinhala. It is easy and has no problems -
   No
   3. The cost of Unicode Sinhala should be eliminated by switching to
   Dual-scrip Singhala - Yes
   4. We should amend Pali text in the Tripitaka according to rulings of
   SLS1134 - No
   5. Digitizing old books is a very important thing - Yes
   6. We should focus on making this easy-to-use Dual-script Singhala
   method a standard - Yes

Please comment or send questions.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140315/937ef921/attachment.html>

From verdy_p at wanadoo.fr  Sun Mar 16 00:36:41 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 16 Mar 2014 06:36:41 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
Message-ID: <CAGa7JC31Ehcpj291iVJqfnMNbNBF4GiFp7tLdwTh9ALTNcWX3g@mail.gmail.com>

Don't you realize that what you are trying to create is completely out of
topic of Unicode, as it is simply another new 8-bit encoding similar to
what ISCII does for supporting multiple Indic scripts with a common
encoding/transcoding table?

The ISCII standard has shown its limitations, it cannot be enough to
support all scripts correctly and completely, it has lots of unsolved
ambiguities for tricky cases or historic orthographies, or newer
orthographies, that the UCS encoding better supports due to its larger
character set and more precise character properties and algorithms.

You are in fact creating a transcoding table... Except that you are mixing
the concepts; and the Unicode and ISO technical commitees working on the
UCS don"t need to handle new 8-bit encodings. And you'll soon experiment
the same problems as in ISCII and all other legacy 8-bit encodings: very
poor INTEROPERABILITY due to version tracking or complax contextual rules...

You may still want to promote it at some government or education
institution, in order to promote it as a national standard, except that
there's little change it will ever happen when all countries in ISO have
stopoed working on standardization of new 8-bit encodings (only a few ones
are maintained; but these are the most complex ones used in China and Japan.

Well in fact only Japan now seens to be actively updating its legacy JIS
standard; but only with the focus of converging it to use the UCS and solve
ambiguities or solve some technical problems (e.g. with emojis used by
mobile phone operators). Even China stopped updating its national standard
by publishing a final mapping table to/from the full UCS (including for
characters still not encoded in the UCS): this simplified the work because
only one standard needs to be maintained instead of 2.

Note that as long there will not be any national standard supporting your
proposed encodng, there is no chance that the font standards will adopt it.
You may still want to register your encoding in the IANA registry, but
you'll need to pass the RFC validation. And there are lots of technical
details missing in your proposal so that it can work for supporting it with
a standard mapping in fonts.

There is better chance for you to pomote it only as a transliteration
scheme, or as an input method for leyboard layout (both are also not in the
scope of the Unicode and ISO/ISC 10646 standards though, they could be in
the scope of the CLDR project, which is not by itself a standard but just a
repository of data, supported by a few standards)... Think about it.


2014-03-16 5:12 GMT+01:00 Naena Guru <naenaguru at gmail.com>:

> I made a presentation demonstrating Dual-script Singhala at National
> Science Foundation of Sri Lanka. Most of the attendees were government
> employees and media representatives; a few private citizens came too.
>
> Dual-script Singhala means romanized Singhala that can be displayed either
> in the Latin script or in the Singhala script using an Orthographic Smart
> Font. It is easy to input (phonetically) using a keyboard layout slightly
> altered from QWERTY. The font uses Standard Ligature feature <liga> of
> OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as
> well as many Singhala letters. The font is supported across all OSs:
> Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the
> proper and complete solution on the computer for the Singhala script used
> to write Singhala, Sanskrit and Pali languages. The same solution can be
> applied for all Indic languages.
>
> The government ministries, media and people welcomed it with enthusiasm
> and relief that there is something practical for Singhala. The response in
> the country was singularly positive, except for the person that
> filibustered the Q&A session of the presentation that spoke about the hard
> work done on Unicode Sinhala, clearly outside the subject matter of the
> presentation.
>
> The result of the survey passed around was 100% as below (translated from
> Singhala):
>
>    1. I believe that Dual-script Singhala is convenient to me as it is
>    implemented similar to English - Yes
>    2. Today everyone uses Unicode Sinhala. It is easy and has no problems
>    - No
>    3. The cost of Unicode Sinhala should be eliminated by switching to
>    Dual-scrip Singhala - Yes
>    4. We should amend Pali text in the Tripitaka according to rulings of
>    SLS1134 - No
>    5. Digitizing old books is a very important thing - Yes
>    6. We should focus on making this easy-to-use Dual-script Singhala
>    method a standard - Yes
>
> Please comment or send questions.
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140316/2f6e8714/attachment.html>

From prosfilaes at gmail.com  Sun Mar 16 02:15:24 2014
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 16 Mar 2014 00:15:24 -0700
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
Message-ID: <CAMZ=zj7rTYjKF2MAsaFgSeqVD2DBQvFER0YhBofNS_8_+UpHZQ@mail.gmail.com>

On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru <naenaguru at gmail.com> wrote:
> I made a presentation demonstrating Dual-script Singhala at National Science
> Foundation of Sri Lanka. Most of the attendees were government employees and
> media representatives; a few private citizens came too.

I don't know what the point was of sending this message. You claim
that Unicode Sinhala was outside the subject matter of the
presentation, so why would you post it to this list?

-- 
Kie ekzistas vivo, ekzistas espero.


From everson at evertype.com  Sun Mar 16 06:18:38 2014
From: everson at evertype.com (Michael Everson)
Date: Sun, 16 Mar 2014 11:18:38 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
Message-ID: <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com>


On 16 Mar 2014, at 04:12, Naena Guru <naenaguru at gmail.com> wrote:

> Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font?

What a terrible, terrible idea. You are essentially promoting giving up writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack. 

> Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages.

No, it isn?t. It?s a huge step backwards, unless you propose abolishing the Sinhala script entirely and just writing in Latin. 

> The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation.

That person understood the nature of data integrity. As does everyone who cares about the Universal Character Set. 

Michael Everson * http://www.evertype.com/


From jf at colson.eu  Sun Mar 16 07:12:16 2014
From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=)
Date: Sun, 16 Mar 2014 13:12:16 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAMZ=zj7rTYjKF2MAsaFgSeqVD2DBQvFER0YhBofNS_8_+UpHZQ@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <CAMZ=zj7rTYjKF2MAsaFgSeqVD2DBQvFER0YhBofNS_8_+UpHZQ@mail.gmail.com>
Message-ID: <53259520.8000206@colson.eu>

Le 16/03/14 08:15, David Starner a ?crit :
> On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru <naenaguru at gmail.com> wrote:
>> I made a presentation demonstrating Dual-script Singhala at National 
>> Science
>> Foundation of Sri Lanka. Most of the attendees were government 
>> employees and
>> media representatives; a few private citizens came too.
> I don't know what the point was of sending this message. You claim
> that Unicode Sinhala was outside the subject matter of the
> presentation, so why would you post it to this list?
>

I think the point is in the survey : the questionned persons would have 
answered:
-- that they believe that Dual-script Singhala is convenient to them,
-- that Unicode Sinhala isn't easy and/or has problems,
-- that the cost of Unicode Sinhala should be eliminated by switching to 
Dual-scrip Singhala
-- and that they should focus on making this "easy-to-use" Dual-script 
Singhala method a standard.


The big question is what is difficult in Unicode Sinhala. Is there 
anything Unicode could do to change that feeling?

What's precisely the cost of Unicode Sinhala? Does its use require a 
teaching period the hack wouldn't need? Why?

Le 16/03/14 08:15, David Starner a ?crit :
> On Sat, Mar 15, 2014 at 9:12 PM, Naena Guru <naenaguru at gmail.com> wrote:
>> I made a presentation demonstrating Dual-script Singhala at National Science
>> Foundation of Sri Lanka. Most of the attendees were government employees and
>> media representatives; a few private citizens came too.
> I don't know what the point was of sending this message. You claim
> that Unicode Sinhala was outside the subject matter of the
> presentation, so why would you post it to this list?
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140316/6dec5e19/attachment.html>

From wjgo_10009 at btinternet.com  Sun Mar 16 08:10:13 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sun, 16 Mar 2014 13:10:13 +0000 (GMT)
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
Message-ID: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>

Thank you for starting this thread.

It is good to read of developments.

I remembered a system that I designed many years ago for entering Esperanto text using an ordinary keyboard.

Some years ago I included it in a story.

http://www.users.globalnet.co.uk/~ngo/euto0008.htm

The idea was that characters not on an ordinary QWERTY keyboard could be entered using an ordinary QWERTY keyboard.

If that idea were implemented today then it could be used to enter Esperanto text, with the keystrokes converted into Unicode characters.

However, that system was just for entering a few accented characters into a text written in Latin script and Esperanto does not have ligatures.

Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard?

> It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. 

How is the keyboard altered from QWERTY please?

Are you publishing the font please?

So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system?

William Overington

16 March 2014


From wjgo_10009 at btinternet.com  Sun Mar 16 11:05:45 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sun, 16 Mar 2014 16:05:45 +0000 (GMT)
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>

> So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system?

Could this be achieved if a text-processing software package were produced that could automatically perform a character string to character string substitution (namely Romanized Singhala character string to Unicode character string) that would be applied before any OpenType glyph to glyph substitution?

The character string to character string substitution rules could be stored in a text file, such as a UTF-16 text file saved from WordPad, that format being what WordPad describes as a Unicode Text Document file type.

Could this be achieved?

If so, text entry could use an ordinary QWERTY keyboard and yet the resulting text would be stored using the appropriate Unicode characters for the script and the font would use Unicode mappings.

William Overington

16 March 2014


From asmusf at ix.netcom.com  Sun Mar 16 13:15:00 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 16 Mar 2014 11:15:00 -0700
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
Message-ID: <5325EA24.60503@ix.netcom.com>

On 3/16/2014 9:05 AM, William_J_G Overington wrote:
>> So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system?
> Could this be achieved if a text-processing software package were produced that could automatically perform a character string to character string substitution (namely Romanized Singhala character string to Unicode character string) that would be applied before any OpenType glyph to glyph substitution?
>
> The character string to character string substitution rules could be stored in a text file, such as a UTF-16 text file saved from WordPad, that format being what WordPad describes as a Unicode Text Document file type.
>
> Could this be achieved?
It's software. What do you think?

:)

A./
>
> If so, text entry could use an ordinary QWERTY keyboard and yet the resulting text would be stored using the appropriate Unicode characters for the script and the font would use Unicode mappings.
>
> William Overington
>
> 16 March 2014
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>


From prosfilaes at gmail.com  Sun Mar 16 13:37:41 2014
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 16 Mar 2014 11:37:41 -0700
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <53259520.8000206@colson.eu>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <CAMZ=zj7rTYjKF2MAsaFgSeqVD2DBQvFER0YhBofNS_8_+UpHZQ@mail.gmail.com>
 <53259520.8000206@colson.eu>
Message-ID: <CAMZ=zj4R75k6UNRytYj6PShxo7Ovi-X3Uswtv+kGJdo5DCm3kw@mail.gmail.com>

On Sun, Mar 16, 2014 at 5:12 AM, Jean-Fran?ois Colson <jf at colson.eu> wrote:
> Le 16/03/14 08:15, David Starner a ?crit :
> > I don't know what the point was of sending this message. You claim
> > that Unicode Sinhala was outside the subject matter of the
> > presentation, so why would you post it to this list?
>
>
> I think the point is in the survey :

I suspect I could get a similar group to agree that all programming
should be done in ALGOL, too. Feed a well-done seminar to the right
people who aren't well-educated in the subject, and you'll get
whatever results you want from the survey.

> The big question is what is difficult in Unicode Sinhala. Is there anything
> Unicode could do to change that feeling?

Naena Guru doesn't care. In fact, Naena Guru's seminar contributed to
that feeling before the survey. Without the input of some Sinhala user
that doesn't have an ax to grind, there's not much that can be done.

-- 
Kie ekzistas vivo, ekzistas espero.


From jf at colson.eu  Sun Mar 16 14:52:40 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Sun, 16 Mar 2014 20:52:40 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <53260108.6080209@colson.eu>

Le 16/03/14 14:10, William_J_G Overington a ?crit :
> Thank you for starting this thread.
>
> It is good to read of developments.
>
> I remembered a system that I designed many years ago for entering Esperanto text using an ordinary keyboard.
>
> Some years ago I included it in a story.
>
> http://www.users.globalnet.co.uk/~ngo/euto0008.htm
>
> The idea was that characters not on an ordinary QWERTY keyboard could be entered using an ordinary QWERTY keyboard.

That?s the raison-d??tre of the Compose key available on most Linux/Unix 
computers:
you type compose, apostrophe, e and you get a ?;
you type compose, a, e and you get a ?;
you type compose, question mark, plus, o and you get a ?;
you type compose, 5, 8 and you get a ?;
etc.

>
> If that idea were implemented today

It is! But neither on Windows nor on MacOS.

>   then it could be used to enter Esperanto text,

That is possible.
For ?, you can type compose+ ^ + C.
For ?, you can type compose + ^ + c.
For ?, you can type compose + ^ + G.
For ?, you can type compose + ^ + g.
For ?, you can type compose + ^ + H.
For ?, you can type compose + ^ + h.
For ?, you can type compose + ^ + J.
For ?, you can type compose + ^ + j.
For ?, you can type compose + ^ + S.
For ?, you can type compose + ^ + s.
For ?, you can type compose + U + U or compose + b + U.
For ?, you can type compose + U + u or compose + u + u or compose + b + u.

The problem is that, for a letter as frequent as ? in Esperanto, typing 
compose + (shift + 6) + c isn?t very ergonomic: a dedicated keyboard 
layout is better.


>   with the keystrokes converted into Unicode characters.
>
> However, that system was just for entering a few accented characters into a text written in Latin script and Esperanto does not have ligatures.
>
> Is the Romanized Singhala system a way to enter the characters into a computer using only a QWERTY keyboard?
>
>> It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY.
> How is the keyboard altered from QWERTY please?
>
> Are you publishing the font please?

In fact, I think he was speaking of the bare American (US) qwerty. An 
international version of it should do the job.

Looking at his site http://lovatasinhala.com/ and making a copy and 
paste of the page contents, you see he uses 7-bit ASCII, a few Latin-1 
accented vowels, and a few additional ?letters? such as ?, ?, ?, ? and ?.

Naena Guru?s aim is not to make an input method to type Sinhalese. 
Sinhalese keyboards layouts already exist:
http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html
http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html
http://kaputa.com/uniwriter/apple.gif
http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html

His aim is rather to make an 8-bit font to replace that ?difficult? and 
?expensive? Unicode compliant Sinhalese.

>
> So, everyone, can the Romanized Singhala system be used with a QWERTY keyboard to produce Unicode-encoded text, thereby producing a good combined system?

Of course. Everything can be produced with a QWERTY keyboard ifever you 
provide an appropriate driver.

>
> William Overington
>
> 16 March 2014
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From lang.support at gmail.com  Sun Mar 16 15:05:22 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Mon, 17 Mar 2014 07:05:22 +1100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <53260108.6080209@colson.eu>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <53260108.6080209@colson.eu>
Message-ID: <CAGJ7U-WQyxg69pC2sTU4dKF6XULZbgBy73Gkw=yyGomFr1=eJg@mail.gmail.com>

On 17/03/2014 6:55 AM, "Jean-Fran?ois Colson" <jf at colson.eu> wrote:
>
> Le 16/03/14 14:10, William_J_G Overington a ?crit :
>

>>
>> Is the Romanized Singhala system a way to enter the characters into a
computer using only a QWERTY keyboard?
>>
>>> It is easy to input (phonetically) using a keyboard layout slightly
altered from QWERTY.
>>
>> How is the keyboard altered from QWERTY please?
>>
>> Are you publishing the font please?
>
>
> In fact, I think he was speaking of the bare American (US) qwerty. An
international version of it should do the job.
>
> Looking at his site http://lovatasinhala.com/ and making a copy and paste
of the page contents, you see he uses 7-bit ASCII, a few Latin-1 accented
vowels, and a few additional ?letters? such as ?, ?, ?, ? and ?.
>

He also makes a case distinction,  where upper and lowercase versions of
some characters produce different Sinhala characters.

> Naena Guru?s aim is not to make an input method to type Sinhalese.
Sinhalese keyboards layouts already exist:
> http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsn1.html
> http://www.microsoft.com/resources/msdn/goglobal/keyboards/kbdsw09.html
> http://kaputa.com/uniwriter/apple.gif
> http://www.nongnu.org/sinhala/doc/keymaps/sinhala-keyboard_3.html
>
> His aim is rather to make an 8-bit font to replace that ?difficult? and
?expensive? Unicode compliant Sinhalese.
>

Creating a new set of difficulties.

Andrew
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/133de16d/attachment.html>

From doug at ewellic.org  Sun Mar 16 15:30:24 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 16 Mar 2014 14:30:24 -0600
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

>> The idea was that characters not on an ordinary QWERTY keyboard could
>> be entered using an ordinary QWERTY keyboard.
>
> That?s the raison-d??tre of the Compose key available on most Linux/
> Unix computers:

>> If that idea were implemented today
>
> It is! But neither on Windows nor on MacOS.

There are plenty of dead-key keyboard layouts available for Windows and 
Mac computers. The sequences are different from using a Compose key, but 
the principle is the same.

As Jean-Fran?ois observed, the keyboard layout wasn't really the OP's 
point.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From jf at colson.eu  Sun Mar 16 15:50:47 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Sun, 16 Mar 2014 21:50:47 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell>
References: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell>
Message-ID: <53260EA7.70406@colson.eu>

Le 16/03/14 21:30, Doug Ewell a ?crit :
> Jean-Fran?ois Colson <jf at colson dot eu> wrote:
>
>>> The idea was that characters not on an ordinary QWERTY keyboard could
>>> be entered using an ordinary QWERTY keyboard.
>>
>> That?s the raison-d??tre of the Compose key available on most Linux/
>> Unix computers:
>
>>> If that idea were implemented today
>>
>> It is! But neither on Windows nor on MacOS.
>
> There are plenty of dead-key keyboard layouts available for Windows 
> and Mac computers. The sequences are different from using a Compose 
> key, but the principle is the same.

Of course, I know that. I already have examined the default keyboard 
layouts for Windows 
http://msdn.microsoft.com/en-us/goglobal/bb964651.aspx (there are a few 
mistakes on those maps), MacOS and GNU/Linux (folder 
/usr/share/X11/xkb/symbols/).
My own everyday keyboard layout has no less than 20 (twenty) dead keys.

The idea here was ?that characters not on an ordinary QWERTY keyboard 
could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any 
dead keys on an _ordinary_ (i.e. not one using an international(ized) 
driver) QWERTY keyboard?

If a character is available by a dead key, isn?t it on the keyboard ?

>
> As Jean-Fran?ois observed, the keyboard layout wasn't really the OP's 
> point.
>
> -- 
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ?
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From marc at keyman.com  Sun Mar 16 16:09:42 2014
From: marc at keyman.com (Marc Durdin)
Date: Sun, 16 Mar 2014 21:09:42 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell>
References: <1C7B2DBDE46B46E3B58F7B7468D458B4@DougEwell>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD0B699@federation.tavultesoft.local>

To me the real question is, what are the roadblocks that the other people at this forum saw in using Unicode?  I'm not talking about the proponents of non-Unicode solutions, but those who would otherwise be agnostic given the right support.  And what can we do to address those concerns?

(1) Rendering support still lags -- if the characters don't render properly in Unicode but they do in HackAscii, then HackAscii wins.  Does any operating system renderer today support all the complex scripts in Unicode 6?  How many users need to upgrade their OS in order to get access to a working renderer?  What about mobile operating systems?

(2) Fonts are much harder to create.  Instead of just needing a graphic designer to draw characters, you now need to a programmer as well, who understands OpenType tables.  This is especially a problem for the very popular decorative fonts, which are created by graphic design houses with little interest in the finer nuances of shaping rules.  Again, HackAscii wins.

(3) Many of the Unicode input methods have been hard for end users to adapt to.  I've pushed Unicode in this space for nearly 20 years, but even today, I continue run up against points (1) and (2) with language partners.  HackAscii has slightly less of an advantage here, because you still tend to need some intelligence in your keyboard layout for most HackAscii solutions.

Of course these are solvable.   But when a HackAscii proponent can demonstrate an easier solution, then the slightly more subtle advantages of Unicode tend to be lost in the simple fact that for /my language/, HackAscii "just works".  It's hard to argue the advantages of Unicode when you cannot show a working solution.   And arguing the disadvantages of HackAscii is pointless until you can demonstrate the alternative working to the user's satisfaction.

Marc

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Monday, 17 March 2014 7:30 AM
To: unicode at unicode.org
Cc: Jean-Fran?ois Colson
Subject: Re: Romanized Singhala got great reception in Sri Lanka

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

>> The idea was that characters not on an ordinary QWERTY keyboard could 
>> be entered using an ordinary QWERTY keyboard.
>
> That?s the raison-d??tre of the Compose key available on most Linux/ 
> Unix computers:

>> If that idea were implemented today
>
> It is! But neither on Windows nor on MacOS.

There are plenty of dead-key keyboard layouts available for Windows and Mac computers. The sequences are different from using a Compose key, but the principle is the same.

As Jean-Fran?ois observed, the keyboard layout wasn't really the OP's point.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 

_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode


From doug at ewellic.org  Sun Mar 16 18:47:24 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 16 Mar 2014 17:47:24 -0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

> The idea here was ?that characters not on an ordinary QWERTY keyboard
> could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any
> dead keys on an _ordinary_ (i.e. not one using an international(ized)
> driver) QWERTY keyboard?

Not on the standard vanilla U.S. keyboard. It has to be provided by the 
OS, via a driver, just as Compose key support has to be provided by the 
OS.

The standard vanilla U.S. keyboard also doesn't provide the accented 
letters and other non-ASCII letters like ? that Naena Guru uses for his 
font hack.

> If a character is available by a dead key, isn?t it on the keyboard ?

It depends on what you mean by "on the keyboard." Thanks to John Cowan's 
delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 
to get the fraction ? (one-seventh). That character is not "on the 
keyboard" in any sense other than what the driver provides.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From doug at ewellic.org  Sun Mar 16 20:01:08 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 16 Mar 2014 19:01:08 -0600
Subject: [private] Re: Unicode : Greek Extended.
Message-ID: <201403170102.s2H124d5019796@unicode.org>

No, the precomposed characters that were added for compatibility were added back in 1992 or so. It was still possible then, not now.

The problem with Marshallese is that long ago, people thought the difference between cedilla below and comma below was just a glyph choice, so fonts were built that showed either cedilla or comma according to the whim of the designer. It turns out to matter a great deal to some people, so there is now a scramble to complete the disunification. I don't know about the Yoruba line-below.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

-----Original Message-----
From: "Richard BUDELBERGER" <budelberger.richard at wanadoo.fr>
Sent: ?3/?16/?2014 18:51
To: "Doug Ewell" <doug at ewellic.org>
Cc: "everson at evertype.com" <everson at evertype.com>; "cowan at mercury.ccil.org" <cowan at mercury.ccil.org>; "unicode at unicode.org" <unicode at unicode.org>
Subject: re: [private] Re: Unicode : Greek Extended.

Thanks to all of you for your answers :

> Message du 16/03/14 16:41
> De : Doug Ewell
> A : budelberger.richard at wanadoo.fr
> Copie ? : 
> Objet : [private] Re: Unicode : Greek Extended.
> 
> Richard BUDELBERGER wrote:
> 
> > A little off topic, but can somebody help me to add three (six) more

Read ??three?(five)??.

> > characters to Unicode? that is :
> > ? GREEK CAPITAL LETTER SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;
> > ? GREEK SMALL LETTER SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;

> > ? GREEK LETTER STIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;
> > ? GREEK SMALL LETTER STIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;

Read ??GREEK SMALL LETTER FINAL SIGMA WITH LINE (or MACRON) BELOW : ?? ?? ;??.

> > ? GREEK CAPITAL LETTER CHI WITH LINE (or MACRON) BELOW : ?? ?? ;
> > ? GREEK SMALL LETTER CHI WITH LINE (or MACRON) BELOW : ?? ??.
> >
> > See some samples here :
> > https://fr.wiktionary.org/wiki/Category:syriaque.
> 
> Both you and Wiktionary proved why these precomposed characters won't be 
> accepted: because they can already be represented using combining 
> characters. If this were something that could not already be represented 
> any other way, then it would be different.
> 
> Some fonts don't display this correctly; they show the macron partially 
> or completely to the right of the base letter, instead of directly below 
> it. The solution is to use another font, and to ask font vendors to fix 
> this combination so it looks decent.
> 
> The correct combining character is U+0331. U+0332 is intentionally a 
> very long line, suitable for math-type applications. All of the existing 
> "WITH LINE BELOW" characters (which were added for compatibility with 
> existing character sets) decompose to U+0331; nothing decomposes to 
> U+0332.

So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them?(?? ?? ?? and H? ?? ??) for 
compatibility???

There was the same problem with Yoruba letters with U+0329 ? ???, and with the new Marshallese alphabet with U+0326 ?????.


R. Budelberger.

> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ? 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140316/87ff55be/attachment.html>

From naenaguru at gmail.com  Sun Mar 16 21:44:08 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Sun, 16 Mar 2014 21:44:08 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAGa7JC31Ehcpj291iVJqfnMNbNBF4GiFp7tLdwTh9ALTNcWX3g@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <CAGa7JC31Ehcpj291iVJqfnMNbNBF4GiFp7tLdwTh9ALTNcWX3g@mail.gmail.com>
Message-ID: <CAHK3Hy1Uhw8tWLBp8npL2xNOs=MDFXymsN+Ek8ypaAQWs39qxg@mail.gmail.com>

Philippe:

All you said about ISCII is probably right. So, it has given you guys a lot
of pain. I did not do it nor followed it.

As for Japanese (and also for Indic) I have read the warnings in RFC 1815:
http://tools.ietf.org/rfc/rfc1815.txt

I am not creating a transcoding table as you say. I assume you think I take
Unicode Sinhala to be a legitimate encoding for Singhala that I am mapping
to SBCS for the love of SBCS. No. And I don't know what concepts I am
mixing. I am trained in Computer Science, I have taught it at college
level, and have done years of consulting work and written project proposals
for a pretty good size one for the Federal Government too.

I believe that you need to understand the problem at hand to find a
solution for it. You cannot make solutions for Indic not knowing Indic.
Starting blindly with ISCII was a mistake. It is useless at least for
Singhala.

========= STORY OF UNICODE SINHALA ==========
The first draft for the Sinhala chart was handwritten by Andy Daniels. He
mentioned some doubts about some letters in it. He had a good instinct on
that. It sat there people wondering from where he got his information. He
said from Germany. Someone said that it came from a $300 book. I suspect
that it is Rev. Fr. A.M.Gunasekara's book (1891).

Then came the Lion of Unicode Michael Everson (down in this thread). He was
making fonts by the dozen and took Daniels' draft certified the letters
side of it, not having a nicely printed set of the digits. This certificate
was countersigned by a Mettavihari for users. I know Ven. Mettavihari. He
is a Danish man that researched and put up the most comprehensive
Tripitaka, the Buddhist canon. This irreproachable man denies that he
endorsed the standard on behalf of the Singhalese saying obviously he is
not Singhalese. (Actually, I think he is more Singhalese than me). Who
signed as him, a forgery?

When the code chart came to Lanka, the closest to a computer that they knew
was the IBM Selectric typewriter. When they did not do anything about it,
the World Bank offered a $83 scheme to bring Lanka to the computer age all
the way so the village fellow could communicate with the government online.
They set up the IT agency ICTA and got the academics gathered there doing
'projects'. They even paid a fellow to come over and read the OpenType
specification for them. I understand that the kingpin of the operations
there is one person that studied in US.He is the adviser to the President,
The top Colombo University and the ICTA itself. He is one consultant that
does most projects.

When Everson wanted to add the digits apparently finding Fr. Gunasekara's
book, the Lankans denied such existed. When he showed them, they said they
are not necessary. Now this everybody's consultant announced at my
presentation that they are going to add them.
============ END STORY OF UNICODE SINHALA ==========

BAD UNICODE SINHALA:
Unicode Singhala violates Singhala / Sanskrit grammar. Unicode Singhala is
not compatible with Sanskrit, an integral part of the Singhala script. That
also applies to Pali whose native script is Singhala. Unicode Sinhala
further helps kill Singhala by making it very difficult to type and
impossible to obtain the entire repertoire of letters and limiting the
applications and OSs that it can be used in.

Typing Unicode Sinhala requires you to learn a key map that is entirely
different from the familiar English keyboard, while losing some marks and
signs too. There is a program called Helabasa by Keyman typing system that
printers use to type it. There is a physical keyboard too. Then there is
Google transliteration - very inadequate and another one by Colombo
University found on a web page. These last two allow you to type
phonetically but not entirely. The result is very few people type Unicode
Singhala, only those that their job requires them to type Unicode Singhala.

PERFECT ROMANIZED SINGHALA
I did the same thing English and Western European languages did; very
close. I mapped the well-known 58+2 Singhala-Sanskrit phonemes in the SBCS.
The reason is because then Singhala gets to use all those applications
perfected over decades that most here Westerners enjoy. That set covers all
letters necessary for Singhala, Sanskrit and  Pali, the three languages
that use the Singhala script.

See it here displayed using the first orthographic smartfont:
http://lovatasinhala.com/

MORE READING:
Let's look at this as a lay person (whose interest is our ultimate goal)
sees:

English was fully romanized from fu?ark by about 600 AD. Romanizing is
writing by using letters of the Latin alphabet plus many, many others added
to it. All Europeans when they became fully Christianized / literate, they
all adopted Latin letters and extended them as they pleased. This set has
branched off as Latin script and Cyrillic script. Printing industry
standardized the greater part of the alphabets.

Singhala has a well defined phoneme chart called hodiya. It is an extension
of the Sanskrit hodiya. Rev. Fr. Theodore G. Perera's grammar book (1932)
and Rev. Fr. A. M. Gunasekera's book (1891) that dug up sinking Singhala
fully describe the writing system. Like most other languages, including
English before printing arrived in England, it is written phonetically.

Singhala was romanized first in 1860s by Rhys Davids, called PTS scheme, to
print Pali (Magadhi) in the Latin script. This requires letters with bars
(macron) and dots not found in common fonts. This scheme is called PTS
Pali. It is similar to IAST Sanskrit. It is impossible to type these on the
regular keyboard.

I freshly romanized Singhala by mapping its phonemes to the SAME area 13
Western European languages mapped their alphabetic letters within the
following Unicode code charts:
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf

So, if that is "creating a transcoding table" all Europeans did it and I do
it too.


On Sun, Mar 16, 2014 at 12:36 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Don't you realize that what you are trying to create is completely out of
> topic of Unicode, as it is simply another new 8-bit encoding similar to
> what ISCII does for supporting multiple Indic scripts with a common
> encoding/transcoding table?
>
> The ISCII standard has shown its limitations, it cannot be enough to
> support all scripts correctly and completely, it has lots of unsolved
> ambiguities for tricky cases or historic orthographies, or newer
> orthographies, that the UCS encoding better supports due to its larger
> character set and more precise character properties and algorithms.
>
> You are in fact creating a transcoding table... Except that you are mixing
> the concepts; and the Unicode and ISO technical commitees working on the
> UCS don"t need to handle new 8-bit encodings. And you'll soon experiment
> the same problems as in ISCII and all other legacy 8-bit encodings: very
> poor INTEROPERABILITY due to version tracking or complax contextual rules...
>
> You may still want to promote it at some government or education
> institution, in order to promote it as a national standard, except that
> there's little change it will ever happen when all countries in ISO have
> stopoed working on standardization of new 8-bit encodings (only a few ones
> are maintained; but these are the most complex ones used in China and Japan.
>
> Well in fact only Japan now seens to be actively updating its legacy JIS
> standard; but only with the focus of converging it to use the UCS and solve
> ambiguities or solve some technical problems (e.g. with emojis used by
> mobile phone operators). Even China stopped updating its national standard
> by publishing a final mapping table to/from the full UCS (including for
> characters still not encoded in the UCS): this simplified the work because
> only one standard needs to be maintained instead of 2.
>
> Note that as long there will not be any national standard supporting your
> proposed encodng, there is no chance that the font standards will adopt it.
> You may still want to register your encoding in the IANA registry, but
> you'll need to pass the RFC validation. And there are lots of technical
> details missing in your proposal so that it can work for supporting it with
> a standard mapping in fonts.
>
> There is better chance for you to pomote it only as a transliteration
> scheme, or as an input method for leyboard layout (both are also not in the
> scope of the Unicode and ISO/ISC 10646 standards though, they could be in
> the scope of the CLDR project, which is not by itself a standard but just a
> repository of data, supported by a few standards)... Think about it.
>
>
>
> 2014-03-16 5:12 GMT+01:00 Naena Guru <naenaguru at gmail.com>:
>
>> I made a presentation demonstrating Dual-script Singhala at National
>> Science Foundation of Sri Lanka. Most of the attendees were government
>> employees and media representatives; a few private citizens came too.
>>
>> Dual-script Singhala means romanized Singhala that can be displayed
>> either in the Latin script or in the Singhala script using an Orthographic
>> Smart Font. It is easy to input (phonetically) using a keyboard layout
>> slightly altered from QWERTY. The font uses Standard Ligature feature
>> <liga> of OpenType / OpenFont standard to display glyphs of Sanskrit
>> ligatures as well as many Singhala letters. The font is supported across
>> all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala
>> is the proper and complete solution on the computer for the Singhala script
>> used to write Singhala, Sanskrit and Pali languages. The same solution can
>> be applied for all Indic languages.
>>
>> The government ministries, media and people welcomed it with enthusiasm
>> and relief that there is something practical for Singhala. The response in
>> the country was singularly positive, except for the person that
>> filibustered the Q&A session of the presentation that spoke about the hard
>> work done on Unicode Sinhala, clearly outside the subject matter of the
>> presentation.
>>
>> The result of the survey passed around was 100% as below (translated from
>> Singhala):
>>
>>    1. I believe that Dual-script Singhala is convenient to me as it is
>>    implemented similar to English - Yes
>>    2. Today everyone uses Unicode Sinhala. It is easy and has no
>>    problems - No
>>    3. The cost of Unicode Sinhala should be eliminated by switching to
>>    Dual-scrip Singhala - Yes
>>    4. We should amend Pali text in the Tripitaka according to rulings of
>>    SLS1134 - No
>>    5. Digitizing old books is a very important thing - Yes
>>    6. We should focus on making this easy-to-use Dual-script Singhala
>>    method a standard - Yes
>>
>> Please comment or send questions.
>>
>>
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140316/8b5f33e2/attachment.html>

From marc at keyman.com  Sun Mar 16 21:52:19 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 02:52:19 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy1Uhw8tWLBp8npL2xNOs=MDFXymsN+Ek8ypaAQWs39qxg@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <CAGa7JC31Ehcpj291iVJqfnMNbNBF4GiFp7tLdwTh9ALTNcWX3g@mail.gmail.com>
 <CAHK3Hy1Uhw8tWLBp8npL2xNOs=MDFXymsN+Ek8ypaAQWs39qxg@mail.gmail.com>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD0DF80@federation.tavultesoft.local>

Naena,

If you have an encoding which is easy to type, that can be replicated with Keyman, or any number of other input systems, for Unicode Singhala.  Input is not tied to encoding.  I would be happy to assist you, off-list, to develop an input method for Unicode Singhala that works according to your requirements.

However, if you have examples of Singhala which cannot be represented in Unicode, please do bring these to the attention of this list.  But differences in input method are not really relevant.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru
Sent: Monday, 17 March 2014 1:44 PM
To: Philippe Verdy
Cc: jc at ahangama.com; Unicode List
Subject: Re: Romanized Singhala got great reception in Sri Lanka

Philippe:

All you said about ISCII is probably right. So, it has given you guys a lot of pain. I did not do it nor followed it.

As for Japanese (and also for Indic) I have read the warnings in RFC 1815:
http://tools.ietf.org/rfc/rfc1815.txt

I am not creating a transcoding table as you say. I assume you think I take Unicode Sinhala to be a legitimate encoding for Singhala that I am mapping to SBCS for the love of SBCS. No. And I don't know what concepts I am mixing. I am trained in Computer Science, I have taught it at college level, and have done years of consulting work and written project proposals for a pretty good size one for the Federal Government too.

I believe that you need to understand the problem at hand to find a solution for it. You cannot make solutions for Indic not knowing Indic. Starting blindly with ISCII was a mistake. It is useless at least for Singhala.

========= STORY OF UNICODE SINHALA ==========
The first draft for the Sinhala chart was handwritten by Andy Daniels. He mentioned some doubts about some letters in it. He had a good instinct on that. It sat there people wondering from where he got his information. He said from Germany. Someone said that it came from a $300 book. I suspect that it is Rev. Fr. A.M.Gunasekara's book (1891).

Then came the Lion of Unicode Michael Everson (down in this thread). He was making fonts by the dozen and took Daniels' draft certified the letters side of it, not having a nicely printed set of the digits. This certificate was countersigned by a Mettavihari for users. I know Ven. Mettavihari. He is a Danish man that researched and put up the most comprehensive Tripitaka, the Buddhist canon. This irreproachable man denies that he endorsed the standard on behalf of the Singhalese saying obviously he is not Singhalese. (Actually, I think he is more Singhalese than me). Who signed as him, a forgery?

When the code chart came to Lanka, the closest to a computer that they knew was the IBM Selectric typewriter. When they did not do anything about it, the World Bank offered a $83 scheme to bring Lanka to the computer age all the way so the village fellow could communicate with the government online. They set up the IT agency ICTA and got the academics gathered there doing 'projects'. They even paid a fellow to come over and read the OpenType specification for them. I understand that the kingpin of the operations there is one person that studied in US.He is the adviser to the President, The top Colombo University and the ICTA itself. He is one consultant that does most projects.

When Everson wanted to add the digits apparently finding Fr. Gunasekara's book, the Lankans denied such existed. When he showed them, they said they are not necessary. Now this everybody's consultant announced at my presentation that they are going to add them.
============ END STORY OF UNICODE SINHALA ==========

BAD UNICODE SINHALA:
Unicode Singhala violates Singhala / Sanskrit grammar. Unicode Singhala is not compatible with Sanskrit, an integral part of the Singhala script. That also applies to Pali whose native script is Singhala. Unicode Sinhala further helps kill Singhala by making it very difficult to type and impossible to obtain the entire repertoire of letters and limiting the applications and OSs that it can be used in.

Typing Unicode Sinhala requires you to learn a key map that is entirely different from the familiar English keyboard, while losing some marks and signs too. There is a program called Helabasa by Keyman typing system that printers use to type it. There is a physical keyboard too. Then there is Google transliteration - very inadequate and another one by Colombo University found on a web page. These last two allow you to type phonetically but not entirely. The result is very few people type Unicode Singhala, only those that their job requires them to type Unicode Singhala.

PERFECT ROMANIZED SINGHALA
I did the same thing English and Western European languages did; very close. I mapped the well-known 58+2 Singhala-Sanskrit phonemes in the SBCS. The reason is because then Singhala gets to use all those applications perfected over decades that most here Westerners enjoy. That set covers all letters necessary for Singhala, Sanskrit and  Pali, the three languages that use the Singhala script.

See it here displayed using the first orthographic smartfont:
http://lovatasinhala.com/

MORE READING:
Let's look at this as a lay person (whose interest is our ultimate goal) sees:

English was fully romanized from fu?ark by about 600 AD. Romanizing is writing by using letters of the Latin alphabet plus many, many others added to it. All Europeans when they became fully Christianized / literate, they all adopted Latin letters and extended them as they pleased. This set has branched off as Latin script and Cyrillic script. Printing industry standardized the greater part of the alphabets.

Singhala has a well defined phoneme chart called hodiya. It is an extension of the Sanskrit hodiya. Rev. Fr. Theodore G. Perera's grammar book (1932) and Rev. Fr. A. M. Gunasekera's book (1891) that dug up sinking Singhala fully describe the writing system. Like most other languages, including English before printing arrived in England, it is written phonetically.

Singhala was romanized first in 1860s by Rhys Davids, called PTS scheme, to print Pali (Magadhi) in the Latin script. This requires letters with bars (macron) and dots not found in common fonts. This scheme is called PTS Pali. It is similar to IAST Sanskrit. It is impossible to type these on the regular keyboard.

I freshly romanized Singhala by mapping its phonemes to the SAME area 13 Western European languages mapped their alphabetic letters within the following Unicode code charts:
http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf

So, if that is "creating a transcoding table" all Europeans did it and I do it too.

On Sun, Mar 16, 2014 at 12:36 AM, Philippe Verdy <verdy_p at wanadoo.fr<mailto:verdy_p at wanadoo.fr>> wrote:
Don't you realize that what you are trying to create is completely out of topic of Unicode, as it is simply another new 8-bit encoding similar to what ISCII does for supporting multiple Indic scripts with a common encoding/transcoding table?

The ISCII standard has shown its limitations, it cannot be enough to support all scripts correctly and completely, it has lots of unsolved ambiguities for tricky cases or historic orthographies, or newer orthographies, that the UCS encoding better supports due to its larger character set and more precise character properties and algorithms.

You are in fact creating a transcoding table... Except that you are mixing the concepts; and the Unicode and ISO technical commitees working on the UCS don"t need to handle new 8-bit encodings. And you'll soon experiment the same problems as in ISCII and all other legacy 8-bit encodings: very poor INTEROPERABILITY due to version tracking or complax contextual rules...

You may still want to promote it at some government or education institution, in order to promote it as a national standard, except that there's little change it will ever happen when all countries in ISO have stopoed working on standardization of new 8-bit encodings (only a few ones are maintained; but these are the most complex ones used in China and Japan.

Well in fact only Japan now seens to be actively updating its legacy JIS standard; but only with the focus of converging it to use the UCS and solve ambiguities or solve some technical problems (e.g. with emojis used by mobile phone operators). Even China stopped updating its national standard by publishing a final mapping table to/from the full UCS (including for characters still not encoded in the UCS): this simplified the work because only one standard needs to be maintained instead of 2.

Note that as long there will not be any national standard supporting your proposed encodng, there is no chance that the font standards will adopt it. You may still want to register your encoding in the IANA registry, but you'll need to pass the RFC validation. And there are lots of technical details missing in your proposal so that it can work for supporting it with a standard mapping in fonts.

There is better chance for you to pomote it only as a transliteration scheme, or as an input method for leyboard layout (both are also not in the scope of the Unicode and ISO/ISC 10646 standards though, they could be in the scope of the CLDR project, which is not by itself a standard but just a repository of data, supported by a few standards)... Think about it.


2014-03-16 5:12 GMT+01:00 Naena Guru <naenaguru at gmail.com<mailto:naenaguru at gmail.com>>:
I made a presentation demonstrating Dual-script Singhala at National Science Foundation of Sri Lanka. Most of the attendees were government employees and media representatives; a few private citizens came too.

Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font. It is easy to input (phonetically) using a keyboard layout slightly altered from QWERTY. The font uses Standard Ligature feature <liga> of OpenType / OpenFont standard to display glyphs of Sanskrit ligatures as well as many Singhala letters. The font is supported across all OSs: Windows, Macintosh, Linux, iOS and Android. Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages. The same solution can be applied for all Indic languages.

The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation.

The result of the survey passed around was 100% as below (translated from Singhala):

  1.  I believe that Dual-script Singhala is convenient to me as it is implemented similar to English - Yes
  2.  Today everyone uses Unicode Sinhala. It is easy and has no problems - No
  3.  The cost of Unicode Sinhala should be eliminated by switching to Dual-scrip Singhala - Yes
  4.  We should amend Pali text in the Tripitaka according to rulings of SLS1134 - No
  5.  Digitizing old books is a very important thing - Yes
  6.  We should focus on making this easy-to-use Dual-script Singhala method a standard - Yes
Please comment or send questions.


_______________________________________________
Unicode mailing list
Unicode at unicode.org<mailto:Unicode at unicode.org>
http://unicode.org/mailman/listinfo/unicode


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/c7f43607/attachment.html>

From doug at ewellic.org  Sun Mar 16 22:11:26 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 16 Mar 2014 21:11:26 -0600
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID: <8AA652F8D9B2449A8EC7136E4972283E@DougEwell>

Naena Guru <naenaguru at gmail dot com> wrote:

> As for Japanese (and also for Indic) I have read the warnings in RFC
> 1815:
> http://tools.ietf.org/rfc/rfc1815.txt

That explains everything.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From jf at colson.eu  Sun Mar 16 23:16:36 2014
From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=)
Date: Mon, 17 Mar 2014 05:16:36 +0100
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy1Uhw8tWLBp8npL2xNOs=MDFXymsN+Ek8ypaAQWs39qxg@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <CAGa7JC31Ehcpj291iVJqfnMNbNBF4GiFp7tLdwTh9ALTNcWX3g@mail.gmail.com>
 <CAHK3Hy1Uhw8tWLBp8npL2xNOs=MDFXymsN+Ek8ypaAQWs39qxg@mail.gmail.com>
Message-ID: <53267724.5050406@colson.eu>


> As for Japanese (and also for Indic) I have read the warnings in RFC 1815:
> http://tools.ietf.org/rfc/rfc1815.txt
>
>

RFC 1815       Character Sets ISO-10646 and ISO-10646-J-1      July 1995

July 1995? Is that document up-to-date?


From jf at colson.eu  Sun Mar 16 23:32:37 2014
From: jf at colson.eu (=?windows-1252?Q?Jean-Fran=E7ois_Colson?=)
Date: Mon, 17 Mar 2014 05:32:37 +0100
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <201403170102.s2H124d5019796@unicode.org>
References: <201403170102.s2H124d5019796@unicode.org>
Message-ID: <53267AE5.9010106@colson.eu>

>
> > Some fonts don't display this correctly; they show the macron partially
> > or completely to the right of the base letter, instead of directly 
> below
> > it. The solution is to use another font, and to ask font vendors to fix
> > this combination so it looks decent.

?(2) Fonts are much harder to create. Instead of just needing a graphic 
designer to draw characters, you now need to a programmer as well, who 
understands OpenType tables. [?] Again, HackAscii wins.?


From doug at ewellic.org  Mon Mar 17 01:01:13 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 00:01:13 -0600
Subject: Romanized Singhala got great reception in Sri Lanka
Message-ID: <BC08048FF1A849CDAE6E06C8A2E94635@DougEwell>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

> RFC 1815 Character Sets ISO-10646 and ISO-10646-J-1 July 1995
>
> July 1995? Is that document up-to-date?

It was obsolete at the time of publication. RFC 1815 was a rant by 
someone who thought that:

(a) Unicode was fatally broken for representing Japanese because of Han 
unification, unlike ISO-2022-JP which by definition was used only for 
Japanese; and

(b) display is everything, and all characters not represented by a glyph 
in a Windows NT 3.51 font from 1995 ought to be excluded from 
interchange.

Sounds to me a lot like the present campaign.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From everson at evertype.com  Mon Mar 17 04:47:02 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:47:02 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
Message-ID: <5C8924B5-493D-44E8-A74A-37F5057F6E0F@evertype.com>

On 16 Mar 2014, at 23:47, Doug Ewell <doug at ewellic.org> wrote:

> Jean-Fran?ois Colson <jf at colson dot eu> wrote:
> 
>> The idea here was ?that characters not on an ordinary QWERTY keyboard could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any dead keys on an _ordinary_ (i.e. not one using an international(ized) driver) QWERTY keyboard?
> 
> Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS.

Please distinguish between ?keyboard? which is a piece of hardware and ?keyboard layout? which is a software input method. 

Michael Everson * http://www.evertype.com/


From everson at evertype.com  Mon Mar 17 04:48:09 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:48:09 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
References: <mailman.1.1394967601.22364.ietf-languages@alvestrand.no>
 <61D9B397AD154740A01A99C34E39DB2A@DougEwell>
 <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
Message-ID: <EB8E1454-D6B2-4BDE-BE13-EC5156F6A6A0@evertype.com>

On 17 Mar 2014, at 00:51, Richard BUDELBERGER <budelberger.richard at wanadoo.fr> wrote:

> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for compatibility ??

No.

> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?.

Yes, there is. 

Michael Everson * http://www.evertype.com/


From everson at evertype.com  Mon Mar 17 04:51:00 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 17 Mar 2014 09:51:00 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy3a23FU=mcsJ_z3asLrG-PaY+VpSo329ffFYa9+tmm5gg@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com>
 <CAHK3Hy3a23FU=mcsJ_z3asLrG-PaY+VpSo329ffFYa9+tmm5gg@mail.gmail.com>
Message-ID: <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com>

On 17 Mar 2014, at 02:48, Naena Guru <naenaguru at gmail.com> wrote:

> You are talking about something you do not know. I am a Singhalese.

That doesn?t give you any special knowledge or privilege. I know many people from Sri Lanka, who work in the area of computing, and who work with the Sinhala characters as encoded in the UCS. 

And really. ?The Lion of Unicode??

My stars.

Michael Everson

> On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson <everson at evertype.com> wrote:
> 
> On 16 Mar 2014, at 04:12, Naena Guru <naenaguru at gmail.com> wrote:
> 
>> Dual-script Singhala means romanized Singhala that can be displayed either in the Latin script or in the Singhala script using an Orthographic Smart Font?
> 
> What a terrible, terrible idea. You are essentially promoting giving up writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack.
> 
>> Dual-script Singhala is the proper and complete solution on the computer for the Singhala script used to write Singhala, Sanskrit and Pali languages.
> 
> No, it isn?t. It?s a huge step backwards, unless you propose abolishing the Sinhala script entirely and just writing in Latin.
> 
>> The government ministries, media and people welcomed it with enthusiasm and relief that there is something practical for Singhala. The response in the country was singularly positive, except for the person that filibustered the Q&A session of the presentation that spoke about the hard work done on Unicode Sinhala, clearly outside the subject matter of the presentation.
> 
> That person understood the nature of data integrity. As does everyone who cares about the Universal Character Set.

Michael Everson * http://www.evertype.com/


From wjgo_10009 at btinternet.com  Mon Mar 17 05:36:53 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 17 Mar 2014 10:36:53 +0000 (GMT)
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <5325EA24.60503@ix.netcom.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <5325EA24.60503@ix.netcom.com>
Message-ID: <1395052613.21848.YahooMailNeo@web87806.mail.ir2.yahoo.com>

>> Could this be achieved?

> It's software. What do you think?

> :)

Well, it is not entirely software as there seems to be discussion about whether Unicode is regarded as a good encoding for the required purpose.

I do not understand those issues at present but I am interested to learn.

I have produced a format for a transliteration file in case that will help.

Is this format of help?

The format of the translit.dat file used for transliteration

This is a thought experiment at present.

Automated transliteration would be by having a file translit.dat available. In the thought experiment the file is a UTF-16 text file, such as can be saved from the WordPad program by selecting saving as a Unicode Text Document.

The translit.dat file would consist of a number of lines of text.

A valid line of text would have one of three possible formats.

If the first character of the line is an asterisk, then the line is a comment.

If the first character of the line is a PERCENT SIGN then the line is the last line of the file.

Otherwise the line is intended to be a transliteration line, yet only is a transliteration line if it is of the correct structure.

The correct structure for a transliteration line is as follows.

One or more characters that are not the VERTICAL LINE character.

A VERTICAL LINE character.

One or more characters that are not the VERTICAL LINE character.

The possibility was considered that on some software platforms that there might be complications, while reading characters from the translit.dat file, regarding detecting the end of the translit.dat file.

If the first character of the line is a PERCENT SIGN then the line is the last line of the file.

In a translit.dat file produced as a Unicode Text Document saved from the WordPad program, lines are separated by two characters, namely CARRIAGE RETURN and LINE FEED, in that order. That is, pressing the return key on the keyboard produces two characters in a Unicode Text Document saved from the WordPad program.

The final five characters of the translit.dat file are here specified to be as follows.

CARRIAGE RETURN
LINE FEED
PERCENT SIGN
CARRIAGE RETURN
LINE FEED

This is achieved using WordPad by pressing the return key both before and after the PERCENT SIGN has been entered.

It is noted that a Unicode Text Document saved from the WordPad program stores the two bytes of each character with the lower byte before the higher byte.

It is noted that a Unicode Text Document saved from the WordPad program starts with a U+FEFF character, used as a BYTE ORDER MARK. Thus the first two bytes of a translit.dat file do not represent a character used in the automated transliteration process.

It is noted that for English and for some other languages that a Unicode Text Document saved from the WordPad program has many bytes that have a value of zero. However, the use of a Unicode Text Document saved from the WordPad program is deliberately chosen for this system so as to make participation in producing a translit.dat file as straightforward as possible, and with the hope that software developed for automated transliteration using this system will work for all languages that can be represented using Unicode characters.

William Overington

17 March 2014


From moyogo at gmail.com  Mon Mar 17 08:48:39 2014
From: moyogo at gmail.com (Denis Jacquerye)
Date: Mon, 17 Mar 2014 13:48:39 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <EB8E1454-D6B2-4BDE-BE13-EC5156F6A6A0@evertype.com>
References: <mailman.1.1394967601.22364.ietf-languages@alvestrand.no>
 <61D9B397AD154740A01A99C34E39DB2A@DougEwell>
 <1392935375.16775.1395017498429.JavaMail.www@wwinf1h13>
 <EB8E1454-D6B2-4BDE-BE13-EC5156F6A6A0@evertype.com>
Message-ID: <CAJKta0yqs2-pCxdGujJKa0E0g6Z71D4yHn4Owkq9vcO=QgVCEw@mail.gmail.com>

The Syriac in Greek script shown in
http://www.bethmardutho.org/index.php/hugoye/volume-index/585.html
(which the fr.wiktionary.org articles are citing) has underlined chi
and underlined sigma, not chi macron below or sigma macron below.
See page 48: ?For characters not found in the Greek alphabet,
underlining is employed: the he is represented by underlined ch, and
shin by underlined sigma.?
Given what was mentioned so far here, one might assume this could be a
macron below with a specific positioning instead of underline, but
that would be just that: assumptions.

It would be interesting the see original documents to have a better
idea of what these should look like.

On Mon, Mar 17, 2014 at 9:48 AM, Michael Everson <everson at evertype.com> wrote:
> On 17 Mar 2014, at 00:51, Richard BUDELBERGER <budelberger.richard at wanadoo.fr> wrote:
>
>> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for compatibility ??
>
> No.
>
>> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?.
>
> Yes, there is.
>
> Michael Everson * http://www.evertype.com/
>

-- 
Denis Moyogo Jacquerye


From doug at ewellic.org  Mon Mar 17 09:04:26 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 08:04:26 -0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <mailman.64.1395064190.14205.unicode@unicode.org>
References: <mailman.64.1395064190.14205.unicode@unicode.org>
Message-ID: <289A5852AC2342BABF18C045E4157A90@DougEwell>

Michael Everson <everson at evertype dot com> wrote:

>>> The idea here was that characters not on an ordinary QWERTY keyboard
>>> could be entered _using_an_ordinary_QWERTY_keyboard._ Are there any
>>> dead keys on an _ordinary_ (i.e. not one using an
>>> international(ized) driver) QWERTY keyboard?
>>
>> Not on the standard vanilla U.S. keyboard. It has to be provided by
>> the OS, via a driver, just as Compose key support has to be provided
>> by the OS.
>
> Please distinguish between "keyboard" which is a piece of hardware and
> "keyboard layout" which is a software input method.

Sorry for the shorthand. Everything I am talking about is software. I 
don't think there is such a thing as a physical dead key on a computer 
keyboard. The Compose key on *nix systems may be a physical key, but it 
doesn't have any special ability to compose characters unless given that 
ability by software.

"An ordinary QWERTY keyboard," as Jean-Fran?ois put it, can generate any 
character, Latin or Sinhala or whatever, so long as the hardware has the 
right software behind it.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From doug at ewellic.org  Mon Mar 17 09:16:16 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 08:16:16 -0600
Subject: Unicode : Greek Extended.
In-Reply-To: <mailman.64.1395064190.14205.unicode@unicode.org>
References: <mailman.64.1395064190.14205.unicode@unicode.org>
Message-ID: <27F8AD416E4D4EA9A68649603FCA8216@DougEwell>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

>> Some fonts don't display this correctly; they show the macron
>> partially or completely to the right of the base letter, instead of
>> directly below it. The solution is to use another font, and to ask
>> font vendors to fix this combination so it looks decent.
>
> "(2) Fonts are much harder to create. Instead of just needing a
> graphic designer to draw characters, you now need to a programmer as
> well, who understands OpenType tables. [?] Again, HackAscii wins."

Richard wasn't suggesting HackAscii in this case. He was suggesting 
newly encoded precomposed characters.

[Richard's original post was to the ietf-languages list, perhaps because 
he isn't signed up for the Unicode list. I replied privately, but 
Richard's subsequent response was cc'd to several people and also to the 
Unicode list. The response didn't make it to the list, but my reply 
(sent from my phone, where I didn't notice all of Richard's recipients) 
did. So I think we can close this by pointing out, again, that no new 
precomposed characters of the form "existing base + existing combining 
character" will be encoded, ever.]

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From petercon at microsoft.com  Mon Mar 17 10:36:29 2014
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 17 Mar 2014 15:36:29 +0000
Subject: [private] Re: Unicode : Greek Extended.
In-Reply-To: <53267AE5.9010106@colson.eu>
References: <201403170102.s2H124d5019796@unicode.org>
 <53267AE5.9010106@colson.eu>
Message-ID: <be528dd5567a4158b5738c94c90f9b0e@BL2PR03MB450.namprd03.prod.outlook.com>

Font tables to position diacritics are not "much harder to create" than anything else involved in font development, and certainly don't require being a programmer. Hinting is harder than positioning tables and does literally involve programming, though I don't hear font developers griping about that. Professional font developers are not quite the luddites the comment suggests.


Petr

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Jean-Fran?ois Colson
Sent: March 16, 2014 9:33 PM
To: unicode at unicode.org
Subject: Re: [private] Re: Unicode : Greek Extended.

>
> > Some fonts don't display this correctly; they show the macron 
> > partially or completely to the right of the base letter, instead of 
> > directly
> below
> > it. The solution is to use another font, and to ask font vendors to 
> > fix this combination so it looks decent.

"(2) Fonts are much harder to create. Instead of just needing a graphic designer to draw characters, you now need to a programmer as well, who understands OpenType tables. [.] Again, HackAscii wins."

_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode


From budelberger.richard at wanadoo.fr  Mon Mar 17 10:23:25 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Mon, 17 Mar 2014 16:23:25 +0100 (CET)
Subject: Unicode : Greek Extended.
Message-ID: <30528926.15541.1395069805050.JavaMail.www@wwinf1n09>

From: Denis Jacquerye  
Date: Mon, 17 Mar 2014 13:48:39 +0000
> The Syriac in Greek script shown in 
> http://www.bethmardutho.org/index.php/hugoye/volume-index/585.html 
> (which the fr.wiktionary.org articles are citing) has underlined chi 
> and underlined sigma, not chi macron below or sigma macron below. 
> See page 48: ?For characters not found in the Greek alphabet, 
> underlining is employed: the he is represented by underlined ch, and 
> shin by underlined sigma.? 
> Given what was mentioned so far here, one might assume this could be a 
> macron below with a specific positioning instead of underline, but 
> that would be just that: assumptions. 

See now http://tinyurl.com/py72z7w !?

> It would be interesting the see original documents to have a better 
> idea of what these should look like. 

I wasn?t able to find anything about this Mgr Butrus Gemayel?s book?!

On Mon, Mar 17, 2014 at 9:48 AM, Michael Everson  wrote: 
> On 17 Mar 2014, at 00:51, Richard BUDELBERGER  wrote: 
> 
>> So if I ask somebody to create and sell a character set with ? ? ? ? ? ? ? ? ? ?? ?? ?? and their capitals, Unicode should add them (?? ?? ?? and H? ?? ??) for 
compatibility ?? 
> 
> No. 
> 
>> There was the same problem with Yoruba letters with U+0329 ? ? ?, and with the new Marshallese alphabet with U+0326 ? ? ?. 
> 
> Yes, there is. 

See http://www.unicover.com/ecatimag/MI-C309-.jpg & https://fr.wiktionary.org/wiki/Category:marshallais !

> Michael Everson * http://www.evertype.com/ 
> 
-- 
Denis Moyogo Jacquerye
_______________________________________________


From naenaguru at gmail.com  Mon Mar 17 11:08:19 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 11:08:19 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
Message-ID: <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>

Making a keyboard is not hard. You can either edit an existing one or make
one from scratch. I made the latest Romanized Singhala one from scratch.
The earlier one was an edit of US-International.

When you type a key on the physical keyboard, you generate what is called a
scan-code of that key so that the keyboard driver knows which key was
pressed. (During DOS days, we used to catch them to make menus.) Now, you
assign one or a sequence of Unicode characters you want to generate for the
keypress.

Use Microsft's keyboard layout creator for all versions of Windows from XP:
http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx

Select the language carefully. I selected US-English for RS. That way, I
can switch between the two keyboards quickly with Ctrl+Shift. You can
change all these in the Control Panel.

Here is the keymap I made for RS in Linux:
http://ahangama.com/apiapi/singhala/linuxkb-s.php
Just scroll down for the English part. (The lines starting with double
slashes are comments and have no effect on the program)

The Macintosh key layout is easy too.

The story with iOS and Android are different but not hard either.


On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell <doug at ewellic.org> wrote:

> Jean-Fran?ois Colson <jf at colson dot eu> wrote:
>
>  The idea here was ?that characters not on an ordinary QWERTY keyboard
>> could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any
>> dead keys on an _ordinary_ (i.e. not one using an international(ized)
>> driver) QWERTY keyboard?
>>
>
> Not on the standard vanilla U.S. keyboard. It has to be provided by the
> OS, via a driver, just as Compose key support has to be provided by the OS.
>
> The standard vanilla U.S. keyboard also doesn't provide the accented
> letters and other non-ASCII letters like ? that Naena Guru uses for his
> font hack.
>
>  If a character is available by a dead key, isn?t it on the keyboard ?
>>
>
> It depends on what you mean by "on the keyboard." Thanks to John Cowan's
> delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to
> get the fraction ? (one-seventh). That character is not "on the keyboard"
> in any sense other than what the driver provides.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ?
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/1455ef61/attachment.html>

From doug at ewellic.org  Mon Mar 17 11:38:47 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 09:38:47 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net>

Naena Guru <naenaguru at gmail dot com> wrote:

> Making a keyboard [layout] is not hard. You can either edit an
> existing one or make one from scratch. I made the latest Romanized
> Singhala one from scratch. The earlier one was an edit of US-
> International.

I've made a couple dozen of them myself, with MSKLC.

> When you type a key on the physical keyboard, you generate what is
> called a scan-code of that key so that the keyboard driver knows which
> key was pressed. (During DOS days, we used to catch them to make
> menus.) Now, you assign one or a sequence of Unicode characters you
> want to generate for the keypress.

Precisely. As Marc Durdin said, you can create a keyboard layout just as
easily for Unicode characters as for ASCII and Latin-1 characters. You
can also assign a combination of characters to a single key.

So it is not true that "typing Unicode Sinhala requires you to learn a
key map that is entirely different from the familiar English keyboard,
while losing some marks and signs too." Unicode does not prescribe any
key map. You can have whatever layout you like.

As Marc also said, if you think there are "marks and signs" missing from
Unicode, that is another matter.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From naenaguru at gmail.com  Mon Mar 17 13:21:06 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 13:21:06 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <2B13ACD0-5738-4F2C-A66D-557AF04FAA9B@evertype.com>
 <CAHK3Hy3a23FU=mcsJ_z3asLrG-PaY+VpSo329ffFYa9+tmm5gg@mail.gmail.com>
 <058546F9-9C74-47C3-BA4E-35E545BFB4DE@evertype.com>
Message-ID: <CAHK3Hy3EVf2LFVkj2hguj0-koF4ndXid9Cgc-H6iwhmkE8wttg@mail.gmail.com>

You have lot of stars, my friend. You ARE the Lion. Roar!!

You have lot of friends in Sri Lanka, indeed, and one thanked you highly
for the service you did for the Language too. Good for you. However, he did
not know the way to the Public Library or the nearest Buddhist temple or
the Christian Church that would have enlightened him on Singhala writing
system, so you could help you make meaningful proposal than copying
something that somebody else forwarded with a question. Did any one of them
show you Rev. Fr. A. M Gunasekara's book? No!

Who signed Mettavihari purportedly for the Singhala user group?


On Mon, Mar 17, 2014 at 4:51 AM, Michael Everson <everson at evertype.com>wrote:

> On 17 Mar 2014, at 02:48, Naena Guru <naenaguru at gmail.com> wrote:
>
> > You are talking about something you do not know. I am a Singhalese.
>
> That doesn't give you any special knowledge or privilege. I know many
> people from Sri Lanka, who work in the area of computing, and who work with
> the Sinhala characters as encoded in the UCS.
>
> And really. "The Lion of Unicode"?
>
> My stars.
>
> Michael Everson
>
> > On Sun, Mar 16, 2014 at 6:18 AM, Michael Everson <everson at evertype.com>
> wrote:
> >
> > On 16 Mar 2014, at 04:12, Naena Guru <naenaguru at gmail.com> wrote:
> >
> >> Dual-script Singhala means romanized Singhala that can be displayed
> either in the Latin script or in the Singhala script using an Orthographic
> Smart Font...
> >
> > What a terrible, terrible idea. You are essentially promoting giving up
> writing Sinhala, in favour of a slightly-bigger-than-ASCII Latin font hack.
> >
> >> Dual-script Singhala is the proper and complete solution on the
> computer for the Singhala script used to write Singhala, Sanskrit and Pali
> languages.
> >
> > No, it isn't. It's a huge step backwards, unless you propose abolishing
> the Sinhala script entirely and just writing in Latin.
> >
> >> The government ministries, media and people welcomed it with enthusiasm
> and relief that there is something practical for Singhala. The response in
> the country was singularly positive, except for the person that
> filibustered the Q&A session of the presentation that spoke about the hard
> work done on Unicode Sinhala, clearly outside the subject matter of the
> presentation.
> >
> > That person understood the nature of data integrity. As does everyone
> who cares about the Universal Character Set.
>
> Michael Everson * http://www.evertype.com/
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/6052ffda/attachment.html>

From marc at keyman.com  Mon Mar 17 15:33:56 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 20:33:56 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>

I disagree.  Making a basic keyboard layout is not hard, just like making a font without OpenType support is not that hard.   Making a keyboard layout that doesn?t force users to learn the nuances of the encoding of a script is more of a challenge, and making a high quality keyboard layout that is consistent, easy to use, and efficient is anything but straightforward.  Most keyboard layouts fail at one of these.

The story for touch device input is even more challenging.  Not being constrained to a physical set of keys increases your flexibility.  The big challenge is usually the size of the display on mobile-sized devices.

Regarding keyboard design:

?         Scan make/break codes are not really relevant to Windows keyboards ? Windows has an abstraction layer of ?virtual key? codes, for better or worse.

?         Selecting US-English for a non-English keyboard means that all language tools will break with your text.  Spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more.  That?s a big price to pay.  Conversely, selecting Singhala for your Romanised non-Unicode encoding will break spell checking, grammar checking, automatic keyboard selection, autocorrect, font selection and more.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru
Sent: Tuesday, 18 March 2014 3:08 AM
To: Doug Ewell
Cc: UnicoDe List
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

Making a keyboard is not hard. You can either edit an existing one or make one from scratch. I made the latest Romanized Singhala one from scratch. The earlier one was an edit of US-International.

When you type a key on the physical keyboard, you generate what is called a scan-code of that key so that the keyboard driver knows which key was pressed. (During DOS days, we used to catch them to make menus.) Now, you assign one or a sequence of Unicode characters you want to generate for the keypress.

Use Microsft's keyboard layout creator for all versions of Windows from XP:
http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx

Select the language carefully. I selected US-English for RS. That way, I can switch between the two keyboards quickly with Ctrl+Shift. You can change all these in the Control Panel.

Here is the keymap I made for RS in Linux:
http://ahangama.com/apiapi/singhala/linuxkb-s.php
Just scroll down for the English part. (The lines starting with double slashes are comments and have no effect on the program)

The Macintosh key layout is easy too.

The story with iOS and Android are different but not hard either.


On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell <doug at ewellic.org<mailto:doug at ewellic.org>> wrote:
Jean-Fran?ois Colson <jf at colson dot eu> wrote:
The idea here was ?that characters not on an ordinary QWERTY keyboard
could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any
dead keys on an _ordinary_ (i.e. not one using an international(ized)
driver) QWERTY keyboard?

Not on the standard vanilla U.S. keyboard. It has to be provided by the OS, via a driver, just as Compose key support has to be provided by the OS.

The standard vanilla U.S. keyboard also doesn't provide the accented letters and other non-ASCII letters like ? that Naena Guru uses for his font hack.
If a character is available by a dead key, isn?t it on the keyboard ?

It depends on what you mean by "on the keyboard." Thanks to John Cowan's delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to get the fraction ? (one-seventh). That character is not "on the keyboard" in any sense other than what the driver provides.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ?
_______________________________________________
Unicode mailing list
Unicode at unicode.org<mailto:Unicode at unicode.org>
http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/9a0dbbad/attachment.html>

From marc at keyman.com  Mon Mar 17 15:36:42 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 20:36:42 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <289A5852AC2342BABF18C045E4157A90@DougEwell>
References: <mailman.64.1395064190.14205.unicode@unicode.org>
 <289A5852AC2342BABF18C045E4157A90@DougEwell>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD141BB@federation.tavultesoft.local>

In the modern PC world, the physical keyboard generates scan codes, and these are not tied to what is printed on the key cap.  Dead keys and modifiers are implemented in software.  But key repeat is implemented in hardware.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Tuesday, 18 March 2014 1:04 AM
To: unicode at unicode.org
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

Michael Everson <everson at evertype dot com> wrote:

>>> The idea here was that characters not on an ordinary QWERTY keyboard 
>>> could be entered _using_an_ordinary_QWERTY_keyboard._ Are there any 
>>> dead keys on an _ordinary_ (i.e. not one using an
>>> international(ized) driver) QWERTY keyboard?
>>
>> Not on the standard vanilla U.S. keyboard. It has to be provided by 
>> the OS, via a driver, just as Compose key support has to be provided 
>> by the OS.
>
> Please distinguish between "keyboard" which is a piece of hardware and 
> "keyboard layout" which is a software input method.

Sorry for the shorthand. Everything I am talking about is software. I don't think there is such a thing as a physical dead key on a computer keyboard. The Compose key on *nix systems may be a physical key, but it doesn't have any special ability to compose characters unless given that ability by software.

"An ordinary QWERTY keyboard," as Jean-Fran?ois put it, can generate any character, Latin or Sinhala or whatever, so long as the hardware has the right software behind it.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 

_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode


From naenaguru at gmail.com  Mon Mar 17 18:18:50 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 18:18:50 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
Message-ID: <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>

Romanized to Unicode? Romanizing is inside Unicode. English and all Western
European languages also use Unicode.

Romanized Singhala resides in Latin 1 character set, that is between U+0020
to U+00FF.
Unicode Singhala resides in the range U+0D80 to U+0DFF

There is no difference between RS and those languages except the users live
in an island far away from those others. Is there some reason you want to
convert romanized Singhala to Unicode Singhala, a terrible specification
that is already corrupting the language? I spoke to serious users such as
journalists and teachers just few weeks back. It is unfortunate that you
guys are still hanging on to it. Why?

The proof-of-concept font I made has glyph substitutions. That is how it
can apply an orthography.

Unicode Singhala is completely a botched work. It has vowels each with two
codes, one for stand-alone and one for its sign. Each consonant is
considered as having the embedded (intrinsic!) vowel a. I is not a
consonant, people. Then it has two ligatures included as basic consonants
These do not have normalizing rules, 1 because they are NOT canonical forms
as there was no precedent digital form of Singhala for backward
compatibility 2 It was submitted after Unicode closed receiving
applications for normalizing canonical forms. How on earth can you make a
sorting method for it?

When you backspace it destroys multiple keystrokes. Search and replace is
not possible, at least the way do it with English. Typing is a nightmare.

There are special rules for making Unicode Singhala fonts. The keyboards
have keys to type pieces of letters not in the code block.

As you see, this is a terrible mess and cannot be straightened, granted few
people use it, and there'll be more. What other choice do they have except
Anglicizing?. In Singhala, they say, "balu valigee u?a purukee ?aalaa
h??uva? n?? ??ee ?rennee" (??? ????? ?? ?????? ???? ??????? ?? ??? ???????
<- I inserted all joiners, but can't guarantee if vowel signs would pop
out). It means you cannot straighten dog tail even if you put it in a
bamboo.piece. You cannot fix Unicode Singhala and sadly, it is bringing
down the language with it.


On Sun, Mar 16, 2014 at 11:05 AM, William_J_G Overington <
wjgo_10009 at btinternet.com> wrote:

> > So, everyone, can the Romanized Singhala system be used with a QWERTY
> keyboard to produce Unicode-encoded text, thereby producing a good combined
> system?
>
> Could this be achieved if a text-processing software package were produced
> that could automatically perform a character string to character string
> substitution (namely Romanized Singhala character string to Unicode
> character string) that would be applied before any OpenType glyph to glyph
> substitution?
>
> The character string to character string substitution rules could be
> stored in a text file, such as a UTF-16 text file saved from WordPad, that
> format being what WordPad describes as a Unicode Text Document file type.
>
> Could this be achieved?
>
> If so, text entry could use an ordinary QWERTY keyboard and yet the
> resulting text would be stored using the appropriate Unicode characters for
> the script and the font would use Unicode mappings.
>
> William Overington
>
> 16 March 2014
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/ee20f382/attachment.html>

From marc at keyman.com  Mon Mar 17 18:36:34 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 17 Mar 2014 23:36:34 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD15AAE@federation.tavultesoft.local>

Typing methods are not a factor.  These are easily solved with modern input methods.  You are confusing encoding systems with input methods.

As I said previously, I am happy to assist you in creating a Unicode-based Romanized input method for Singhala, off list, that works exactly the way you want it to.  It?s an exciting process, especially that point when you discover how much more flexibility you get when you don?t design your encoding around a particular input method!  For example, you can create both a Romanized input method and a visual input method for the same encoding.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Naena Guru
Sent: Tuesday, 18 March 2014 10:19 AM
To: William_J_G Overington
Cc: Unicode List
Subject: Re: Romanized Singhala got great reception in Sri Lanka

When you backspace it destroys multiple keystrokes. Search and replace is not possible, at least the way do it with English. Typing is a nightmare.

There are special rules for making Unicode Singhala fonts. The keyboards have keys to type pieces of letters not in the code block.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/b84c5119/attachment.html>

From naenaguru at gmail.com  Mon Mar 17 19:04:28 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 19:04:28 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
Message-ID: <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>

Marc,

Yes, making keyboard layouts is not difficult.

I believed that language tools are selected for each language manually when
input. I did not know that there is an automatic switching of language
tools say when you switch to French keyboard from English. That shouldn't
be difficult to make for RS, though.

In the case of romanized Singhala, any processing that English accepts, it
accepts too. For RS, you select a font to display it in the native script
because if it is mixed with English, both are using the same character
space, just as when English and French are mixed.

Spell checking, grammar checking for Unicode Singhala? There are no such
things for it there. It is in the stage of struggling to input text:
special programs, physical keyboards etc. I saw them. They have a special
IT category of employees to input Unicode Sinhala. They have special places
called Typesetting kiosks in Lanka where you go to get your r?sum? and term
paper printed.


On Mon, Mar 17, 2014 at 3:33 PM, Marc Durdin <marc at keyman.com> wrote:

>  I disagree.  Making a basic keyboard layout is not hard, just like
> making a font without OpenType support is not that hard.   Making a
> keyboard layout that doesn?t force users to learn the nuances of the
> encoding of a script is more of a challenge, and making a high quality
> keyboard layout that is consistent, easy to use, and efficient is anything
> but straightforward.  Most keyboard layouts fail at one of these.
>
>
>
> The story for touch device input is even more challenging.  Not being
> constrained to a physical set of keys increases your flexibility.  The big
> challenge is usually the size of the display on mobile-sized devices.
>
>
>
> Regarding keyboard design:
>
> ?         Scan make/break codes are not really relevant to Windows
> keyboards ? Windows has an abstraction layer of ?virtual key? codes, for
> better or worse.
>
> ?         Selecting US-English for a non-English keyboard means that all
> language tools will break with your text.  Spell checking, grammar
> checking, automatic keyboard selection, autocorrect, font selection and
> more.  That?s a big price to pay.  Conversely, selecting Singhala for your
> Romanised non-Unicode encoding will break spell checking, grammar checking,
> automatic keyboard selection, autocorrect, font selection and more.
>
>
>
> Marc
>
>
>
> *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Naena
> Guru
> *Sent:* Tuesday, 18 March 2014 3:08 AM
> *To:* Doug Ewell
> *Cc:* UnicoDe List
> *Subject:* Re: Dead and Compose keys (was: Re: Romanized Singhala got
> great reception in Sri Lanka)
>
>
>
> Making a keyboard is not hard. You can either edit an existing one or make
> one from scratch. I made the latest Romanized Singhala one from scratch.
> The earlier one was an edit of US-International.
>
>
>
> When you type a key on the physical keyboard, you generate what is called
> a scan-code of that key so that the keyboard driver knows which key was
> pressed. (During DOS days, we used to catch them to make menus.) Now, you
> assign one or a sequence of Unicode characters you want to generate for the
> keypress.
>
>
>
> Use Microsft's keyboard layout creator for all versions of Windows from XP:
>
> http://msdn.microsoft.com/en-us/goglobal/bb964665.aspx
>
>
>
> Select the language carefully. I selected US-English for RS. That way, I
> can switch between the two keyboards quickly with Ctrl+Shift. You can
> change all these in the Control Panel.
>
>
>
> Here is the keymap I made for RS in Linux:
>
> http://ahangama.com/apiapi/singhala/linuxkb-s.php
>
> Just scroll down for the English part. (The lines starting with double
> slashes are comments and have no effect on the program)
>
>
>
> The Macintosh key layout is easy too.
>
>
>
> The story with iOS and Android are different but not hard either.
>
>
>
>
>
> On Sun, Mar 16, 2014 at 6:47 PM, Doug Ewell <doug at ewellic.org> wrote:
>
> Jean-Fran?ois Colson <jf at colson dot eu> wrote:
>
> The idea here was ?that characters not on an ordinary QWERTY keyboard
> could be entered _using_an_ordinary_QWERTY_keyboard._? Are there any
> dead keys on an _ordinary_ (i.e. not one using an international(ized)
> driver) QWERTY keyboard?
>
>
> Not on the standard vanilla U.S. keyboard. It has to be provided by the
> OS, via a driver, just as Compose key support has to be provided by the OS.
>
> The standard vanilla U.S. keyboard also doesn't provide the accented
> letters and other non-ASCII letters like ? that Naena Guru uses for his
> font hack.
>
> If a character is available by a dead key, isn?t it on the keyboard ?
>
>
> It depends on what you mean by "on the keyboard." Thanks to John Cowan's
> delightful Moby Latin keyboard layout, I can type AltGr+\ followed by 7 to
> get the fraction ? (one-seventh). That character is not "on the keyboard"
> in any sense other than what the driver provides.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ?
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/6024d94c/attachment.html>

From naenaguru at gmail.com  Mon Mar 17 19:21:00 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Mon, 17 Mar 2014 19:21:00 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net>
References: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net>
Message-ID: <CAHK3Hy3=CQi0uHa1kyYSYe2=nJgzjNaQnEM2BOc0BOyPSVJiXA@mail.gmail.com>

Doug,

Making keyboard layouts for Unicode Singhala is hard not because of fault
of Unicode. It is the complexity of letter assembly. I have use the
Wijesekara keyboard on a 24in Olympia Singhala keyboard in 1970s. It is
radically different from US-English.

I tried to make a phonetic one to kind of relate to the English keys.
Still, you need to have many shifted keys to get common letters.


On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell <doug at ewellic.org> wrote:

> Naena Guru <naenaguru at gmail dot com> wrote:
>
> > Making a keyboard [layout] is not hard. You can either edit an
> > existing one or make one from scratch. I made the latest Romanized
> > Singhala one from scratch. The earlier one was an edit of US-
> > International.
>
> I've made a couple dozen of them myself, with MSKLC.
>
> > When you type a key on the physical keyboard, you generate what is
> > called a scan-code of that key so that the keyboard driver knows which
> > key was pressed. (During DOS days, we used to catch them to make
> > menus.) Now, you assign one or a sequence of Unicode characters you
> > want to generate for the keypress.
>
> Precisely. As Marc Durdin said, you can create a keyboard layout just as
> easily for Unicode characters as for ASCII and Latin-1 characters. You
> can also assign a combination of characters to a single key.
>
> So it is not true that "typing Unicode Sinhala requires you to learn a
> key map that is entirely different from the familiar English keyboard,
> while losing some marks and signs too." Unicode does not prescribe any
> key map. You can have whatever layout you like.
>
> As Marc also said, if you think there are "marks and signs" missing from
> Unicode, that is another matter.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140317/59b7120b/attachment.html>

From lang.support at gmail.com  Mon Mar 17 19:45:55 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Tue, 18 Mar 2014 11:45:55 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy3=CQi0uHa1kyYSYe2=nJgzjNaQnEM2BOc0BOyPSVJiXA@mail.gmail.com>
References: <20140317093847.665a7a7059d7ee80bb4d670165c8327d.a4179b6ae2.wbe@email03.secureserver.net>
 <CAHK3Hy3=CQi0uHa1kyYSYe2=nJgzjNaQnEM2BOc0BOyPSVJiXA@mail.gmail.com>
Message-ID: <CAGJ7U-UzQ5mvcZBrxjPBrw5EJbT6bEajgDjLmvPAmM-8+ywNfg@mail.gmail.com>

On 18/03/2014 11:23 AM, "Naena Guru" <naenaguru at gmail.com> wrote:
>

>
> I tried to make a phonetic one to kind of relate to the English keys.
Still, you need to have many shifted keys to get common letters.
>

No you don't, you just need to understand the possibilities of what your
input framework is capable of and the best way to implement what you want
to achieve.

The windows input system is probably the most contrained,  but to look at a
good phonetic layout have a look at the Cherokee Phonetic layout on Windows
8+

Designing a god layout requires using the right tools,  knowing the limits
and capabilities of those tools, and using them in creative ways.

>
> On Mon, Mar 17, 2014 at 11:38 AM, Doug Ewell <doug at ewellic.org> wrote:
>>
>> Naena Guru <naenaguru at gmail dot com> wrote:
>>
>> > Making a keyboard [layout] is not hard. You can either edit an
>> > existing one or make one from scratch. I made the latest Romanized
>> > Singhala one from scratch. The earlier one was an edit of US-
>> > International.
>>
>> I've made a couple dozen of them myself, with MSKLC.
>>
>> > When you type a key on the physical keyboard, you generate what is
>> > called a scan-code of that key so that the keyboard driver knows which
>> > key was pressed. (During DOS days, we used to catch them to make
>> > menus.) Now, you assign one or a sequence of Unicode characters you
>> > want to generate for the keypress.
>>
>> Precisely. As Marc Durdin said, you can create a keyboard layout just as
>> easily for Unicode characters as for ASCII and Latin-1 characters. You
>> can also assign a combination of characters to a single key.
>>
>> So it is not true that "typing Unicode Sinhala requires you to learn a
>> key map that is entirely different from the familiar English keyboard,
>> while losing some marks and signs too." Unicode does not prescribe any
>> key map. You can have whatever layout you like.
>>
>> As Marc also said, if you think there are "marks and signs" missing from
>> Unicode, that is another matter.
>>
>> --
>> Doug Ewell | Thornton, CO, USA
>> http://ewellic.org | @DougEwell
>>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/8dc5bd0b/attachment.html>

From doug at ewellic.org  Mon Mar 17 19:56:25 2014
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 17 Mar 2014 18:56:25 -0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell><CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com><1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
Message-ID: <97765363E97E4C2E9CE7F318327A11D1@DougEwell>

Naena Guru wrote:

> In the case of romanized Singhala, any processing that English
> accepts, it accepts too. For RS, you select a font to display it in
> the native script because if it is mixed with English, both are using
> the same character space, just as when English and French are mixed.

But English and French actually *use* the same letters, or at any rate 
most of them. With your approach, it is not possible to write Sinhala in 
the Sinhala script mixed with English or French or anything else in the 
Latin script. In web pages you can resort to <span style="Latin"> 
tricks, but this doesn't work for plain text.

This is what people mean when they suggest that your real goal is to 
abolish the Sinhala script and just write in Latin.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ?


From ken.whistler at sap.com  Mon Mar 17 20:36:59 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 18 Mar 2014 01:36:59 +0000
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F7A95@USPHLEMB12C.global.corp.sap>

Well, I actually don?t see. I took a look at the Sinhala you inserted in this
email. I cannot tell what you did at your input end (about ?inserted all joiners?),
but there are no actual joiners in the text itself. It displayed just fine
in my email (including the correct conditional formatting of the ?u vowel
applied to the ra in purukee), without me doing anything special (or installing
any hacked font). Why? Because it was transmitted in plain Unicode.

I cut and pasted that Unicode Sinhala string into a Word document, and
it worked just fine. The boundaries for all the syllables were correctly
detected.

I saved it as a plain text UTF-8 file, and it worked just
fine. I even then read the plain text UTF-8 file into a UTF-8 aware
programming editor, and it worked just fine. (In a programming editor,
which doesn?t attempt complex script rendering,
the vowels don?t apply to the consonants and no reordering is done, so
the display isn?t correct, but each character is correctly preserved, and
if I write it back out to a document and read it in Word or some other
tool that has access to proper rendering, it is still fine.) And all that
interoperability works, why? Because this is plain Unicode.

So while I don?t doubt that people may be having serious issues with
input methods for Sinhala, I tend to agree with Marc Durdin that you are confusing
encoding with input methods. Yes, I know you know the difference,
but it appears to me that the inescapable conclusion from your
argumentation is that the highest priority for the design of an
encoding system should be to make the design of input methods
as simple as possible. And in my estimation, that is confusing encoding
with input methods.

The art of input methods is to hide encoding details from users, and
instead to provide them with an abstraction that they find easy to
use and which accords with their general understanding of the writing
system they are using. If done correctly, then the details of the input
method *also* recede into the background, and users then simply
do what they want: write and edit text easily on their devices.

--Ken

P.S. Here is an octal dump of that text (after I inserted a closing parenthesis in
the editor). Sinhala sequence highlighted. Plain Unicode in UTF-8,
no fancy stuff, and works just fine.

0000000000    EF  BB  BF  62  61  6C  75  20  76  61  6C  69  67  65  65  C2
0000000020    A0  75  C2  B5  61  20  70  75  72  75  6B  65  65  C2  A0  C3
0000000040    B0  61  61  6C  61  61  20  68  C3  A6  C3  B0  75  76  61  C3
0000000060    BE  20  6E  C3  A6  C3  A6  20  C3  A6  C3  B0  65  65  20  C3
0000000100    A6  72  65  6E  6E  65  65  0D  0A  28  E0  B6  B6  E0  B6  BD
0000000120    E0  B7  94  20  E0  B7  80  E0  B6  BD  E0  B7  92  E0  B6  9C
0000000140    E0  B7  9A  20  E0  B6  8B  E0  B6  AB  20  E0  B6  B4  E0  B7
0000000160    94  E0  B6  BB  E0  B7  94  E0  B6  9A  E0  B7  9A  20  E0  B6
0000000200    AF  E0  B7  8F  E0  B6  BD  E0  B7  8F  20  E0  B7  84  E0  B7
0000000220    90  E0  B6  AF  E0  B7  94  E0  B7  80  E0  B6  AD  E0  B7  8A
0000000240    20  E0  B6  B1  E0  B7  91  20  E0  B6  87  E0  B6  AF  E0  B7
0000000260    9A  20  E0  B6  87  E0  B6  BB  E0  B7  99  E0  B6  B1  E0  B7
0000000300    8A  E0  B6  B1  E0  B7  9A  29  0D  0A  0D  0A

As you see, this is a terrible mess and cannot be straightened, granted few people use it, and there'll be more. What other choice do they have except Anglicizing?. In Singhala, they say, "balu valigee u?a purukee ?aalaa h??uva? n?? ??ee ?rennee" (??? ????? ?? ?????? ???? ??????? ?? ??? ??????? <- I inserted all joiners, but can't guarantee if vowel signs would pop out). It means you cannot straighten dog tail even if you put it in a bamboo.piece. You cannot fix Unicode Singhala and sadly, it is bringing down the language with it.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/f87b1e57/attachment.html>

From naenaguru at gmail.com  Tue Mar 18 00:23:31 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Tue, 18 Mar 2014 00:23:31 -0500
Subject: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F7A95@USPHLEMB12C.global.corp.sap>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F7A95@USPHLEMB12C.global.corp.sap>
Message-ID: <CAHK3Hy3bg=TSpaVYGUx5_Ltch86_AgMLpFmXU4M6gDP93Q-j7Q@mail.gmail.com>

Thank you, Ken.

You very nicely analyzed it. Why I said that the signs might pop out
because  I have had complaints that happening. I think this is because
implementation of proper rendering is behind in some systems.

On input, I tried to make a layout that is close to QWERTY. But failed
because of the need for too many combination keys. Keyman uses the old
typewriter keyboard Wijesekara. I saw a better one on the front page for
Singhala but did not find it further inside. Marc would know, of course.

Anyway my complaint is that Unicode Singhala is incomplete and wrong and
that it has a deleterious effect on the language, one of the oldest in the
world. What's aggravating is that they institutionalize errors as correct.
Rev. Fr. Perera warned against this 80 years ago. I suppose I wouldn't have
much to say if the 58 phonemes are used to replace the ones there. It will
not happen.


On Mon, Mar 17, 2014 at 8:36 PM, Whistler, Ken <ken.whistler at sap.com> wrote:

>  Well, I actually don?t see. I took a look at the Sinhala you inserted in
> this
>
> email. I cannot tell what you did at your input end (about ?inserted all
> joiners?),
>
> but there are no actual joiners in the text itself. It displayed just fine
>
> in my email (including the correct conditional formatting of the ?u vowel
>
> applied to the ra in pu*ru*kee), without me doing anything special (or
> installing
>
> any hacked font). Why? Because it was transmitted in plain Unicode.
>
>
>
> I cut and pasted that Unicode Sinhala string into a Word document, and
>
> it worked just fine. The boundaries for all the syllables were correctly
>
> detected.
>
>
>
> I saved it as a plain text UTF-8 file, and it worked just
>
> fine. I even then read the plain text UTF-8 file into a UTF-8 aware
>
> programming editor, and it worked just fine. (In a programming editor,
>
> which doesn?t attempt complex script rendering,
>
> the vowels don?t apply to the consonants and no reordering is done, so
>
> the display isn?t correct, but each character is correctly preserved, and
>
> if I write it back out to a document and read it in Word or some other
>
> tool that has access to proper rendering, it is still fine.) And all that
>
> interoperability works, why? Because this is plain Unicode.
>
>
>
> So while I don?t doubt that people may be having serious issues with
>
> input methods for Sinhala, I tend to agree with Marc Durdin that you are
> confusing
>
> encoding with input methods. Yes, I know you know the difference,
>
> but it appears to me that the inescapable conclusion from your
>
> argumentation is that the highest priority for the design of an
>
> encoding system should be to make the design of input methods
>
> as simple as possible. And in my estimation, that is confusing encoding
>
> with input methods.
>
>
>
> The art of input methods is to hide encoding details from users, and
>
> instead to provide them with an abstraction that they find easy to
>
> use and which accords with their general understanding of the writing
>
> system they are using. If done correctly, then the details of the input
>
> method *also* recede into the background, and users then simply
>
> do what they want: write and edit text easily on their devices.
>
>
>
> --Ken
>
>
>
> P.S. Here is an octal dump of that text (after I inserted a closing
> parenthesis in
>
> the editor). Sinhala sequence highlighted. Plain Unicode in UTF-8,
>
> no fancy stuff, and works just fine.
>
>
>
> 0000000000    EF  BB  BF  62  61  6C  75  20  76  61  6C  69  67  65  65
> C2
>
> 0000000020    A0  75  C2  B5  61  20  70  75  72  75  6B  65  65  C2  A0
> C3
>
> 0000000040    B0  61  61  6C  61  61  20  68  C3  A6  C3  B0  75  76  61
> C3
>
> 0000000060    BE  20  6E  C3  A6  C3  A6  20  C3  A6  C3  B0  65  65  20
> C3
>
> 0000000100    A6  72  65  6E  6E  65  65  0D  0A  28  E0  B6  B6  E0  B6
> BD
>
> 0000000120    E0  B7  94  20  E0  B7  80  E0  B6  BD  E0  B7  92  E0  B6
> 9C
>
> 0000000140    E0  B7  9A  20  E0  B6  8B  E0  B6  AB  20  E0  B6  B4  E0
> B7
>
> 0000000160    94  E0  B6  BB  E0  B7  94  E0  B6  9A  E0  B7  9A  20  E0
> B6
>
> 0000000200    AF  E0  B7  8F  E0  B6  BD  E0  B7  8F  20  E0  B7  84  E0
> B7
>
> 0000000220    90  E0  B6  AF  E0  B7  94  E0  B7  80  E0  B6  AD  E0  B7
> 8A
>
> 0000000240    20  E0  B6  B1  E0  B7  91  20  E0  B6  87  E0  B6  AF  E0
> B7
>
> 0000000260    9A  20  E0  B6  87  E0  B6  BB  E0  B7  99  E0  B6  B1  E0
> B7
>
> 0000000300    8A  E0  B6  B1  E0  B7  9A  29  0D  0A  0D  0A
>
>
>
> As you see, this is a terrible mess and cannot be straightened, granted
> few people use it, and there'll be more. What other choice do they have
> except Anglicizing?. In Singhala, they say, "balu valigee u?a
> purukee ?aalaa h??uva? n?? ??ee ?rennee" (??? ????? ?? ?????? ???? ???????
> ?? ??? ??????? <- I inserted all joiners, but can't guarantee if vowel
> signs would pop out). It means you cannot straighten dog tail even if you
> put it in a bamboo.piece. You cannot fix Unicode Singhala and sadly, it is
> bringing down the language with it.
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/71059d9e/attachment.html>

From naenaguru at gmail.com  Tue Mar 18 01:01:02 2014
From: naenaguru at gmail.com (Naena Guru)
Date: Tue, 18 Mar 2014 01:01:02 -0500
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
Message-ID: <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>

Okay, Doug.

Type this inside the yellow text box in the following page:
kaaryyaalavala yan?ra pa?k?i
http://www.lovatasinhala.com/puvaruva.php

Please tell me what sequence of Unicode Sinhala codes would produce what
the text box shows.


On Mon, Mar 17, 2014 at 7:56 PM, Doug Ewell <doug at ewellic.org> wrote:

> Naena Guru wrote:
>
>  In the case of romanized Singhala, any processing that English
>> accepts, it accepts too. For RS, you select a font to display it in
>> the native script because if it is mixed with English, both are using
>> the same character space, just as when English and French are mixed.
>>
>
> But English and French actually *use* the same letters, or at any rate
> most of them. With your approach, it is not possible to write Sinhala in
> the Sinhala script mixed with English or French or anything else in the
> Latin script. In web pages you can resort to <span style="Latin"> tricks,
> but this doesn't work for plain text.
>
> This is what people mean when they suggest that your real goal is to
> abolish the Sinhala script and just write in Latin.
>
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ?
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/9322b0c1/attachment.html>

From chris.fynn at gmail.com  Tue Mar 18 03:20:40 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 14:20:40 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
Message-ID: <CAA_CYcKtEY5ZC6BFiTpYoXrZ+Zue+W51pNTyqsBc7kmE4tGUJg@mail.gmail.com>

MSKLC and KeyMan are fairly crude ways of creating input methods

For what you want to - you probably need a memory resident program
that traps the Latin input from the keyboard, processes the
(transliterated) input strings converting them into unicode Sinhala
strings, and then injects these back into the input queue  in place of
the Latin characters.

There are a couple of utilities that do this for typing
transliterated/romanised Tibetan in Windows and getting  Tibetan
Unicode output.
http://tise.mokhin.org/
http://www.thubtenrigzin.fr/denjongtibtype/en.html

But I think both of these were written in C as they have to do a lot
of processing which is far beyond what can be accomplished with MSKLC
and even KeyMan

- C


From lang.support at gmail.com  Tue Mar 18 04:26:30 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Tue, 18 Mar 2014 20:26:30 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAA_CYcKtEY5ZC6BFiTpYoXrZ+Zue+W51pNTyqsBc7kmE4tGUJg@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CAA_CYcKtEY5ZC6BFiTpYoXrZ+Zue+W51pNTyqsBc7kmE4tGUJg@mail.gmail.com>
Message-ID: <CAGJ7U-USYjYChOYhxaU1-baytqOZCAQ8O9iDFhF0rkAW8+U=mQ@mail.gmail.com>

Chris,

Keyman is capable of doing that and a lot more,  but few keyboard layout
developers use it to its full potential.

As an example,  I was asked by Harari teachers here in Melbourne to develop
a set of three keyboard layouts for them and their students.
The three keyboards were for three different orthographies in the following
scripts:
1) Latin
2) Ethiopic
3) Arabic

They wanted all three layouts to work identically,  using the keystrokes
used on the Latin keyboard.

The Ethiopic and Arabic keyboard layouts required extensive remapping of
key sequences to output.

If I was a programmer I could have done something more elegant by building
an external library Keyman could call but as it is we could do a lot inside
the Keyman keyboard layout itself.

For Myanmar script keyboard layouts we allow visual input for the e-vowel
sign and medial Ra,  with the layout handling reordering.

One of the Latin layouts I use,  supports combining diacritics and reorders
sequences of diacritics to their canonical order regardless of order of
input. Assuming a maximum of one diacritic below and two diacrtics above
base character.
Analysis and creativity can produce some very effective Keyman layouts.

Andrew
 On 18/03/2014 7:23 PM, "Christopher Fynn" <chris.fynn at gmail.com> wrote:

> MSKLC and KeyMan are fairly crude ways of creating input methods
>
> For what you want to - you probably need a memory resident program
> that traps the Latin input from the keyboard, processes the
> (transliterated) input strings converting them into unicode Sinhala
> strings, and then injects these back into the input queue  in place of
> the Latin characters.
>
> There are a couple of utilities that do this for typing
> transliterated/romanised Tibetan in Windows and getting  Tibetan
> Unicode output.
> http://tise.mokhin.org/
> http://www.thubtenrigzin.fr/denjongtibtype/en.html
>
> But I think both of these were written in C as they have to do a lot
> of processing which is far beyond what can be accomplished with MSKLC
> and even KeyMan
>
> - C
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/b05a2c21/attachment.html>

From jf at colson.eu  Tue Mar 18 08:35:48 2014
From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=)
Date: Tue, 18 Mar 2014 14:35:48 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
Message-ID: <53284BB4.60707@colson.eu>

Le 18/03/14 07:01, Naena Guru a ?crit :
> Okay, Doug.
>
> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php
>
> Please tell me what sequence of Unicode Sinhala codes would produce 
> what the text box shows.
>

OK. I'd first say I don't speak Sinhala and I've never written a word in 
that language... until now. Therefore there might be mistakes and I 
didn't find how to write the second syllable, ryyaa. I've replaced it by 
***** below.
Here is my attempt:
??*****??? ?????? ?????

Could an aware person tell how to type the syllable ryyaa?


>
>
> On Mon, Mar 17, 2014 at 7:56 PM, Doug Ewell <doug at ewellic.org 
> <mailto:doug at ewellic.org>> wrote:
>
>     Naena Guru wrote:
>
>         In the case of romanized Singhala, any processing that English
>         accepts, it accepts too. For RS, you select a font to display
>         it in
>         the native script because if it is mixed with English, both
>         are using
>         the same character space, just as when English and French are
>         mixed.
>
>
>     But English and French actually *use* the same letters, or at any
>     rate most of them. With your approach, it is not possible to write
>     Sinhala in the Sinhala script mixed with English or French or
>     anything else in the Latin script. In web pages you can resort to
>     <span style="Latin"> tricks, but this doesn't work for plain text.
>
>     This is what people mean when they suggest that your real goal is
>     to abolish the Sinhala script and just write in Latin.
>
>
>     --
>     Doug Ewell | Thornton, CO, USA
>     http://ewellic.org | @DougEwell ?
>
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/f5f6d358/attachment.html>

From chris.fynn at gmail.com  Tue Mar 18 09:46:47 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 20:46:47 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAGJ7U-USYjYChOYhxaU1-baytqOZCAQ8O9iDFhF0rkAW8+U=mQ@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CAA_CYcKtEY5ZC6BFiTpYoXrZ+Zue+W51pNTyqsBc7kmE4tGUJg@mail.gmail.com>
 <CAGJ7U-USYjYChOYhxaU1-baytqOZCAQ8O9iDFhF0rkAW8+U=mQ@mail.gmail.com>
Message-ID: <CAA_CYc+ch0fjYCC-0eM_jt0Nz9DVSsdBG2aO-=+BW82ZKO=uTw@mail.gmail.com>

Hi Andrew

It may be possible with Keyman. I once even wrote a set of MS Word
macros that did the same thing (let users type in Romanized Tibetan
and output Tibetan characters) - however it stopped working when
Microsoft switched from Word Basic to VBA.  :-(

At least Keyman hides all the messy (and poorly documented) details of
Windows system hooks which is what you have to use if you want to make
a stand-alone utility (did that once too).

If Keyman can call external libraries ~ that's interesting. It is
certainly *far* more sophisticated and flexible than MSKLC and I
shouldn't have lumped the two together.

- Chris

On 18/03/2014, Andrew Cunningham <lang.support at gmail.com> wrote:
> Chris,
>
> Keyman is capable of doing that and a lot more,  but few keyboard layout
> developers use it to its full potential.
>
> As an example,  I was asked by Harari teachers here in Melbourne to develop
> a set of three keyboard layouts for them and their students.
> The three keyboards were for three different orthographies in the following
> scripts:
> 1) Latin
> 2) Ethiopic
> 3) Arabic
>
> They wanted all three layouts to work identically,  using the keystrokes
> used on the Latin keyboard.
>
> The Ethiopic and Arabic keyboard layouts required extensive remapping of
> key sequences to output.
>
> If I was a programmer I could have done something more elegant by building
> an external library Keyman could call but as it is we could do a lot inside
> the Keyman keyboard layout itself.
>
> For Myanmar script keyboard layouts we allow visual input for the e-vowel
> sign and medial Ra,  with the layout handling reordering.
>
> One of the Latin layouts I use,  supports combining diacritics and reorders
> sequences of diacritics to their canonical order regardless of order of
> input. Assuming a maximum of one diacritic below and two diacrtics above
> base character.
> Analysis and creativity can produce some very effective Keyman layouts.
>
> Andrew
>  On 18/03/2014 7:23 PM, "Christopher Fynn" <chris.fynn at gmail.com> wrote:
>
>> MSKLC and KeyMan are fairly crude ways of creating input methods

>> For what you want to - you probably need a memory resident program
>> that traps the Latin input from the keyboard, processes the
>> (transliterated) input strings converting them into unicode Sinhala
>> strings, and then injects these back into the input queue  in place of
>> the Latin characters.

>> There are a couple of utilities that do this for typing
>> transliterated/romanised Tibetan in Windows and getting  Tibetan
>> Unicode output.
>> http://tise.mokhin.org/
>> http://www.thubtenrigzin.fr/denjongtibtype/en.html

>> But I think both of these were written in C as they have to do a lot
>> of processing which is far beyond what can be accomplished with MSKLC
>> and even KeyMan

>> - C


From chris.fynn at gmail.com  Tue Mar 18 10:04:35 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Tue, 18 Mar 2014 21:04:35 +0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
Message-ID: <CAA_CYc+REs+QTia6aB-z-pBiUe1qFDWvrm1z9H3PnV6-99igVA@mail.gmail.com>

On 18/03/2014, Naena Guru <naenaguru at gmail.com> wrote:

> Okay, Doug.

> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php

> Please tell me what sequence of Unicode Sinhala codes would produce what
> the text box shows.

Naena Guru

If you want you should just be able to type it in as you wrote it
"kaaryyaalavala yan?ra pa?k?i"
and  get Singhala Unicode characters. But to do this you do need
something more than a re-mapped keyboard layout made with MSKLC

So long as the Roman  transliteration system you are using for
Singhala and Pali follows consistent rules, it is possible to write an
input method that parses the Romanized text and
converts it into  Singhala Unicode.

If you care about your language and script, that is the proper way to
do this sort of thing - not by using OpenType lookups to map strings
of latin characters to Singhala glyphs.

Chris Fynn
Thimphu, Bhutan


From jf at colson.eu  Tue Mar 18 10:36:44 2014
From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=)
Date: Tue, 18 Mar 2014 16:36:44 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <53284BB4.60707@colson.eu>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <53284BB4.60707@colson.eu>
Message-ID: <5328680C.2000909@colson.eu>

Le 18/03/14 14:35, Jean-Fran?ois Colson a ?crit :
> Le 18/03/14 07:01, Naena Guru a ?crit :
>> Okay, Doug.
>>
>> Type this inside the yellow text box in the following page:
>> kaaryyaalavala yan?ra pa?k?i
>> http://www.lovatasinhala.com/puvaruva.php
>>
>> Please tell me what sequence of Unicode Sinhala codes would produce 
>> what the text box shows.
>>
>
> OK. I'd first say I don't speak Sinhala and I've never written a word 
> in that language... until now. Therefore there might be mistakes and I 
> didn't find how to write the second syllable, ryyaa. I've replaced it 
> by ***** below.
> Here is my attempt:
> ??*****??? ?????? ?????
>
> Could an aware person tell how to type the syllable ryyaa?

Perhaps a good tutorial on the use of ZWJ/ZWNJ in Sinhala could do the job.


>
>
>>
>>
>> On Mon, Mar 17, 2014 at 7:56 PM, Doug Ewell <doug at ewellic.org 
>> <mailto:doug at ewellic.org>> wrote:
>>
>>     Naena Guru wrote:
>>
>>         In the case of romanized Singhala, any processing that English
>>         accepts, it accepts too. For RS, you select a font to display
>>         it in
>>         the native script because if it is mixed with English, both
>>         are using
>>         the same character space, just as when English and French are
>>         mixed.
>>
>>
>>     But English and French actually *use* the same letters, or at any
>>     rate most of them. With your approach, it is not possible to
>>     write Sinhala in the Sinhala script mixed with English or French
>>     or anything else in the Latin script. In web pages you can resort
>>     to <span style="Latin"> tricks, but this doesn't work for plain text.
>>
>>     This is what people mean when they suggest that your real goal is
>>     to abolish the Sinhala script and just write in Latin.
>>
>>
>>     --
>>     Doug Ewell | Thornton, CO, USA
>>     http://ewellic.org | @DougEwell ?
>>
>>
>>
>>
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/eed78be5/attachment.html>

From doug at ewellic.org  Tue Mar 18 11:29:03 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 09:29:03 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <20140318092903.665a7a7059d7ee80bb4d670165c8327d.12c06e3b98.wbe@email03.secureserver.net>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

>>> Type this inside the yellow text box in the following page:
>>> kaaryyaalavala yan?ra pa?k?i
>>> http://www.lovatasinhala.com/puvaruva.php
>>>
>>> Please tell me what sequence of Unicode Sinhala codes would produce
>>> what the text box shows.
>>
>> Could an aware person tell how to type the syllable ryyaa?
>
> Perhaps a good tutorial on the use of ZWJ/ZWNJ in Sinhala could do the
> job.

The RS-to-Unicode conversion tool on Naena's own site gives
????????????? ???????
???????, but this doesn't exactly match the text in the
yellow box visually. I suppose some combination of ZWJ/ZWNJ usage and
font differences accounts for this.

In any case, I wouldn't expect his RS-to-Unicode converter to work with
100% fidelity, given that he is trying to make a case that Unicode is
inadequate to represent Sinhala text.

Naena knows from previous threads that I don't speak or write Sinhala. I
hope his intent in asking me to provide a Unicode transcoding is not to
call attention to this, and to try to demonstrate thereby that he knows
more about character encoding than I do. I can spend a little more time
on this when I'm in front of my Windows 8.1 machine, which has better
support for Sinhala than Windows 7.

But I would think a better argument on Naena's part would be to show
*us* what parts of "kaaryyaalavala yan?ra pa?k?i" can be adequately
represented in his scheme but not in Unicode. And by "represented," I
don't just mean "displayed on any arbitrary system." Display problems
can be and tend to be fixed over time, and are not all there is to
character encoding anyway.

I did also notice that Naena carefully avoided responding to my point
that his approach prevents the simultaneous plain-text display of
en-Latn and si-Sinh, something he does often on his web pages.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From doug at ewellic.org  Tue Mar 18 12:27:21 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 10:27:21 -0700
Subject: Details,
 please (was: Re: Romanized Singhala got great reception in Sri Lanka)
Message-ID: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net>

I think what some of us would like to see are detailed examples, citing
specific characters and combinations, rather than general rhetoric, to
support claims like this:

"Anyway my complaint is that Unicode Singhala is incomplete and wrong
and that it has a deleterious effect on the language, one of the oldest
in the world. What's aggravating is that they institutionalize errors as
correct. Rev. Fr. Perera warned against this 80 years ago. I suppose I
wouldn't have much to say if the 58 phonemes are used to replace the
ones there. It will not happen."

and these, from the web site:

"[Romanized Singhala] is stable as it is safe from rules imposed by
Unicode Consortium based on misinformation, and careless mangling of the
language by disinterested bureaucrats."

"Unicode Sinhala is a failure and cannot be fixed. That is because the
premise on which it was designed is flawed."

"Abugida is a writing system relegated to the sideline, as inherently
incapable of a smooth interface with the computer. This is what Unicode
Sinhala suffers from."

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From mike at mikemorr.com  Tue Mar 18 12:46:37 2014
From: mike at mikemorr.com (Mike Morrison)
Date: Tue, 18 Mar 2014 13:46:37 -0400
Subject: "Heron" element in Unicode?
Message-ID: <CAKoET3X6wNvYFtTp2D9sSkE-n5J9jhpPff6mnOuHikbTwfTzKQ@mail.gmail.com>

Hello,

Does the element which is the right half of ?, and the left half of ?,
?, ?, exist anywhere in the current or proposed Unicode standards?
It's a simplification of ?, and similar to but not the same as ?. If
not currently in Unicode, is it the sort of thing that might be
considered for addition in the future?

Thanks,

Mike Morrison


From andrewcwest at gmail.com  Tue Mar 18 13:49:14 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Tue, 18 Mar 2014 18:49:14 +0000
Subject: "Heron" element in Unicode?
In-Reply-To: <CAKoET3X6wNvYFtTp2D9sSkE-n5J9jhpPff6mnOuHikbTwfTzKQ@mail.gmail.com>
References: <CAKoET3X6wNvYFtTp2D9sSkE-n5J9jhpPff6mnOuHikbTwfTzKQ@mail.gmail.com>
Message-ID: <CALgEMhwmXxav4E-fHMzi_RaLa04ddnyWeJgh214gOZoxg9oE4A@mail.gmail.com>

On 18 March 2014 17:46, Mike Morrison <mike at mikemorr.com> wrote:
>
> Does the element which is the right half of ?, and the left half of ?,
> ?, ?, exist anywhere in the current or proposed Unicode standards?

No.

> It's a simplification of ?, and similar to but not the same as ?. If
> not currently in Unicode, is it the sort of thing that might be
> considered for addition in the future?

People have been talking for many years about encoding the relatively
few CJK components that do not exist as characters in there own right,
and I think that there would be some support from the relevant
committees if a well-presented proposal was submitted.

Andrew


From tom at bluesky.org  Tue Mar 18 14:07:07 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 12:07:07 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
Message-ID: <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>


On Mar 17, 2014, at 11:01 PM, Naena Guru wrote:

> 
> Type this inside the yellow text box in the following page:
> kaaryyaalavala yan?ra pa?k?i
> http://www.lovatasinhala.com/puvaruva.php
> 
> Please tell me what sequence of Unicode Sinhala codes would produce what the text box shows.

Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed

karYYalvl ynhR p;kni

and I got

????????????? ?????? ?????

Graphic:


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/0a4ea612/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 11.59.13 AM.png
Type: image/png
Size: 8085 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/0a4ea612/attachment.png>

From jf at colson.eu  Tue Mar 18 14:10:12 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Tue, 18 Mar 2014 20:10:12 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
Message-ID: <53289A14.9050108@colson.eu>

Le 18/03/14 20:07, Tom Gewecke a ?crit :
>
> On Mar 17, 2014, at 11:01 PM, Naena Guru wrote:
>
>>
>> Type this inside the yellow text box in the following page:
>> kaaryyaalavala yan?ra pa?k?i
>> http://www.lovatasinhala.com/puvaruva.php
>>
>> Please tell me what sequence of Unicode Sinhala codes would produce 
>> what the text box shows.
>
> Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala 
> fonts, I typed
>
> karYYalvl ynhR p;kni

Shouldn?t it be p;khi?

>
> and I got
>
> ????????????? ?????? ?????
>
> Graphic:
>
>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/3dbb5ec3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 8085 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/3dbb5ec3/attachment.png>

From tom at bluesky.org  Tue Mar 18 14:24:02 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 12:24:02 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <53289A14.9050108@colson.eu>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
Message-ID: <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>


On Mar 18, 2014, at 12:10 PM, Jean-Fran?ois Colson wrote:

>> 
>> Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed
>> 
>> karYYalvl ynhR p;kni
> 
> Shouldn?t it be p;khi?
> 

Yes, sorry.

karYYalvl ynhR p;khi

????????????? ?????? ?????

Graphic


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/a2eba991/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 12.21.06 PM.png
Type: image/png
Size: 8338 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/a2eba991/attachment.png>

From doug at ewellic.org  Tue Mar 18 14:37:20 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 12:37:20 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>

Tom, with typo spotted and corrected by Jean-Fran?ois, seems to have
found it:

????????????? ??????
?????

The sequence of code points would thus be:

0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2

Naena, is this what you were looking for?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From lang.support at gmail.com  Tue Mar 18 14:52:34 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Wed, 19 Mar 2014 06:52:34 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
Message-ID: <CAGJ7U-UzxCOnyRU6KJD3Fv5TjJPBqzfSARoa105t9x_apQKF8A@mail.gmail.com>

I suspect it was a fishing expedition to illustrate how awkward it is to
type on Unicode keyboard layouts versus his system.

Ie still no clear separation of input and encoding in his responses.
On 19/03/2014 6:39 AM, "Doug Ewell" <doug at ewellic.org> wrote:

> Tom, with typo spotted and corrected by Jean-Fran?ois, seems to have
> found it:
>
> ????????????? ??????
> ?????
>
> The sequence of code points would thus be:
>
> 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
> 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2
>
> Naena, is this what you were looking for?
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140319/ae5bf9fa/attachment.html>

From marc at keyman.com  Tue Mar 18 15:48:21 2014
From: marc at keyman.com (Marc Durdin)
Date: Tue, 18 Mar 2014 20:48:21 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
 <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>

I know this is adding fuel to the fire but I?m sure that I am not the only one to note that the way the text is rendered in Tom?s graphic differs from the way the text is rendered with Iskoola Pota font in Win 7 and Nirmala UI font in Win 8.1.  I have not analysed the difference, nor can I state with certainty which is more accurate, but this is clearly an inconsistency that will plague end users.

Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering?   See graphic below.

[cid:image002.png at 01CF4346.93823730]

????????????? ?????? ?????    Unicode text

Unicode Values: U+0D9A U+0DCF U+0DBB U+0DCA U+0DBA U+0DCA U+0DBA U+0DCF U+0DBD U+0DC0 U+0DBD U+0020 U+0DBA U+0DB1 U+0DC4 U+0DCA U+0DBB U+0020 U+0DB4 U+0DA9 U+0D9A U+0DC4 U+0DD2

FWIW, iOS 7.1 renders that string identically to Mac OS X.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Tom Gewecke
Sent: Wednesday, 19 March 2014 6:24 AM
To: Jean-Fran?ois Colson
Cc: Naena Guru; UnicoDe List
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)


On Mar 18, 2014, at 12:10 PM, Jean-Fran?ois Colson wrote:


Using the Sinhala Qwerty layout in Mac OS X with the Apple Sinhala fonts, I typed

karYYalvl ynhR p;kni

Shouldn?t it be p;khi?

Yes, sorry.

karYYalvl ynhR p;khi

????????????? ?????? ?????

Graphic

[cid:image001.png at 01CF4345.55B7FF80]


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/98307c71/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 8338 bytes
Desc: image001.png
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/98307c71/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 15888 bytes
Desc: image002.png
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/98307c71/attachment-0001.png>

From michel at suignard.com  Tue Mar 18 15:56:13 2014
From: michel at suignard.com (Michel Suignard)
Date: Tue, 18 Mar 2014 20:56:13 +0000
Subject: "Heron" element in Unicode?
In-Reply-To: <CALgEMhwmXxav4E-fHMzi_RaLa04ddnyWeJgh214gOZoxg9oE4A@mail.gmail.com>
References: <CAKoET3X6wNvYFtTp2D9sSkE-n5J9jhpPff6mnOuHikbTwfTzKQ@mail.gmail.com>
 <CALgEMhwmXxav4E-fHMzi_RaLa04ddnyWeJgh214gOZoxg9oE4A@mail.gmail.com>
Message-ID: <706c3c73189a408d9a03eacb36cf6bc2@CO1PR02MB157.namprd02.prod.outlook.com>

>> It's a simplification of ?, and similar to but not the same as ?. If 
>> not currently in Unicode, is it the sort of thing that might be 
>> considered for addition in the future?

>People have been talking for many years about encoding the relatively few CJK components that do not exist as characters in there own right, and I think that there would be some support from the relevant committees if a well-presented proposal was submitted.

Isn't that component even used in 10646 Annex S.1.4.3 (9th pair)? It is one of my surprise that 10646 can't even be fully textually documented using 10646 character elements (we use many pictures in Annex S). It has been one of my goals to get all Annex S 'components' to be fully encoded, but I'd need time to create the glyphs, unless someone wants to volunteer.

Michel


From tom at bluesky.org  Tue Mar 18 15:57:11 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 13:57:11 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
 <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>
Message-ID: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>


On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote:

>  
> Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering?   See graphic below.
>  
> <image002.png>

The OS X version is the most correct according my limited knowledge of the script.  I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one.  This can be fixed by using the Bhashitha font.

Also it's possible that some of the characters should be "touching".  I did not add the code  for that and don't think any of my current fonts support it.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/9bbdafd6/attachment.html>

From doug at ewellic.org  Tue Mar 18 16:19:29 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 14:19:29 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <20140318141929.665a7a7059d7ee80bb4d670165c8327d.f5c5f67bc7.wbe@email03.secureserver.net>

The attached image (also at
http://ewellic.org/images/sinhala-babelpad.jpg) shows how I see Tom's
corrected string on Windows 7 running BabelPad, in both Iskoola Pota and
Nirmala UI.

Different rendering based on different operating systems, versions, and
applications is unfortunate, but no more so than a solution which only
works within a web browser, and not at all on IE8.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sinhala-babelpad.jpg
Type: image/jpeg
Size: 25452 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/de76c483/attachment.jpg>

From jf at colson.eu  Tue Mar 18 16:28:09 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Tue, 18 Mar 2014 22:28:09 +0100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
Message-ID: <5328BA69.9050502@colson.eu>

Le 18/03/14 20:37, Doug Ewell a ?crit :
> Tom, with typo spotted and corrected by Jean-Fran?ois, seems to have
> found it:
>
> ????????????? ??????
> ?????
>
> The sequence of code points would thus be:
>
> 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
> 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2
>
> Naena, is this what you were looking for?

It seems there?s still a big difference in the second syllable.


>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From marc at keyman.com  Tue Mar 18 16:44:43 2014
From: marc at keyman.com (Marc Durdin)
Date: Tue, 18 Mar 2014 21:44:43 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
 <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>
 <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local>

I've done some more analysis now that I've arrived at my office (and if I'd read Doug's email earlier, would have been able to see this too).  The email I received, on my notebook running Outlook 2010, has had U+200D stripped out from that Sinhala text - hence the rendering difference.  Now it gets weirder.  I also have Outlook 2010 running on my office computer, also attached to the same Exchange server: so I have two copies of the same email, which in a sane world would be byte-for-byte identical.  On my office computer, U+200D has not been stripped out, and the text renders in the same way as OS X.

I am struggling to understand why two copies of the same email have ended up with different content - given the clients are both running the same version of Windows and the same version of Outlook (even down to the same updates and security patches), connected to the same Exchange server.  MS Office Language Preferences do not list Sinhala in either case.  The same email on my iPhone has correct content, as does the webmail version.

Anyone got any ideas?

Marc

From: Tom Gewecke [mailto:tom at bluesky.org]
Sent: Wednesday, 19 March 2014 7:57 AM
To: Marc Durdin
Cc: Jean-Fran?ois Colson; Naena Guru; UnicoDe List
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)


On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote:


Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering?   See graphic below.

<image002.png>

The OS X version is the most correct according my limited knowledge of the script.  I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one.  This can be fixed by using the Bhashitha font.

Also it's possible that some of the characters should be "touching".  I did not add the code  for that and don't think any of my current fonts support it.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/66d31078/attachment.html>

From doug at ewellic.org  Tue Mar 18 16:50:42 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 14:50:42 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
Message-ID: <20140318145042.665a7a7059d7ee80bb4d670165c8327d.c4a6e2336e.wbe@email03.secureserver.net>

Jean-Fran?ois Colson <jf at colson dot eu> wrote:

> It seems there?s still a big difference in the second syllable.

Naena's original text "kaaryyaalavala" seems to imply the second
syllable begins with "r" followed by "ya". Is the "r" supposed to form a
conjunct with the second "ya", as his font shows, rather than the first?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From tom at bluesky.org  Tue Mar 18 16:53:35 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 14:53:35 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <5328BA69.9050502@colson.eu>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
 <5328BA69.9050502@colson.eu>
Message-ID: <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org>


On Mar 18, 2014, at 2:28 PM, Jean-Fran?ois Colson wrote:

>> 
>> The sequence of code points would thus be:
>> 
>> 0D9A 0DCF 0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA 0DCF 0DBD 0DC0 0DBD 0020
>> 0DBA 0DB1 0DC4 0DCA 200D 0DBB 0020 0DB4 0DA9 0D9A 0DC4 0DD2
>> 
>> Naena, is this what you were looking for?
> 
> It seems there?s still a big difference in the second syllable.

Are you referring to "ryy"  (0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA)?

That is the correct encoding I think.  But most fonts don't display it quite right.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/81aace09/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-03-18 at 2.51.11 PM.png
Type: image/png
Size: 26998 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/81aace09/attachment.png>

From asmusf at ix.netcom.com  Tue Mar 18 16:48:46 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 18 Mar 2014 14:48:46 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
 <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>
 <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>
Message-ID: <5328BF3E.9050100@ix.netcom.com>

On 3/18/2014 1:57 PM, Tom Gewecke wrote:
>
> On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote:
>
>> Can anyone who is more knowledgeable in Unicode Sinhala tell me which 
>> is the correct rendering?   See graphic below.
>> <image002.png>
>
> The OS X version is the most correct according my limited knowledge of 
> the script.  I think the Apple font does not place the diacritic over 
> the second character correctly, it should be over the next one.  This 
> can be fixed by using the Bhashitha font.
I get the "OS X" appearance (or one that matches the "graphics" on Win7 
+ viewin in Thunderbird).
>
> Also it's possible that some of the characters should be "touching". 
>  I did not add the code  for that and don't think any of my current 
> fonts support it.
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/e24af130/attachment.html>

From Tom at bluesky.org  Tue Mar 18 16:56:51 2014
From: Tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 14:56:51 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <5328BA69.9050502@colson.eu>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
 <5328BA69.9050502@colson.eu>
Message-ID: <326E464D-AA06-4BCC-A72F-1215BE7EA3CF@bluesky.org>

PS A good source for info on the Sinhala codes, etc is

https://www.microsoft.com/typography/OpenTypeDev/sinhala/intro.htm
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/96c77c01/attachment.html>

From marc at keyman.com  Tue Mar 18 17:02:10 2014
From: marc at keyman.com (Marc Durdin)
Date: Tue, 18 Mar 2014 22:02:10 +0000
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local>
References: <F11F928E5E234C1CABC0CFCBB3DB67EC@DougEwell>
 <CAHK3Hy0vkweHNtbx7GPPLz-b+HH1DWwzVpeu4Lxza_Rq1r2A2g@mail.gmail.com>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD14170@federation.tavultesoft.local>
 <CAHK3Hy1KBYLBW8__+r5FwFMLy7YObJYj5xvJeRruFikR2bOX5w@mail.gmail.com>
 <97765363E97E4C2E9CE7F318327A11D1@DougEwell>
 <CAHK3Hy2BXv=UANOjfZW_DQTvq6QgoS9BjBBj0wS==i2P41_GsA@mail.gmail.com>
 <CD009DBF-B3FE-4E0F-847E-8F2A14A28D31@bluesky.org>
 <53289A14.9050108@colson.eu>
 <A4674E28-D6E6-45BF-AB03-11C1306F120C@bluesky.org>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD1E637@federation.tavultesoft.local>
 <7BA53618-280D-439A-B683-01EC958E28FD@bluesky.org>
 <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F031@federation.tavultesoft.local>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD1F345@federation.tavultesoft.local>

And I understand the issue now.  My notebook did not have any ?complex script? languages installed as ?Editing Languages? in MS Office Language Preferences.  Thus, it stripped out U+200D when presenting the Sinhala text.  My office computer had Arabic installed as an ?Editing Language,? and so the content rendered correctly.  The thing that threw me is that this is not even a rendering-level issue but a content-level issue ? the content is corrupted before it ever gets to the renderer.

Traps for young players (and honestly, Microsoft, I should not have to install an ?Editing? language to view an email?)  Sorry for spamming the list.

Marc

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marc Durdin
Sent: Wednesday, 19 March 2014 8:45 AM
To: Tom Gewecke
Cc: Naena Guru; Jean-Fran?ois Colson; UnicoDe List
Subject: RE: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)

I?ve done some more analysis now that I?ve arrived at my office (and if I?d read Doug?s email earlier, would have been able to see this too).  The email I received, on my notebook running Outlook 2010, has had U+200D stripped out from that Sinhala text ? hence the rendering difference.  Now it gets weirder.  I also have Outlook 2010 running on my office computer, also attached to the same Exchange server: so I have two copies of the same email, which in a sane world would be byte-for-byte identical.  On my office computer, U+200D has not been stripped out, and the text renders in the same way as OS X.

I am struggling to understand why two copies of the same email have ended up with different content ? given the clients are both running the same version of Windows and the same version of Outlook (even down to the same updates and security patches), connected to the same Exchange server.  MS Office Language Preferences do not list Sinhala in either case.  The same email on my iPhone has correct content, as does the webmail version.

Anyone got any ideas?

Marc

From: Tom Gewecke [mailto:tom at bluesky.org]
Sent: Wednesday, 19 March 2014 7:57 AM
To: Marc Durdin
Cc: Jean-Fran?ois Colson; Naena Guru; UnicoDe List
Subject: Re: Dead and Compose keys (was: Re: Romanized Singhala got great reception in Sri Lanka)


On Mar 18, 2014, at 1:48 PM, Marc Durdin wrote:


Can anyone who is more knowledgeable in Unicode Sinhala tell me which is the correct rendering?   See graphic below.

<image002.png>

The OS X version is the most correct according my limited knowledge of the script.  I think the Apple font does not place the diacritic over the second character correctly, it should be over the next one.  This can be fixed by using the Bhashitha font.

Also it's possible that some of the characters should be "touching".  I did not add the code  for that and don't think any of my current fonts support it.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/3d423093/attachment.html>

From tom at bluesky.org  Tue Mar 18 17:13:24 2014
From: tom at bluesky.org (Tom Gewecke)
Date: Tue, 18 Mar 2014 15:13:24 -0700
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <CAGJ7U-UzxCOnyRU6KJD3Fv5TjJPBqzfSARoa105t9x_apQKF8A@mail.gmail.com>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
 <CAGJ7U-UzxCOnyRU6KJD3Fv5TjJPBqzfSARoa105t9x_apQKF8A@mail.gmail.com>
Message-ID: <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org>


On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote:

> I suspect it was a fishing expedition to illustrate how awkward it is to type on Unicode keyboard layouts versus his system.
> 

Interesting question perhaps.  Is it more awkward to type 14 strokes as k a a r y y a a l a v a l a  or to type 9 as  ???  ? ???? ???? ??  ?  ?  ?  ?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/73aeecf5/attachment.html>

From lang.support at gmail.com  Tue Mar 18 18:42:51 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Wed, 19 Mar 2014 10:42:51 +1100
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
 <CAGJ7U-UzxCOnyRU6KJD3Fv5TjJPBqzfSARoa105t9x_apQKF8A@mail.gmail.com>
 <499149EE-04DF-4E49-B899-E1046CBD4C3A@bluesky.org>
Message-ID: <CAGJ7U-V7kV4P9bZ2wAZJBGWRr3Ocsxcqr-_9FpxfAtYqi-Jhwg@mail.gmail.com>

Different individuals,  groups and communities can bring their own
expectations to input layout designs.
Design is a balance between capabilities and limitations of the input
framework versus the expectations of the user community around how they
language should work.

I work with multiple operating systems and even more input frameworks.

I have my preferred input frameworks. But it ultimately air is a question
of knowing your tools.

For instance, if you compile a keyborad layout from the commandline with
MSKLC you can chain deadkeys,  build against custom locales in Vista and
Win7, or build against unsupported language codes in Win8+

Andrew
On 19/03/2014 9:13 AM, "Tom Gewecke" <tom at bluesky.org> wrote:

>
> On Mar 18, 2014, at 12:52 PM, Andrew Cunningham wrote:
>
> I suspect it was a fishing expedition to illustrate how awkward it is to
> type on Unicode keyboard layouts versus his system.
>
>
> Interesting question perhaps.  Is it more awkward to type 14 strokes as k
> a a r y y a a l a v a l a  or to type 9 as  ? ?  ?  ???  ???  ?  ?  ?  ? ?
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140319/c245a116/attachment.html>

From richard.wordingham at ntlworld.com  Tue Mar 18 20:00:57 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 19 Mar 2014 01:00:57 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
References: <CAHK3Hy307vk5RFtyxiUvc9wPgUcKKLB0o_GDg+2=gys_3on5hQ@mail.gmail.com>
 <1394975413.87133.YahooMailNeo@web87803.mail.ir2.yahoo.com>
 <1394985945.36131.YahooMailNeo@web87804.mail.ir2.yahoo.com>
 <CAHK3Hy1OCdG=XYjt=vURM+vYR=hebfW27zOtPnJcG7qqj_jqNw@mail.gmail.com>
Message-ID: <20140319010057.42f5b5ff@JRWUBU2>

On Mon, 17 Mar 2014 18:18:50 -0500
Naena Guru <naenaguru at gmail.com> wrote:
(in topic 'Romanized Singhala got great reception in Sri Lanka')

> Typing is a nightmare.

> When you backspace it destroys multiple keystrokes.

I suspect this is a widespread and unsolved problem.  If one positions
the cursor after a character entered by multiple characters in a
previous program, there doesn't seem to be a way of undoing the
previous typing.  A Latin-1 analogy is entering e-acute by typing 'e
and then backspacing later.  That will usually simply delete the e-acute
rather than leaving the dead key apostrophe.

In some ways it may be an insoluble problem rather than merely
difficult. For example, when using KMFL to type the Tai Tham script, I
have two ways of typing the combination <U+1A49 TAI THAM LETTER HIGH
HA, U+1A56 TAI THAM CONSONANT SIGN MEDIAL LA> (historically
corresponding to the single character U+199C NEW TAI LUE LETTER HIGH LA
and sometimes listed as a letter in its own right):

1) !}
2) s!]

I use '!' because I can't get altGr to work in KMFL.  It works as
a dead key.  Key sequence (1) views the Tai Tham sequence as a single
character: key sequence (2) views it as the sequence of Unicode
characters.  The key stokes are based on the Thai Kesmanee keyboard.
The mnemonic for the sequences with '!' is that the single key stroke
']' results in U+1A43 TAI THAM LETTER LA.  The single, shifted key
stroke '}' results in a comma, as in the Thai keyboard.

If I position the curor after the character sequence,
what should I get after typing <backspace> and then the character '}'?
Should I get:

(a) <U+14A9, U+1A56> (by assuming input sequence 1);
(b) <U+14A9, U+14A9, U+1A56> (by assuming input sequence 2); or
(c) <U+14A9, U+002C COMMA> (what I actually get)?

> Search and
> replace is not possible, at least the way do it with English.

I suspect the problem you have have is that editing tools expect the
user to think of a combination of base character and combining mark as a
single character.  I don't know how to counter this expectation.  For
LibreOffice, I do search and replace by choosing the 'regular
expression' option, as this does allow the user to work with characters
rather than legacy grapheme clusters (UAX #29: Unicode Text
Segmentation).

Richard.


From naa.ganesan at gmail.com  Wed Mar 19 03:04:01 2014
From: naa.ganesan at gmail.com (N. Ganesan)
Date: Wed, 19 Mar 2014 01:04:01 -0700
Subject: Urdu Nastaliq script
Message-ID: <CAA+QEUf0hp9okverqMieW9N1Uih+rp0zF8hVp-G+fjP9HLxQZA@mail.gmail.com>

http://tribune.com.pk/story/683067/inventing-revolution-the-man-who-gave-urdu-its-wings/

Inventing revolution: The man who gave Urdu its
wings<http://tribune.com.pk/story/683067/inventing-revolution-the-man-who-gave-urdu-its-wings/>
By Khalid Rahman <http://tribune.com.pk/author/4593/khalid-rahman/>
Published: March 15, 2014
 *KARACHI: *

*Ahmed Mirza Jamil changed the way all Urdu newspapers and books would be
published anywhere in the world; and he did it back in 1981 with his Noori
Nastaliq script that gave the Midas touch to desktop publishing.*

The present-day Urdu publishing owes its elegant contours to the
calligraphic skills of this great wizard of calligraphy.

Before being used in the composing software, InPage, the Noori Nastaliq was
created as a digital typeface (font) in 1981 when master-calligrapher Ahmed
Mirza Jamil and Monotype Imaging (then called Monotype Corp) collaborated
on a joint venture.

Earlier, Urdu newspapers, books and magazines needed manual calligraphers,
who were replaced by computer machines in Pakistan, India, UK and other
countries.

The government of Pakistan recognised Ahmad Mirza Jamil?s singular
achievement in 1982 by designating Noori Nastaliq as an ?Invention of
National Importance? and awarded him with the medal of distinction,
Tamgha-e-Imtiaz.

In recognition of his achievement, the University of Karachi also awarded
him the degree of Doctor of Letters, Honoris Causa.

Narrating the history of his achievement in his book, ?Revolution in Urdu
Composing?, he wrote: ?In future, Urdu authors will be able to compose
their books like the authors of the languages of Roman script. Now, the day
a manuscript is ready is the day the publication is ready for printing.
There is no waiting for calligraphers to give their time grudgingly, no
apprehension of mistakes creeping in, nor any complaints about the
calligraphers or operators not being familiar with the language.

?Soon our future generations will be asking incredulously whether it was
really true that there was a time when newspapers were painstakingly
manually calligraphed all through the night to be printed on high speed
machines in the morning. Were we really so primitive that our national
language had to limp along holding on to the crutches of the calligraphers
that made the completion of books an exercise ranging from months to years
depending upon their volume.?

Noted Urdu litterateur Ahmed Nadeem Qasmi paid tribute to Ahmed Mirza Jamil
during his lifetime.

He said, ?The revolution brought about by Noori Nastaliq in the field of
Urdu publishing sends out many positive signals. It has at last settled the
long-standing dispute about Urdu typewriter?s keys that had raged from the
time Pakistan was born. The future generations will surely be indebted to
him for this revolution.

Dr Ahmed Mirza Jamil passed away unsung on February 17, 2014. May his soul
be blessed.

*Published in The Express Tribune, March 15th, 2014.*
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140319/e34211d2/attachment.html>

From doug at ewellic.org  Tue Mar 18 20:33:39 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 19:33:39 -0600
Subject: Editing Sinhala and Similar Scripts
Message-ID: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>

Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

>> Typing is a nightmare.
>
>> When you backspace it destroys multiple keystrokes.
>
> I suspect this is a widespread and unsolved problem.

There are two types of people:

1. those who fully expect Backspace to erase a single keystroke, and 
feel it is a fatal flaw if it erases an entire combination, and

2. those who fully expect Backspace to erase an entire combination, and 
feel it is a fatal flaw if it erases just a single keystroke.

Unfortunately, both types exist in significant numbers.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ? 


From doug at ewellic.org  Tue Mar 18 20:50:48 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 18 Mar 2014 19:50:48 -0600
Subject: Dead and Compose keys (was: Re: Romanized Singhala got great
 reception in Sri Lanka)
In-Reply-To: <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org>
References: <20140318123720.665a7a7059d7ee80bb4d670165c8327d.845865c9df.wbe@email03.secureserver.net>
 <5328BA69.9050502@colson.eu>
 <58732B52-890F-40D8-9739-777749FC71FE@bluesky.org>
Message-ID: <C5D17273C6BA40E5A999FADEEF32D3DD@DougEwell>


Tom Gewecke wrote:

>> It seems there?s still a big difference in the second syllable.
>
> Are you referring to "ryy"  (0DBB 0DCA 200D 0DBA 0DCA 200D 0DBA)?
>
> That is the correct encoding I think.  But most fonts don't display it
> quite right.

On Windows 8.1, still using BabelPad, the "ryy" comes out just as Naena 
had it. Furthermore, it even comes out right on Windows Phone. See 
attached images.

So there it is.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell ?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sinhala-babelpad-8.jpg
Type: image/jpeg
Size: 45785 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/cea91817/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sinhala-babelpad-phone.jpg
Type: image/jpeg
Size: 13984 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140318/cea91817/attachment-0001.jpg>

From daniel.buenzli at erratique.ch  Wed Mar 19 08:17:00 2014
From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=)
Date: Wed, 19 Mar 2014 14:17:00 +0100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
Message-ID: <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch>

Le mercredi, 19 mars 2014 ? 02:33, Doug Ewell a ?crit :
> There are two types of people:
>  
> 1. those who fully expect Backspace to erase a single keystroke, and  
> feel it is a fatal flaw if it erases an entire combination, and
>  
> 2. those who fully expect Backspace to erase an entire combination, and  
> feel it is a fatal flaw if it erases just a single keystroke.
>  
> Unfortunately, both types exist in significant numbers.
Isn't it possible to classify appartenance to 1 or 2 according to script ? E.g. I suspect most french speaking person when backspacing an ? would like to erase the whole combination; for ? it seems even more obvious since usually it's introduced with a single keystroke.

Best,

Daniel


From lang.support at gmail.com  Wed Mar 19 08:19:55 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Thu, 20 Mar 2014 00:19:55 +1100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
Message-ID: <CAGJ7U-XTM53k+qS2Y6__VKknDu-MxYCACCrh9RrAN=_vR5BiJQ@mail.gmail.com>

LOL,  that's why,  if the input framework allows it, its easier to support
both approachable to backspace or at least an option to choose one or the
other.

; )

Andrew
On 19/03/2014 11:37 PM, "Doug Ewell" <doug at ewellic.org> wrote:

> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>
>  Typing is a nightmare.
>>>
>>
>>  When you backspace it destroys multiple keystrokes.
>>>
>>
>> I suspect this is a widespread and unsolved problem.
>>
>
> There are two types of people:
>
> 1. those who fully expect Backspace to erase a single keystroke, and feel
> it is a fatal flaw if it erases an entire combination, and
>
> 2. those who fully expect Backspace to erase an entire combination, and
> feel it is a fatal flaw if it erases just a single keystroke.
>
> Unfortunately, both types exist in significant numbers.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ?
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/c59a63f0/attachment.html>

From A.Schappo at lboro.ac.uk  Wed Mar 19 09:26:04 2014
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Wed, 19 Mar 2014 14:26:04 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
Message-ID: <CCD5284F-39BC-410B-AF6F-48EBAE6EE762@lboro.ac.uk>


On 19 Mar 2014, at 01:33, Doug Ewell wrote:

> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
> 
>>> Typing is a nightmare.
>> 
>>> When you backspace it destroys multiple keystrokes.
>> 
>> I suspect this is a widespread and unsolved problem.
> 
> There are two types of people:
> 
> 1. those who fully expect Backspace to erase a single keystroke, and feel it is a fatal flaw if it erases an entire combination, and
> 
> 2. those who fully expect Backspace to erase an entire combination, and feel it is a fatal flaw if it erases just a single keystroke.
> 
> Unfortunately, both types exist in significant numbers.
> 
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell ? 
> 

WRT Latin. I have just tested with OSX TextEdit and the precomposed character ? U+00E9

Backspace erases ?
Control+Backspace erases ? leaving one with e

I had not realised this was possible until I experimented with combinations of Backspace + alt/command/ctrl/shift

Andr?


From petercon at microsoft.com  Wed Mar 19 09:57:35 2014
From: petercon at microsoft.com (Peter Constable)
Date: Wed, 19 Mar 2014 14:57:35 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
Message-ID: <a5a5ee93a1c347f39c40df937b6cbffc@BL2PR03MB450.namprd03.prod.outlook.com>

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell

>>> When you backspace it destroys multiple keystrokes.


> There are two types of people:
>
> 1. those who fully expect Backspace to erase a single keystroke

It is nonsensical to talk about erasing a _keystroke_. That would be comparable to erasing a mouse click, or erasing a tap on a touch-sensitive device. These are user actions that may result in any number of machine states. Unless you can manage to build a time machine, at the time when the erasing is happening, there is no longer any record of what process might have been operating that responded to the user action or of what machine state was the result. 

All that is available to act on at that point are characters.


Peter


From emuller at adobe.com  Wed Mar 19 10:08:01 2014
From: emuller at adobe.com (Eric Muller)
Date: Wed, 19 Mar 2014 08:08:01 -0700
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <a5a5ee93a1c347f39c40df937b6cbffc@BL2PR03MB450.namprd03.prod.outlook.com>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
 <a5a5ee93a1c347f39c40df937b6cbffc@BL2PR03MB450.namprd03.prod.outlook.com>
Message-ID: <5329B2D1.5030603@adobe.com>

On 3/19/2014 7:57 AM, Peter Constable wrote:
> It is nonsensical to talk about erasing a _keystroke_. 

"undo", "revert" the effect of a keystroke. The concept is meaningful.

Eric.


From doug at ewellic.org  Wed Mar 19 11:38:19 2014
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 19 Mar 2014 09:38:19 -0700
Subject: Editing Sinhala and Similar Scripts
Message-ID: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net>

Andre Schappo <A dot Schappo at lboro dot ac dot uk> wrote:

> WRT Latin. I have just tested with OSX TextEdit and the precomposed
> character ? U+00E9
>
> Backspace erases ?
> Control+Backspace erases ? leaving one with e
>
> I had not realised this was possible until I experimented with
> combinations of Backspace + alt/command/ctrl/shift

That's the sort of feature I would just love. I also love the Alt+Tab
and Windows+Tab features to switch between windows in Windows. I am led
to believe that "normal users" (cf. nerds) hate this kind of hidden
feature, and either never use it or become annoyed when they invoke it
accidentally by hitting the magic key combination.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From doug at ewellic.org  Wed Mar 19 11:39:13 2014
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 19 Mar 2014 09:39:13 -0700
Subject: Editing Sinhala and Similar Scripts
Message-ID: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>

Peter Constable <petercon at microsoft dot com> wrote:

>> There are two types of people:
>>
>> 1. those who fully expect Backspace to erase a single keystroke
>
> It is nonsensical to talk about erasing a _keystroke_.

But that's what they expect.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From ken.whistler at sap.com  Wed Mar 19 12:28:07 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Wed, 19 Mar 2014 17:28:07 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
Message-ID: <B6B31BB04593D64F8B3E169A5C3DC62F233F87B1@USPHLEMB12C.global.corp.sap>

And I think you need to distinguish  between *proximate*
behavior in an editor and editing behavior in general.

Once a user enters editing mode, the expectation that we
(the software community writing text editors) have built,
in interaction with users, is that within reason, something that
you have *just* done in editing, can be easily undone.
And that is what "backspace" now means to many users.

Do do do do ... undo undo undo undo ... 
That should get you back to where you were before
the do do do do.

It is really annoying, particularly to efficient typists, when
a sequence of 4 keystrokes is *not* exactly undone by a
sequence of 4 backspace strokes. When that occurs, the
flow of text composition is suddenly interrupted by forcing
the user out of "compose" mode and into a completely different
"monitor and check what the state of the display is" mode that
can be very annoying.

But that is what I am referring to as the proximate behavior.
An editing implementation can and should collect a reasonable
undo buffer, which *does* know about complicated states,
including significant operations like selection deletion, which
are the most common types of operations that composers
really, really, really want to be able to undo. But in cases
like that, the backspace key is only a partial aspect of the undo,
and I suspect that most people are not all that annoyed when they
have to shift out of "compose" mode to accomplish more
significant undo operations.

But what Peter was pointing out that in the *generic* case
for editing, such as first cursor down at some random
location in already existing text, there is no existing history
of how that text was created. And thus there are no "keystrokes"
to be undone by hitting a backspace at that point. Yet a
backspace command has to do *something* reasonable,
and my own assessment is that it shouldn't be too different
from what a backspace key does during active text entry.

So that is the real conundrum here. Getting all of the commands
of a text editor to work efficiently the way users expect is
itself an art form -- even for relatively simple scripts. So it
really should not be too surprising that people have rather
intense arguments about how such operations *should* work
for abugidas. (Particularly because such operations very
often are not occurring in monolingual/monoscriptal
contexts, and expectations carry over from one language/script
to another.)

--Ken

> Peter Constable <petercon at microsoft dot com> wrote:
> 
> >> There are two types of people:
> >>
> >> 1. those who fully expect Backspace to erase a single keystroke
> >
> > It is nonsensical to talk about erasing a _keystroke_.
> 
> But that's what they expect.


From doug at ewellic.org  Wed Mar 19 13:07:05 2014
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 19 Mar 2014 11:07:05 -0700
Subject: Editing Sinhala and Similar Scripts
Message-ID: <20140319110705.665a7a7059d7ee80bb4d670165c8327d.59f7242c57.wbe@email03.secureserver.net>

"Whistler, Ken" <ken dot whistler at sap dot com> wrote:

> But what Peter was pointing out that in the *generic* case
> for editing, such as first cursor down at some random
> location in already existing text, there is no existing history
> of how that text was created. And thus there are no "keystrokes"
> to be undone by hitting a backspace at that point.

Well, of course you and Peter are right, and stated it well as always.

Probably a better way for me to say it would be that, for any visual
combination of letters or marks that are decomposable in some way, such
as an acute accent over an 'e' or a conjunct cluster in an Indic script,
there are at least some users who expect Backspace to erase one element
of the cluster (the "last," whatever that means) and some who expect it
to erase the entire cluster. Each type of user will be frustrated if the
other behavior occurs.

> So that is the real conundrum here. Getting all of the commands
> of a text editor to work efficiently the way users expect is
> itself an art form -- even for relatively simple scripts. So it
> really should not be too surprising that people have rather
> intense arguments about how such operations *should* work
> for abugidas. (Particularly because such operations very
> often are not occurring in monolingual/monoscriptal
> contexts, and expectations carry over from one language/script
> to another.)

Daniel B?nzli pointed out that French-speaking users would consider
'?' a unitary letter, and would expect Backspace to erase the whole
thing, even if "under the hood" it might be encoded in NFD as <0065
0301>. It's not at all clear that a Sinhala user would expect Backspace
to delete a cluster of three or four "letters" (Naena certainly didn't
like that). But the two scenarios are quite similar as far as software
and encoding are concerned. So maybe a French keyboard could have one
Backspace behavior built in, and a Sinhala keyboard could have something
different, something that may or may not be possible under various input
architectures.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From richard.wordingham at ntlworld.com  Wed Mar 19 15:29:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 19 Mar 2014 20:29:09 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch>
References: <5CEF7788240E43EEA52B9D52F40F50FF@DougEwell>
 <14E08677F24C44ADA175BFBB9CA759DD@erratique.ch>
Message-ID: <20140319202909.6b7572cc@JRWUBU2>

On Wed, 19 Mar 2014 14:17:00 +0100
Daniel B?nzli <daniel.buenzli at erratique.ch> wrote:

> Le mercredi, 19 mars 2014 ? 02:33, Doug Ewell a ?crit :
> > There are two types of people:
> >  
> > 1. those who fully expect Backspace to erase a single keystroke,
> > and feel it is a fatal flaw if it erases an entire combination, and
> >  
> > 2. those who fully expect Backspace to erase an entire combination,
> > and feel it is a fatal flaw if it erases just a single keystroke.
> >  
> > Unfortunately, both types exist in significant numbers.

And I belong to a third group - I expect it to delete a
Unicode character.

> Isn't it possible to classify appartenance to 1 or 2 according to
> script ? E.g. I suspect most french speaking person when backspacing
> an ? would like to erase the whole combination; for ? it seems even
> more obvious since usually it's introduced with a single keystroke.

It's not as simple as script.  For an English speaker who enters it on
a keyboard, it's normally entered with multiple keystrokes, most
typically via a dead key.  Now, if I type it in using an out of order
sequence such as 'e, it is quite reasonable for it to be stored as a
single composed character and deleted by backspace.  On the other
hand, if I type it in using an XSAMPA-based keyboard sequence such as
e_H, I expect the backspace to delete just the accent, just as I am
used to for the sequence O_H which yields 2 characters, open o with
acute (??).  The diacritic here would not not arbitrary - I would
be using it to indicate a specific tone.  (It came as a nasty shock to
find my e-mail client, Claws on Ubuntu, takes out the entire cluster.
For Thai legacy grapheme clusters, it just takes out the last character
entered.)  At the moment I have made my life more difficult for myself
by devising a keyboard that generates NFC if the key strokes are in the
right order. 

As a reasonable guide, backspace should not take out more than one NFC
character, and I would defend this even for Cyrillic-script tone marking
in Serbian.

Now, there's supposed to be an interface definition for using
incremental keyboard typing as in Keyman, where keyboards can be
arranges so that one sees what's been typed in already.  Where is it?
It is rather important for an application to know when it can normalise
input characters.  For example, LibreOffice helpfully swaps round a
tone Thai mark with a following vowel mark below, with the slightly
bizarre consequence that the sequence ko kai, mai ek, sara u, backspace
yields <KO KAI, SARA U>.  Traditionally, the sequence yields a beep and
just <KO KAI> - the input handler rejects the SARA U because it does
not accord with the character order prescribed by WTT (wing thuk thi).

Richard.


From petercon at microsoft.com  Wed Mar 19 22:43:05 2014
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 20 Mar 2014 03:43:05 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
Message-ID: <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>

If you click into the existing text in this email and backspace, what keystroke will you expect to be "erased"? Your system has no way of knowing what keystroke might have been involved in creating the text.

What is _can_ make sense to talk about is to say that a user expects execution of a particular key sequence, such as pressing a Backspace key, to have a particular editing effect on the content of text. "Erasing a keystroke" and "keystrokes resulting in edits" are different things. One makes sense, the other does not.

It may seem like I'm being pedantic, but I think the distinction is important. Our failure is in framing our thinking from years of experience (and perhaps some behaviours originally influenced by typewriter and teletype technologies) in which a keyboard has a bunch of keys that add characters, and variations on that that even include a lot of logic to get input keying sequences that can generate tens of thousands of different character; but then one or two keys (delete, backspace) that can only operate in very dumb ways. (We've also always assumed that any logic in keying behaviours can be conditioned only by the input sequences, but not by any existing content, but that steps beyond my earlier point.) These constraints in how we think limit possibilities


Peter


-----Original Message-----
From: Doug Ewell [mailto:doug at ewellic.org] 
Sent: March 19, 2014 9:39 AM
To: Peter Constable; unicode at unicode.org
Subject: RE: Editing Sinhala and Similar Scripts

Peter Constable <petercon at microsoft dot com> wrote:

>> There are two types of people:
>>
>> 1. those who fully expect Backspace to erase a single keystroke
>
> It is nonsensical to talk about erasing a _keystroke_.

But that's what they expect.

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell


From jlturriff at centurylink.net  Wed Mar 19 23:17:24 2014
From: jlturriff at centurylink.net (J. Leslie Turriff)
Date: Wed, 19 Mar 2014 21:17:24 -0700
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
Message-ID: <201403192117.24313.jlturriff@centurylink.net>

	Perhaps it might be useful to be able to distinguish between an "editing 
mode" and a "composition mode":  editing mode would be active when a document 
is first loaded into the editor, when the editor has no keystroke history to 
consult, and  in this mode the backspace key would merely remove text "glyph 
by glyph", so to speak, as happens with ASCII text;  composition mode would 
be active when keystrokes have been recorded in a buffer, so that backspace 
could be used to "unstroke" the original strokes; the "unstroke" operations 
would mimic the order in which the originals were entered, even if the editor 
had optomized the composition.

Leslie

On Wednesday 19 March 2014 20:43:05 Peter Constable wrote:
> If you click into the existing text in this email and backspace, what
> keystroke will you expect to be "erased"? Your system has no way of knowing
> what keystroke might have been involved in creating the text.
>
> What is _can_ make sense to talk about is to say that a user expects
> execution of a particular key sequence, such as pressing a Backspace key,
> to have a particular editing effect on the content of text. "Erasing a
> keystroke" and "keystrokes resulting in edits" are different things. One
> makes sense, the other does not.
>
> It may seem like I'm being pedantic, but I think the distinction is
> important. Our failure is in framing our thinking from years of experience
> (and perhaps some behaviours originally influenced by typewriter and
> teletype technologies) in which a keyboard has a bunch of keys that add
> characters, and variations on that that even include a lot of logic to get
> input keying sequences that can generate tens of thousands of different
> character; but then one or two keys (delete, backspace) that can only
> operate in very dumb ways. (We've also always assumed that any logic in
> keying behaviours can be conditioned only by the input sequences, but not
> by any existing content, but that steps beyond my earlier point.) These
> constraints in how we think limit possibilities


From lang.support at gmail.com  Wed Mar 19 23:21:59 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Thu, 20 Mar 2014 15:21:59 +1100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
Message-ID: <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>

There is also a distinction between editing an existing document that you
opened as distinct from writing a document, going back to a certain point
in document and editing that section within the same editing session.

In the first case their is no history, in the second case their may be
history to work with.

Andrew


On 20 March 2014 14:43, Peter Constable <petercon at microsoft.com> wrote:

> If you click into the existing text in this email and backspace, what
> keystroke will you expect to be "erased"? Your system has no way of knowing
> what keystroke might have been involved in creating the text.
>
> What is _can_ make sense to talk about is to say that a user expects
> execution of a particular key sequence, such as pressing a Backspace key,
> to have a particular editing effect on the content of text. "Erasing a
> keystroke" and "keystrokes resulting in edits" are different things. One
> makes sense, the other does not.
>
> It may seem like I'm being pedantic, but I think the distinction is
> important. Our failure is in framing our thinking from years of experience
> (and perhaps some behaviours originally influenced by typewriter and
> teletype technologies) in which a keyboard has a bunch of keys that add
> characters, and variations on that that even include a lot of logic to get
> input keying sequences that can generate tens of thousands of different
> character; but then one or two keys (delete, backspace) that can only
> operate in very dumb ways. (We've also always assumed that any logic in
> keying behaviours can be conditioned only by the input sequences, but not
> by any existing content, but that steps beyond my earlier point.) These
> constraints in how we think limit possibilities
>
>
> Peter
>
>
> -----Original Message-----
> From: Doug Ewell [mailto:doug at ewellic.org]
> Sent: March 19, 2014 9:39 AM
> To: Peter Constable; unicode at unicode.org
> Subject: RE: Editing Sinhala and Similar Scripts
>
> Peter Constable <petercon at microsoft dot com> wrote:
>
> >> There are two types of people:
> >>
> >> 1. those who fully expect Backspace to erase a single keystroke
> >
> > It is nonsensical to talk about erasing a _keystroke_.
>
> But that's what they expect.
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>


-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
          lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/d8ddb2f7/attachment.html>

From lang.support at gmail.com  Wed Mar 19 23:25:34 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Thu, 20 Mar 2014 15:25:34 +1100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <201403192117.24313.jlturriff@centurylink.net>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <201403192117.24313.jlturriff@centurylink.net>
Message-ID: <CAGJ7U-VAD7D1-Z4pNbs_OYWMe84VKYJTa3ST-9q9D=+UoxA=gA@mail.gmail.com>

On 20 March 2014 15:17, J. Leslie Turriff <jlturriff at centurylink.net> wrote:

>         Perhaps it might be useful to be able to distinguish between an
> "editing
> mode" and a "composition mode":  editing mode would be active when a
> document
> is first loaded into the editor, when the editor has no keystroke history
> to
> consult, and  in this mode the backspace key would merely remove text
> "glyph
> by glyph", so to speak, as happens with ASCII text;  composition mode would
> be active when keystrokes have been recorded in a buffer, so that backspace
> could be used to "unstroke" the original strokes; the "unstroke" operations
> would mimic the order in which the originals were entered, even if the
> editor
> had optomized the composition.
>
>
>

Although that requires an input framework and application that utilise that
buffer in various ways during "composition mode". It is possible, and in
the past I have written a manual and run training on advanced editing for
Dinka language translators on how to utilise such features. But not many
applications support such features.

Andrew

-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
          lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/00ca42de/attachment.html>

From verdy_p at wanadoo.fr  Wed Mar 19 23:59:49 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 20 Mar 2014 05:59:49 +0100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
Message-ID: <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>

The Backspace key has never been considered an "Undo" key.
Some OSes or keyboard provide an Undo key or an equivalent shortcut like
CTRL+Z (but even in this case the editor may want to undo in one operation
multiple successive insertions).
In the ti,e of typewriters; backspaces meant going back one full cluster
(in order to be able to retype it completely (e.g. with a blanking typex).
Its effect was effectively to going backwardto the start of the cluster.
On typewriters and modern computer keyboards that have dead keys, the
backspace key was ignoring that dead key and going backward to the previous
cluster. So deadkeys are not counted.
With keyboards using a compose key method, there is NO character output in
the edit buffer as long as the compose sequence is not complete, so there's
a single string inserted continaing the result of the composition and users
will not see anything inserted in the edited text before, so there's
nothing you can delete with backspace. Users also should not have to care
it the composed seuqnce was encoded in NFC or NFD form (with precomposed
characters or with decomposed base character followed by diacritics).
So they expect just one cluster.

It you really wat to delete an accent on top of a Latin letter, Backspace
is certainly not hat the Backspace key will usually perform. You would need
another key such as ALT+Backspace to *transform* the previous cluster
before the cursor into a shorter one.

But here this time, it is sometimes not really possible to predict which
diacritic will be deleted if there are multiple ones and they are unordered
(i.e. these combining diacritics have **distinct** and **non-zero**
combining classes): may be all these diacritics with distinct non-zero
combining classes should be deleted in a single operation, or otherwise
just the last **ordered** diacritic if there's still one.

(Note: here the term "diacritic" is meant "broadly" and it may be any
combining mark, or joiner control lie CGJ, or ZWJ, or ZWNJ, and sometimes a
modifier letter that may participage to the same cluster such has
apostrophe or middle dot: the Catalan letter L with middle dot may be
viewed in the editor as the letter L containing a combined diacritic, so
ALT+BACKSPACE could replace the L with middle dot by the letter L alone,
even if it's not canonically decomposable, as long as the editor knows that
it is operating within a Catalan locale)

In all cases, the action being performed by Backspace or alt+Backspace is
compeltely independant of the underlying Unicode encoding and shoudl also
be independant of the normalization form (except in advanced technical
editor mode such as "visible controls" where every encoded character is
rendered separately with a special form to make them visible).

In my opinion the standard edit mode (working in visual WYSIWYG mode)
should not depend on the encoding and Backspace should not create in the
edited text new oddities that were not really inserted and made visible
imediately when they were first entered.

Indic diacritics are entered separately from the base letter and they are
combined progressively. They are also ordered, for this reason Backspace
can remove them in a predicatable order one by one. The same could be saif
about Hebrew and Arabic diacritics entered separately (even if sometimes
they could be unordered: Backspace will will still delete all diacritics
that are in the same unordered group, even if it keeps the base letter)

But for Latin/Greek/Cyrillic keyboards that use dead keys for entering
unordered diacritics (and that are not even made visible in the document
before you have typed the base letter), it makes no sense for Backspace to
choose between these diacritics. Backspace will then delete all the full
cluster up tp the base letter.


2014-03-20 5:21 GMT+01:00 Andrew Cunningham <lang.support at gmail.com>:

> There is also a distinction between editing an existing document that you
> opened as distinct from writing a document, going back to a certain point
> in document and editing that section within the same editing session.
>
> In the first case their is no history, in the second case their may be
> history to work with.
>
> Andrew
>
>
> On 20 March 2014 14:43, Peter Constable <petercon at microsoft.com> wrote:
>
>> If you click into the existing text in this email and backspace, what
>> keystroke will you expect to be "erased"? Your system has no way of knowing
>> what keystroke might have been involved in creating the text.
>>
>> What is _can_ make sense to talk about is to say that a user expects
>> execution of a particular key sequence, such as pressing a Backspace key,
>> to have a particular editing effect on the content of text. "Erasing a
>> keystroke" and "keystrokes resulting in edits" are different things. One
>> makes sense, the other does not.
>>
>> It may seem like I'm being pedantic, but I think the distinction is
>> important. Our failure is in framing our thinking from years of experience
>> (and perhaps some behaviours originally influenced by typewriter and
>> teletype technologies) in which a keyboard has a bunch of keys that add
>> characters, and variations on that that even include a lot of logic to get
>> input keying sequences that can generate tens of thousands of different
>> character; but then one or two keys (delete, backspace) that can only
>> operate in very dumb ways. (We've also always assumed that any logic in
>> keying behaviours can be conditioned only by the input sequences, but not
>> by any existing content, but that steps beyond my earlier point.) These
>> constraints in how we think limit possibilities
>>
>>
>> Peter
>>
>>
>> -----Original Message-----
>> From: Doug Ewell [mailto:doug at ewellic.org]
>> Sent: March 19, 2014 9:39 AM
>> To: Peter Constable; unicode at unicode.org
>> Subject: RE: Editing Sinhala and Similar Scripts
>>
>> Peter Constable <petercon at microsoft dot com> wrote:
>>
>> >> There are two types of people:
>> >>
>> >> 1. those who fully expect Backspace to erase a single keystroke
>> >
>> > It is nonsensical to talk about erasing a _keystroke_.
>>
>> But that's what they expect.
>>
>> --
>> Doug Ewell | Thornton, CO, USA
>> http://ewellic.org | @DougEwell
>>
>>
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>
>
>
> --
> Andrew Cunningham
> Project Manager, Research and Development
> (Social and Digital Inclusion)
> Public Libraries and Community Engagement
> State Library of Victoria
> 328 Swanston Street
> Melbourne VIC 3000
> Australia
>
> Ph: +61-3-8664-7430
> Mobile: 0459 806 589
> Email: acunningham at slv.vic.gov.au
>           lang.support at gmail.com
>
> http://www.openroad.net.au/
> http://www.mylanguage.gov.au/
> http://www.slv.vic.gov.au/
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/20f2fbf3/attachment.html>

From asmusf at ix.netcom.com  Thu Mar 20 02:31:41 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 20 Mar 2014 00:31:41 -0700
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <201403192117.24313.jlturriff@centurylink.net>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <201403192117.24313.jlturriff@centurylink.net>
Message-ID: <532A995D.60102@ix.netcom.com>

On 3/19/2014 9:17 PM, J. Leslie Turriff wrote:
> 	Perhaps it might be useful to be able to distinguish between an "editing
> mode" and a "composition mode":  editing mode would be active when a document
> is first loaded into the editor, when the editor has no keystroke history to
> consult, and  in this mode the backspace key would merely remove text "glyph
> by glyph", so to speak, as happens with ASCII text;  composition mode would
> be active when keystrokes have been recorded in a buffer, so that backspace
> could be used to "unstroke" the original strokes; the "unstroke" operations
> would mimic the order in which the originals were entered, even if the editor
> had optomized the composition.
It's more complicated than that.

Many editors don't (always) support "micro" undo. At some point, 
keystrokes (or their result) are coalesced and an undo will delete 
entire words or phrases, perhaps entire bullet items on a slide.

If done right, this will feel natural. If I've made edits to my document 
in three places, inserting the same word, then it feels natural to 
"undo" these as whole words (and not slavishly by keystroke - including 
all the false starts and backspace keys).

At the current caret position, one would expect the undo to be less 
aggressive and act more like a backspace. But in that case the user 
would (roughly) remember the keystrokes that just happened, so inverting 
the sequence feels more natural.

That same memory is why backspacing by composition step (keystroke) is 
appealing - you intuitively know how many wrong keys were pressed. But 
many user interfaces do not support that. Composing SMS with the T9 
interface will let you erase characters from the composed string, but 
will not revert to earlier word-guesses, so you can't cycle back, except 
by erasing until the beginning and then starting over. For that 
interface, the upside is that sometimes breaking a word apart by 
freezing the composition of the leading part, erasing parts that don't 
fit and then composing the remainder is the most efficient way to get 
around some limitations of the composition method.

Whatever the details, the design of an ideal user interface should not 
drive, or worse, dictate the character encoding - nor should the reverse 
be true.

A./


From A.Schappo at lboro.ac.uk  Thu Mar 20 07:24:38 2014
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Thu, 20 Mar 2014 12:24:38 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net>
References: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net>
Message-ID: <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk>


WRT Hangul Syllables & OSX TextEdit

Take a Hangul Syllable such as ?

backspace erases the whole syllable
control+backspace erases one jamo at a time from the syllable

Andr? Schappo

On 19 Mar 2014, at 16:38, Doug Ewell wrote:

> Andre Schappo <A dot Schappo at lboro dot ac dot uk> wrote:
> 
>> WRT Latin. I have just tested with OSX TextEdit and the precomposed
>> character ? U+00E9
>> 
>> Backspace erases ?
>> Control+Backspace erases ? leaving one with e
>> 
>> I had not realised this was possible until I experimented with
>> combinations of Backspace + alt/command/ctrl/shift
> 
> That's the sort of feature I would just love. I also love the Alt+Tab
> and Windows+Tab features to switch between windows in Windows. I am led
> to believe that "normal users" (cf. nerds) hate this kind of hidden
> feature, and either never use it or become annoyed when they invoke it
> accidentally by hitting the magic key combination.
> 
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
> 


From shizhao at gmail.com  Thu Mar 20 08:36:56 2014
From: shizhao at gmail.com (shi zhao)
Date: Thu, 20 Mar 2014 21:36:56 +0800
Subject: two Hanzi
Message-ID: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>

plese add two Hanzi (up ?+ down ?) and (up ? + down ?)

see http://www.term.org.cn/CN/abstract/abstract9314.shtml#

include in :
* Zhonghua Zihai??????, 1994: 1770.
* Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219.


 (up ?+ down ?)  = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 )
(up ? + down ?) = coronium = newtonium (see
http://yedict.com/zslistbs.asp?word=%C6%F87 )


My blog: http://shizhao.org
twitter: https://twitter.com/shizhao

[[zh:User:Shizhao]]


From mpsuzuki at hiroshima-u.ac.jp  Thu Mar 20 08:50:58 2014
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Thu, 20 Mar 2014 22:50:58 +0900
Subject: ["Unicode"] two Hanzi
In-Reply-To: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
Message-ID: <532AF242.60703@hiroshima-u.ac.jp>

If they are officially standardized characters for the
elements by PRC government, China NB will submit them
to ISO/IEC 10646 via Urgently Needed Characters process.
They are official?

Regards,
mpsuzuki

On 03/20/2014 10:36 PM, shi zhao wrote:
> plese add two Hanzi (up ?+ down ?) and (up ? + down ?)
> 
> see http://www.term.org.cn/CN/abstract/abstract9314.shtml#
> 
> include in :
> * Zhonghua Zihai??????, 1994: 1770.
> * Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219.
> 
> 
>   (up ?+ down ?)  = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 )
> (up ? + down ?) = coronium = newtonium (see
> http://yedict.com/zslistbs.asp?word=%C6%F87 )
> 
> 
> 
> My blog: http://shizhao.org
> twitter: https://twitter.com/shizhao
> 
> [[zh:User:Shizhao]]
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
> 


From shizhao at gmail.com  Thu Mar 20 09:02:50 2014
From: shizhao at gmail.com (shi zhao)
Date: Thu, 20 Mar 2014 22:02:50 +0800
Subject: ["Unicode"] two Hanzi
In-Reply-To: <532AF242.60703@hiroshima-u.ac.jp>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
 <532AF242.60703@hiroshima-u.ac.jp>
Message-ID: <CAG9jPxwxfX20a5T1jYRmhGxuAgyssn7acWanPFdFkceUij9ypA@mail.gmail.com>

there is former chinese translation of  newtonium and nebulium

see
https://en.wikipedia.org/wiki/Coronium
https://en.wikipedia.org/wiki/Nebulium
Chinese wikipedia: http://zh.wikipedia.org/
My blog: http://shizhao.org
twitter: https://twitter.com/shizhao

[[zh:User:Shizhao]]


2014-03-20 21:50 GMT+08:00 suzuki toshiya <mpsuzuki at hiroshima-u.ac.jp>:
> If they are officially standardized characters for the
> elements by PRC government, China NB will submit them
> to ISO/IEC 10646 via Urgently Needed Characters process.
> They are official?
>
> Regards,
> mpsuzuki
>
>
> On 03/20/2014 10:36 PM, shi zhao wrote:
>>
>> plese add two Hanzi (up ?+ down ?) and (up ? + down ?)
>>
>> see http://www.term.org.cn/CN/abstract/abstract9314.shtml#
>>
>> include in :
>> * Zhonghua Zihai??????, 1994: 1770.
>> * Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219.
>>
>>
>>   (up ?+ down ?)  = nebulium (see
>> http://yedict.com/zslistbs.asp?word=%C6%F84 )
>> (up ? + down ?) = coronium = newtonium (see
>> http://yedict.com/zslistbs.asp?word=%C6%F87 )
>>
>>
>>
>> My blog: http://shizhao.org
>> twitter: https://twitter.com/shizhao
>>
>> [[zh:User:Shizhao]]
>>
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
>>
>


From jknappen at web.de  Thu Mar 20 09:12:53 2014
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Thu, 20 Mar 2014 15:12:53 +0100
Subject: Aw: Re: ["Unicode"] two Hanzi
In-Reply-To: <532AF242.60703@hiroshima-u.ac.jp>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>, 
 <532AF242.60703@hiroshima-u.ac.jp>
Message-ID: <trinity-b366e260-508f-4749-bdd0-3d5f84f85e5f-1395324773653@3capp-webde-bs16>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/a52743bc/attachment.html>

From andrewcwest at gmail.com  Thu Mar 20 09:59:00 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 20 Mar 2014 14:59:00 +0000
Subject: ["Unicode"] two Hanzi
In-Reply-To: <trinity-b366e260-508f-4749-bdd0-3d5f84f85e5f-1395324773653@3capp-webde-bs16>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
 <532AF242.60703@hiroshima-u.ac.jp>
 <trinity-b366e260-508f-4749-bdd0-3d5f84f85e5f-1395324773653@3capp-webde-bs16>
Message-ID: <CALgEMhzgzF+hewpR2+ygm8SrfhSaSmsYJd2-UepfDfbDyLnthg@mail.gmail.com>

On 20 March 2014 14:12, "J?rg Knappen" <jknappen at web.de> wrote:
>
> Who writes a proposal?

I wish that there was a mechanism for encoding CJK characters that
allowed individuals to simply submit characters with appropriate
evidence to Unicode, and after review they could be added to the next
version Unicode, but the reality is that you need to go through a long
and bureaucratic process involving the Ideographic Rapporteur Group
(IRG), with the result that it may take ten years to get a CJK
character encoded.  Even the Unicode Consortium seems powerless to
overcome IRG bureaucracy, as the sorry tale below illustrates.

In 2012 I wrote a proposal to encode 226 Han characters, including two
fish characters previously requested by Shi Zhao on this list
<http://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0259.html>,
which I submitted to the Unicode Technical Committee (UTC):

<http://www.unicode.org/L2/L2012/12333-cjk-f.pdf>

The UTC accepted this document, and included the suggested characters
in the Unicode submission to the IRG for inclusion in the CJK-F
extension:

<http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRGN1888_UTCExtensionF.zip>

This was discussed at the IRG meeting in Hanoi in November 2012
(http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRG39.htm), but the
Unicode submission for CJK-F was entirely rejected by IRG just because
the submission was a couple of days late.

The UTC later submitted a proposal to encode 19 of the original
characters (including Shi Zhao's two fish characters) as urgently
needed characters:

<http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg40/IRGN1936_UTC_UNC.zip>

But this was rejected by IRG in November last year as they considered
that these characters were not urgent enough, so now we will have to
wait another four or five years before they can be considered for
CJK-G.

Good luck getting the characters for newtonium and nebulium encoded any sooner!

Andrew


From mpsuzuki at hiroshima-u.ac.jp  Thu Mar 20 11:20:10 2014
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Fri, 21 Mar 2014 01:20:10 +0900
Subject: ["Unicode"] two Hanzi
In-Reply-To: <CALgEMhzgzF+hewpR2+ygm8SrfhSaSmsYJd2-UepfDfbDyLnthg@mail.gmail.com>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
 <532AF242.60703@hiroshima-u.ac.jp>
 <trinity-b366e260-508f-4749-bdd0-3d5f84f85e5f-1395324773653@3capp-webde-bs16>
 <CALgEMhzgzF+hewpR2+ygm8SrfhSaSmsYJd2-UepfDfbDyLnthg@mail.gmail.com>
Message-ID: <532B153A.1080100@hiroshima-u.ac.jp>

Hi,

I have no objection against the impression of the
slowness, but please don't say IRG as bureaucratic.

IRG members are pushing themselves to their limits
for reviewing process of the thousands of the
submitted characters. Although IRG could not
response "here you are" immediately to the voice
"give me", but IRG is not saying "go away".

In my personal impression, the UNC submissions from
the experts are slightly difficult for other IRG
members to evaluate about their urgency. Taking Chinese
UNC submission, the urgency is justified by the
update of the normative Hanzi table. Until the
standardization of the characters in PRC's UNC,
the governmental procurements have the difficulty to
request the feature to interchange the normative
characters. Apparently, it is not only the domestic
problem in China, but also the problems for the
industries trading around Chinese market.

If I submit some characters sampled from a dictionary
as UNC, how I could make the delegates sympathized
as "they are also urgently needed as other governmental
requests"? I don't have good idea.

Regards,
mpsuzuki

On 03/20/2014 11:59 PM, Andrew West wrote:
> On 20 March 2014 14:12, "J?rg Knappen" <jknappen at web.de> wrote:
>>
>> Who writes a proposal?
>
> I wish that there was a mechanism for encoding CJK characters that
> allowed individuals to simply submit characters with appropriate
> evidence to Unicode, and after review they could be added to the next
> version Unicode, but the reality is that you need to go through a long
> and bureaucratic process involving the Ideographic Rapporteur Group
> (IRG), with the result that it may take ten years to get a CJK
> character encoded.  Even the Unicode Consortium seems powerless to
> overcome IRG bureaucracy, as the sorry tale below illustrates.
>
> In 2012 I wrote a proposal to encode 226 Han characters, including two
> fish characters previously requested by Shi Zhao on this list
> <http://www.unicode.org/mail-arch/unicode-ml/y2012-m05/0259.html>,
> which I submitted to the Unicode Technical Committee (UTC):
>
> <http://www.unicode.org/L2/L2012/12333-cjk-f.pdf>
>
> The UTC accepted this document, and included the suggested characters
> in the Unicode submission to the IRG for inclusion in the CJK-F
> extension:
>
> <http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRGN1888_UTCExtensionF.zip>
>
> This was discussed at the IRG meeting in Hanoi in November 2012
> (http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg39/IRG39.htm), but the
> Unicode submission for CJK-F was entirely rejected by IRG just because
> the submission was a couple of days late.
>
> The UTC later submitted a proposal to encode 19 of the original
> characters (including Shi Zhao's two fish characters) as urgently
> needed characters:
>
> <http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg40/IRGN1936_UTC_UNC.zip>
>
> But this was rejected by IRG in November last year as they considered
> that these characters were not urgent enough, so now we will have to
> wait another four or five years before they can be considered for
> CJK-G.
>
> Good luck getting the characters for newtonium and nebulium encoded any sooner!
>
> Andrew
>


From rscook at wenlin.com  Thu Mar 20 11:43:31 2014
From: rscook at wenlin.com (Richard COOK)
Date: Thu, 20 Mar 2014 09:43:31 -0700
Subject: two Hanzi
In-Reply-To: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
Message-ID: <E5D44FE8-8FE9-4985-8F17-B176F713D554@wenlin.com>

On Mar 20, 2014, at 6:36 AM, shi zhao wrote:

> plese add two Hanzi (up ?+ down ?) and (up ? + down ?)
> 
> see http://www.term.org.cn/CN/abstract/abstract9314.shtml#
> 
> include in :
> * Zhonghua Zihai??????, 1994: 1770.
> * Lu gusun?????, The English-Chinese Dictionary (?????), 1991: 701,2219.
> 
> 
> (up ?+ down ?)  = nebulium (see http://yedict.com/zslistbs.asp?word=%C6%F84 )
> (up ? + down ?) = coronium = newtonium (see
> http://yedict.com/zslistbs.asp?word=%C6%F87 )

Interesting, yedict.com lists a few characters as "?unicode??", some repeatedly.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2014-03-20 at 9.20.54 AM.png
Type: image/png
Size: 47807 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/7e96f9b3/attachment.png>
-------------- next part --------------


-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen shot 2014-03-20 at 9.21.10 AM.png
Type: image/png
Size: 18639 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20140320/7e96f9b3/attachment-0001.png>
-------------- next part --------------


??? =?!??
??? =?
??? !?? # [U+2C1CF] Ext E
??? !??

One of these is in Ext E (from V-Source), but the other three seem not to be encoded.

None is tracked in TR45.

I'm adding them to the CDL database ... they could be tracked in TR45 if someone does a proposal to document them ... this would ensure that IRG at least looks at them.

Note that the structure may differ ... 

??X
??X
??X

might all refer to the same abstract character ...

-Richard


> 
> 
> My blog: http://shizhao.org
> twitter: https://twitter.com/shizhao
> 
> [[zh:User:Shizhao]]
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
> 


From kojiishi at gluesoft.co.jp  Thu Mar 20 12:44:56 2014
From: kojiishi at gluesoft.co.jp (Koji Ishii)
Date: Thu, 20 Mar 2014 17:44:56 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk>
References: <20140319093819.665a7a7059d7ee80bb4d670165c8327d.6f09698b95.wbe@email03.secureserver.net>
 <649F7CA8-8CD0-44D5-B6F7-EDFAF56E993C@lboro.ac.uk>
Message-ID: <1A2FCF7B-7FE5-4A41-BB43-315DFB9E81E1@gluesoft.co.jp>

Japanese Windows IME does similar. After committing converted text, Backspace erases the last converted character, while CTRL+Backspace right after commits undo the commit and put the text back to converted (i.e., undetermined or editing) state.

/koji


On Mar 20, 2014, at 9:24 PM, Andre Schappo <A.Schappo at lboro.ac.uk> wrote:


WRT Hangul Syllables & OSX TextEdit

Take a Hangul Syllable such as ?

backspace erases the whole syllable
control+backspace erases one jamo at a time from the syllable

Andr? Schappo

On 19 Mar 2014, at 16:38, Doug Ewell wrote:

> Andre Schappo <A dot Schappo at lboro dot ac dot uk> wrote:
> 
>> WRT Latin. I have just tested with OSX TextEdit and the precomposed
>> character ? U+00E9
>> 
>> Backspace erases ?
>> Control+Backspace erases ? leaving one with e
>> 
>> I had not realised this was possible until I experimented with
>> combinations of Backspace + alt/command/ctrl/shift
> 
> That's the sort of feature I would just love. I also love the Alt+Tab
> and Windows+Tab features to switch between windows in Windows. I am led
> to believe that "normal users" (cf. nerds) hate this kind of hidden
> feature, and either never use it or become annoyed when they invoke it
> accidentally by hitting the magic key combination.
> 
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
> 


_______________________________________________
Unicode mailing list
Unicode at unicode.org
http://unicode.org/mailman/listinfo/unicode


From andrewcwest at gmail.com  Fri Mar 21 05:13:11 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Fri, 21 Mar 2014 10:13:11 +0000
Subject: two Hanzi
In-Reply-To: <E5D44FE8-8FE9-4985-8F17-B176F713D554@wenlin.com>
References: <CAG9jPxz4yGGpFx-Wb7+1cFk3X6iy8qtXvO-4prFkYbgCpcZM7g@mail.gmail.com>
 <E5D44FE8-8FE9-4985-8F17-B176F713D554@wenlin.com>
Message-ID: <CALgEMhxjLQWGKaoN56334wKmoYFuEs633787a99REeLNO4o6cQ@mail.gmail.com>

On 20 March 2014 16:43, Richard COOK <rscook at wenlin.com> wrote:
>
> Interesting, yedict.com lists a few characters as "?unicode??", some repeatedly.
>
> ??? =?!??
> ??? =?
> ??? !?? # [U+2C1CF] Ext E
> ??? !??
>
> One of these is in Ext E (from V-Source), but the other three seem not to be encoded.

The "Zhonghua Zihai" ?????? dictionary includes thousands of
characters not yet encoded in Unicode, and just under the ? radical
yedict.com lists 24 not-in-Unicode characters from "Zhonghua Zihai",
of which 18 are not encoded or included in CJK-E:

???
??? = U+520F ?
???
????
??? = CJK-E U+2C1CF
???
???
??? = CJK-E U+2C1D0
???
???
???
???
??? = CJK-E U+2C1D1
????
???
????
???
????
???
??? = CJK-E U+2C1D2
????? != U+20103 ?? (?????)
??? = CJK-E U+2C1D3
???????? != U+2010B ?? (????????)
???

> Note that the structure may differ ...
>
> ??X
> ??X
> ??X
>
> might all refer to the same abstract character ...

... but nevertheless would not be unified according to Annex S.

Andrew


From velterop at gmail.com  Fri Mar 21 06:14:50 2014
From: velterop at gmail.com (Jan Velterop)
Date: Fri, 21 Mar 2014 11:14:50 +0000
Subject: New symbol to denote true open access (e.g. to scholarly literature),
 analogous to the copyright symbol
Message-ID: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>

May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic.

A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html

The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/)

I am seeking comments and support for this proposal.

Jan Velterop


From jknappen at web.de  Fri Mar 21 09:33:07 2014
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Fri, 21 Mar 2014 15:33:07 +0100
Subject: Aw: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
Message-ID: <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140321/165266dd/attachment.html>

From velterop at gmail.com  Fri Mar 21 10:22:37 2014
From: velterop at gmail.com (Jan Velterop)
Date: Fri, 21 Mar 2014 15:22:37 +0000
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
 <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>
Message-ID: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>

But are the chances nil? It would be a nice complement to the series of ?, ?, ?, etcetera and perform a similar function. A symbol for Creative Commons, presumably a double c in a circle, would probably indicate the document in question is covered by one of the CC licences, but it wouldn't be clear by which one, which may be an impediment for having a symbol. Similarly, copyleft is also a licensing scheme, and as such is not quite as unambiguous as ?, ?, and ? are. Also, neither a cc or a copyleft symbol is in the same 'single encircled letter' convention.

For the encircled 'a' symbol for open access it is proposed to use this definition: 

"The symbol for 'open access', if applied to documents and images, indicates their free availability, on the internet or otherwise, permitting any users to read, download, copy, distribute, (re)print, search, or link to the full texts of such documents, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself or to printing materials and facilities. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

Jan Velterop

On 21 Mar 2014, at 14:33, J?rg Knappen <jknappen at web.de> wrote:

> Even when this symbol really catches on (what I doubt because it is too close to the @ sign in the first place) chance are low that it will be encoded in UNicode. Precedents like the Creative Commons sign or the Copyleft sign have been discussed on this mailing list (search the archives for the relevant threads) but were never encoded in UNicode.
>  
> When the symbol does not catch on, why should it be encoded in UNicode?
>  
> --J?rg Knappen
>  
> Gesendet: Freitag, 21. M?rz 2014 um 12:14 Uhr
> Von: "Jan Velterop" <velterop at gmail.com>
> An: unicode at unicode.org
> Betreff: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol
> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic.
> 
> A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html
> 
> The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/)
> 
> I am seeking comments and support for this proposal.
> 
> Jan Velterop
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140321/56318745/attachment.html>

From slevin at signpuddle.net  Fri Mar 21 10:55:01 2014
From: slevin at signpuddle.net (Stephen E Slevinski Jr)
Date: Fri, 21 Mar 2014 10:55:01 -0500
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
 <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>
 <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>
Message-ID: <532C60D5.9040306@signpuddle.net>

On 3/21/14, 10:22 AM, Jan Velterop wrote:
> But are the chances nil? 
Unicode won't encode new symbols without demonstrated use.  A recent 
exception was a currency symbol, but it had institutional support.

If your new symbol gains widespread use, there is a chance.  If you can 
not demonstrate anyone using the symbol, the chance is nil.

Regards,
-Steve
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140321/4d4674fe/attachment.html>

From asmusf at ix.netcom.com  Fri Mar 21 11:06:59 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 21 Mar 2014 09:06:59 -0700
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
 <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>
 <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>
Message-ID: <532C63A3.7080007@ix.netcom.com>

On 3/21/2014 8:22 AM, Jan Velterop wrote:
> But are the chances nil? 

Essentially you are trying to create a symbol for "this material is 
placed in the public domain". If you get that symbol adopted by similar 
authorities as those that created ?, then you would see it encoded in 
due time. If not, it would have to become massively adopted to become a 
"de-facto" convention first, but, without an encoded character, that is 
really unlikely. So, if you are serious about his idea, the rout is to 
get the convention formally adopted first.

A./
> It would be a nice complement to the series of ?, ?, ?, etcetera and 
> perform a similar function. A symbol for Creative Commons, presumably 
> a double c in a circle, would probably indicate the document in 
> question is covered by one of the CC licences, but it wouldn't be 
> clear by which one, which may be an impediment for having a symbol. 
> Similarly, copyleft is also a licensing scheme, and as such is not 
> quite as unambiguous as ?, ?, and ? are. Also, neither a cc or a 
> copyleft symbol is in the same 'single encircled letter' convention.
>
> For the encircled 'a' symbol for open access it is proposed to use 
> this definition:
>
>     "The symbol for 'open access', if applied to documents and images,
>     indicates their free availability, on the internet or otherwise,
>     permitting any users to read, download, copy, distribute,
>     (re)print, search, or link to the full texts of such documents,
>     crawl them for indexing, pass them as data to software, or use
>     them for any other lawful purpose, without financial, legal, or
>     technical barriers other than those inseparable from gaining
>     access to the internet itself or to printing materials and
>     facilities. The only constraint on reproduction and distribution,
>     and the only role for copyright in this domain, should be to give
>     authors control over the integrity of their work and the right to
>     be properly acknowledged and cited.
>
>
> Jan Velterop
>
> On 21 Mar 2014, at 14:33, J?rg Knappen <jknappen at web.de 
> <mailto:jknappen at web.de>> wrote:
>
>> Even when this symbol really catches on (what I doubt because it is 
>> too close to the @ sign in the first place) chance are low that it 
>> will be encoded in UNicode. Precedents like the Creative Commons sign 
>> or the Copyleft sign have been discussed on this mailing list (search 
>> the archives for the relevant threads) but were never encoded in UNicode.
>> When the symbol does not catch on, why should it be encoded in UNicode?
>> --J?rg Knappen
>> *Gesendet:* Freitag, 21. M?rz 2014 um 12:14 Uhr
>> *Von:* "Jan Velterop" <velterop at gmail.com <mailto:velterop at gmail.com>>
>> *An:* unicode at unicode.org <mailto:unicode at unicode.org>
>> *Betreff:* New symbol to denote true open access (e.g. to scholarly 
>> literature), analogous to the copyright symbol
>> May I propose a new Unicode symbol to denote true open access, for 
>> instance applied to scholarly literature, in a similar way that ? and 
>> ? denote copyright and registered trademarks respectively? The 
>> proposed symbol is an encircled lower case letter a, in particular in 
>> a font where the a has a 'tail', as in a font like Arial, for 
>> instance, and not as in a font like Century Gothic.
>>
>> A sketch of what I have in mind is here: 
>> http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html
>>
>> The intended use would be for documents and images that have been 
>> published with so-called BOAI-compliant open access 
>> (http://www.budapestopenaccessinitiative.org/read), meaning that all 
>> reuse is permitted, with the only permissible condition that the 
>> author(s) should be acknowledged (CC_BY licence: 
>> http://creativecommons.org/licenses/by/4.0/). This condition would 
>> not be mandatory, and also public domain, CC-0 licences would be 
>> denoted by the proposed symbol 
>> (http://creativecommons.org/publicdomain/zero/1.0/)
>>
>> I am seeking comments and support for this proposal.
>>
>> Jan Velterop
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org <mailto:Unicode at unicode.org>
>> http://unicode.org/mailman/listinfo/unicode
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140321/40890520/attachment.html>

From velterop at gmail.com  Fri Mar 21 12:17:17 2014
From: velterop at gmail.com (Jan Velterop)
Date: Fri, 21 Mar 2014 17:17:17 +0000
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <532C63A3.7080007@ix.netcom.com>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
 <trinity-533bc424-f20a-4cb1-8e63-04168562ebed-1395412387862@3capp-webde-bs38>
 <36DF3186-3E99-4748-8461-9A6E00394804@gmail.com>
 <532C63A3.7080007@ix.netcom.com>
Message-ID: <3CBFA1D1-872C-44A0-85D0-9BB593F54014@gmail.com>

Apparently it is already in Unicode, as ? (U+24D0) ? from anonymous feedback.

No further need for a formal proposal.

Jan Velterop

On 21 Mar 2014, at 16:06, Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 3/21/2014 8:22 AM, Jan Velterop wrote:
>> But are the chances nil?
> 
> Essentially you are trying to create a symbol for "this material is placed in the public domain". If you get that symbol adopted by similar authorities as those that created ?, then you would see it encoded in due time. If not, it would have to become massively adopted to become a "de-facto" convention first, but, without an encoded character, that is really unlikely. So, if you are serious about his idea, the rout is to get the convention formally adopted first.
> 
> A./
>> It would be a nice complement to the series of ?, ?, ?, etcetera and perform a similar function. A symbol for Creative Commons, presumably a double c in a circle, would probably indicate the document in question is covered by one of the CC licences, but it wouldn't be clear by which one, which may be an impediment for having a symbol. Similarly, copyleft is also a licensing scheme, and as such is not quite as unambiguous as ?, ?, and ? are. Also, neither a cc or a copyleft symbol is in the same 'single encircled letter' convention.
>> 
>> For the encircled 'a' symbol for open access it is proposed to use this definition: 
>> 
>> "The symbol for 'open access', if applied to documents and images, indicates their free availability, on the internet or otherwise, permitting any users to read, download, copy, distribute, (re)print, search, or link to the full texts of such documents, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself or to printing materials and facilities. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
>> 
>> Jan Velterop
>> 
>> On 21 Mar 2014, at 14:33, J?rg Knappen <jknappen at web.de> wrote:
>> 
>>> Even when this symbol really catches on (what I doubt because it is too close to the @ sign in the first place) chance are low that it will be encoded in UNicode. Precedents like the Creative Commons sign or the Copyleft sign have been discussed on this mailing list (search the archives for the relevant threads) but were never encoded in UNicode.
>>>  
>>> When the symbol does not catch on, why should it be encoded in UNicode?
>>>  
>>> --J?rg Knappen
>>>  
>>> Gesendet: Freitag, 21. M?rz 2014 um 12:14 Uhr
>>> Von: "Jan Velterop" <velterop at gmail.com>
>>> An: unicode at unicode.org
>>> Betreff: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol
>>> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic.
>>> 
>>> A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html
>>> 
>>> The intended use would be for documents and images that have been published with so-called BOAI-compliant open access (http://www.budapestopenaccessinitiative.org/read), meaning that all reuse is permitted, with the only permissible condition that the author(s) should be acknowledged (CC_BY licence: http://creativecommons.org/licenses/by/4.0/). This condition would not be mandatory, and also public domain, CC-0 licences would be denoted by the proposed symbol (http://creativecommons.org/publicdomain/zero/1.0/)
>>> 
>>> I am seeking comments and support for this proposal.
>>> 
>>> Jan Velterop
>>> _______________________________________________
>>> Unicode mailing list
>>> Unicode at unicode.org
>>> http://unicode.org/mailman/listinfo/unicode
>> 
>> 
>> 
>> _______________________________________________
>> Unicode mailing list
>> Unicode at unicode.org
>> http://unicode.org/mailman/listinfo/unicode
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140321/9646878f/attachment.html>

From jf at colson.eu  Fri Mar 21 15:42:40 2014
From: jf at colson.eu (=?UTF-8?B?SmVhbi1GcmFuw6dvaXMgQ29sc29u?=)
Date: Fri, 21 Mar 2014 21:42:40 +0100
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
Message-ID: <532CA440.6090300@colson.eu>

Le 21/03/14 12:14, Jan Velterop a ?crit :
> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in a font like Century Gothic.
>
> A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html

Could ? do the job ?


From budelberger.richard at wanadoo.fr  Fri Mar 21 16:49:39 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Fri, 21 Mar 2014 22:49:39 +0100 (CET)
Subject: New symbol to denote true open access (e.g. to scholarly
 literature), analogous to the copyright symbol
In-Reply-To: <532CA440.6090300@colson.eu>
References: <6CE8EC74-BD10-405E-A5AA-EFCC8305122C@gmail.com>
 <532CA440.6090300@colson.eu>
Message-ID: <1708827360.38258.1395438579815.JavaMail.www@wwinf1p12>

> Message du 21/03/14 21:56
> De : "Jean-Fran?ois Colson" 
> A : unicode at unicode.org
> Copie ? : 
> Objet : Re: New symbol to denote true open access (e.g. to scholarly literature), analogous to the copyright symbol
> 
> Le 21/03/14 12:14, Jan Velterop a ?crit : 
>> May I propose a new Unicode symbol to denote true open access, for instance applied to scholarly literature, in a similar way that ? and ? denote copyright and registered 
trademarks respectively? The proposed symbol is an encircled lower case letter a, in particular in a font where the a has a 'tail', as in a font like Arial, for instance, and not as in 
a font like Century Gothic. 
> > A sketch of what I have in mind is here: http://theparachute.blogspot.co.uk/2014/03/proposed-open-access-symbol.html 

> Could ? do the job ?

No, ou alors, meanwhile?


From verdy_p at wanadoo.fr  Sat Mar 22 03:41:40 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 22 Mar 2014 09:41:40 +0100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140322000439.4af309a5@JRWUBU2>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
Message-ID: <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>

2014-03-22 1:04 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Thu, 20 Mar 2014 05:59:49 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>  Not all Indic diacritics have combining class 0, and Hebrew diacritics
> have non-zero combining classes.
>

Did I say something else ? You have probably misread me. I have
written "distinct
and non-zero"

You forgot the term "AND" which is important as it gives the condition
where combining characters may be reordered during normalization, and so
that their relative encoding order is unpreditable (independantly of the
fact that they may be precomposed).

So if you enter <C, CEDILLA, ACUTE> or <C, ACUTE, CEDILLA>, you get in the
editor's backing store some encoding form (which my be precombined or not,
or with diacritics not necessarily in the normalized form, and all these 4
possible encodings are canonically equivalent): they if you press
Backspace, the effect should also not depend on whever you just entered
these keystroke or if you loaded the text and clicked after the sequence
before pressing backspace: How can you predict which character to remove ?

That why here it should delete BOTH the CEDILLA and the ACUTE, because they
are using distinct and non-zero combining classes, and so are unordered.

The relationale would be true as well for Hebrew points (most of them use
distinct non-zero compbining classes when they are used in sequences).

But it won't apply to "diacritics" (combining characters or joiner controls
like CGJ, ZWK and ZWNJ, and possibly even some oher format controls) that
have combining class 0 because their encoding order is significant to you
know where to stop the effect of Backspace.

I see absolutely no reason why Backspace would arbitrarily delete only the
last encoded character when users canno even count them and may not have
input them separately. or could expect them to have be typed in a different
order.

So yes, entering:
<CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
<ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
<ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
<CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
should all result in keeping only the letter C in the backing store.

And with a IME supporint Compose key this will also be true;

<COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
<COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
<COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
<COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>

Canonical equivalence should be respected in visual editing modes. Deleting
only the "last" encoding diacritic should only be done in specific
non-visual editing modes (with "visible controls") and it is not expected
that most users will like this editing mode.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140322/1bc19c01/attachment.html>

From chris.fynn at gmail.com  Sat Mar 22 09:34:06 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Sat, 22 Mar 2014 20:34:06 +0600
Subject: Details, please (was: Re: Romanized Singhala got great reception
 in Sri Lanka)
In-Reply-To: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net>
References: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net>
Message-ID: <CAA_CYc+JXo_WT2FN=0=KNQAHrsp5KJGLZa0B+rS1F-DhGrt0bQ@mail.gmail.com>

On 18/03/2014, Doug Ewell <doug at ewellic.org> wrote:
> I think what some of us would like to see are detailed examples, citing
> specific characters and combinations, rather than general rhetoric, to
> support claims like this:

Yes


From jf at colson.eu  Sat Mar 22 10:54:53 2014
From: jf at colson.eu (=?ISO-8859-1?Q?Jean-Fran=E7ois_Colson?=)
Date: Sat, 22 Mar 2014 16:54:53 +0100
Subject: Details, please (was: Re: Romanized Singhala got great reception
 in Sri Lanka)
In-Reply-To: <CAA_CYc+JXo_WT2FN=0=KNQAHrsp5KJGLZa0B+rS1F-DhGrt0bQ@mail.gmail.com>
References: <20140318102721.665a7a7059d7ee80bb4d670165c8327d.354a8bcfe2.wbe@email03.secureserver.net>
 <CAA_CYc+JXo_WT2FN=0=KNQAHrsp5KJGLZa0B+rS1F-DhGrt0bQ@mail.gmail.com>
Message-ID: <532DB24D.50803@colson.eu>

Le 22/03/14 15:34, Christopher Fynn a ?crit :
> On 18/03/2014, Doug Ewell <doug at ewellic.org> wrote:
>> I think what some of us would like to see are detailed examples, citing
>> specific characters and combinations, rather than general rhetoric, to
>> support claims like this:
> Yes
+1


From richard.wordingham at ntlworld.com  Sat Mar 22 14:50:56 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 22 Mar 2014 19:50:56 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
Message-ID: <20140322195056.3e6d7c61@JRWUBU2>

On Sat, 22 Mar 2014 09:41:40 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-22 1:04 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
 
> So if you enter <C, CEDILLA, ACUTE> or <C, ACUTE, CEDILLA>, you get
> in the editor's backing store some encoding form (which my be
> precombined or not, or with diacritics not necessarily in the
> normalized form, and all these 4 possible encodings are canonically
> equivalent): they if you press Backspace, the effect should also not
> depend on whever you just entered these keystroke or if you loaded
> the text and clicked after the sequence before pressing backspace:
> How can you predict which character to remove ?

If I entered those three characters in NFD order, I would expect to
remove the ACUTE.  I would annoyed to find the string reduced to
just C, and am annoyed to find it completely deleted.

I do not find consistently poor service to be better than frequently
poor service.

> The relationale would be true as well for Hebrew points (most of them
> use distinct non-zero compbining classes when they are used in
> sequences).

> But it won't apply to "diacritics" (combining characters or joiner
> controls like CGJ, ZWK and ZWNJ, and possibly even some oher format
> controls) that have combining class 0 because their encoding order is
> significant to you know where to stop the effect of Backspace.

Your approach recommends input methods that separate combining
marks of different combining classes by CGJ for easier editing!

> I see absolutely no reason why Backspace would arbitrarily delete
> only the last encoded character when users canno even count them and
> may not have input them separately. or could expect them to have be
> typed in a different order.
> 
> So yes, entering:
> <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> should all result in keeping only the letter C in the backing store.
 
> And with a IME supporint Compose key this will also be true;
 
> <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>

Your input methods suggest that there is something unitary about the
result - which makes sense if their output is U+1E08 LATIN CAPITAL
LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments if
'C' were replaced with 'S'?  There is no character LATIN CAPITAL
LETTER S WITH CEDILLA AND ACUTE.

It will be distinctly unpleasant and unnatural with an input method
that allows separate input of all three characters - C,
COMBINING CEDILLA and COMBINING ACUTE - one by one.  Your suggestion
that typing THAI CHARACTER RO RUA, THAI CHARACTER SARA UU, THAI
CHARACTER MAI THO, BACKSPACE should result in just THAI CHARACTER RO RUA
is unlikely to be welcome to Thais.

I believe our sharply opposing opinions arise because of different
views of the clusters.  You are seeing characters that are composed of
multiple elements.  I am seeing groups of characters that, in general,
happen not to be arranged in a line of constant direction. 

> Canonical equivalence should be respected in visual editing modes.
> Deleting only the "last" encoding diacritic should only be done in
> specific non-visual editing modes (with "visible controls") and it is
> not expected that most users will like this editing mode.

For users who know what characters should be there, it makes a lot of
sense to enter a non-visual editing mode - ideally of limited scope
- when editing a previously typed cluster.

Richard.


From verdy_p at wanadoo.fr  Sat Mar 22 17:37:49 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 22 Mar 2014 23:37:49 +0100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140322195056.3e6d7c61@JRWUBU2>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
 <20140322195056.3e6d7c61@JRWUBU2>
Message-ID: <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>

2014-03-22 20:50 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> > But it won't apply to "diacritics" (combining characters or joiner
>  > controls like CGJ, ZWK and ZWNJ, and possibly even some oher format
> > controls) that have combining class 0 because their encoding order is
> > significant to you know where to stop the effect of Backspace.
>
> Your approach recommends input methods that separate combining
> marks of different combining classes by CGJ for easier editing!
>

NO. I certainly do not recommend it ! This is a false assertion.

> I see absolutely no reason why Backspace would arbitrarily delete
> > only the last encoded character when users canno even count them and
> > may not have input them separately. or could expect them to have be
> > typed in a different order.
> >
> > So yes, entering:
> > <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> > <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> > <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> > <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> > should all result in keeping only the letter C in the backing store.
>
> > And with a IME supporint Compose key this will also be true;
>
> > <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> > <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> > <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> > <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
>
> Your input methods suggest that there is something unitary about the
> result - which makes sense if their output is U+1E08 LATIN CAPITAL
> LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments if
> 'C' were replaced with 'S'?  There is no character LATIN CAPITAL
> LETTER S WITH CEDILLA AND ACUTE.


I have NOT said that there existed such character (look at the separating
commas).
This is a false interpretation.

>
> It will be distinctly unpleasant and unnatural with an input method
> that allows separate input of all three characters - C,
> COMBINING CEDILLA and COMBINING ACUTE - one by one.  Your suggestion
> that typing THAI CHARACTER RO RUA, THAI CHARACTER SARA UU, THAI
> CHARACTER MAI THO, BACKSPACE should result in just THAI CHARACTER RO RUA
> is unlikely to be welcome to Thais.
>
> I believe our sharply opposing opinions arise because of different
> views of the clusters.  You are seeing characters that are composed of
> multiple elements.  I am seeing groups of characters that, in general,
> happen not to be arranged in a line of constant direction.


This is a pragmatic consideration, that canonical equivalence should also
be respected even when editing texts. The same key should produce
canonically equivalent text when editing at the same logical position texts
that are canonincally equivalent.

> Canonical equivalence should be respected in visual editing modes.
> > Deleting only the "last" encoding diacritic should only be done in
> > specific non-visual editing modes (with "visible controls") and it is
> > not expected that most users will like this editing mode.
>
> For users who know what characters should be there, it makes a lot of
> sense to enter a non-visual editing mode - ideally of limited scope
> - when editing a previously typed cluster.
>

As long as the IME (or keyboard driver) has not transmitted the characters
to the edited document, it may record the sequence of keystrokes used. But
clicing anywhere in the document, or pressing any cursor movement key will
reset the IME to its initial state. If an advanced IME is used to allow
editing the content of a cluster before the cursor position, it will
require a specific dialog to decompose the characters and render in the IME
the cluster as a sequence of characters rendered isolately in "view
controls mode").

Most text editors do not support such separate IME panel and in fact users
do not like seeing these IME popups appearing on top of the edited text.
They want to be able to inpute text diretly in the WYSIWIG window. The IME
panel is an advanced edit mode which requires specific support in the
application (and an integration similar to the panels used by spell
checkers).

IME popups also cause severe difficulties for accessibility, due to the
separation of the previewed text and the edited text in the panel, also
because it is difficult to naviate in these popups with the keyboard and
also because the popup is obscuring the rest of the text (complicating the
rereading).

And on small screens below 5 inches (like smartphones), it is really
difficult to fit the IME panel and make it easy to use with fingers, and
allow also reading a complete sentence, without reducing a lot the size of
touchable buttons, reducing a lot the font sizes, and making the text very
difficult to read.

That's why so many people over the age of 40 really hate composing any text
on smartphones and will prefer larger tablets : their smartphone is used
only to view small texts : this is a problem of vision - presbytie - and
size of fingers, the screen is too small to fit an IME editor except a TS9
one with 12 keys, used only to compose very short messages such as SMS or
Facebook status. On for this usage, people do not care much about composing
advanced diacritics; theu will compose only the basic letters and will even
drop correct punctuation except space and they won't care about
capitalisation if the spell cehceker of the TS9 editor does not guess it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140322/bcb261ba/attachment.html>

From richard.wordingham at ntlworld.com  Sat Mar 22 19:16:44 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 23 Mar 2014 00:16:44 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
 <20140322195056.3e6d7c61@JRWUBU2>
 <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>
Message-ID: <20140323001644.5dc25bd6@JRWUBU2>

On Sat, 22 Mar 2014 23:37:49 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-22 20:50 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > > But it won't apply to "diacritics" (combining characters or joiner
> > > controls like CGJ, ZWK and ZWNJ, and possibly even some oher
> > > format
> > > controls) that have combining class 0 because their encoding
> > > order is significant to you know where to stop the effect of
> > > Backspace.
> >
> > Your approach recommends input methods that separate combining
> > marks of different combining classes by CGJ for easier editing!
> >
> 
> NO. I certainly do not recommend it ! This is a false assertion.

If one takes your approach to handling input, then one needs CGJ to ease
the correction of diacritics.  I am not saying that you recommend the
use of CGJ.

> > I see absolutely no reason why Backspace would arbitrarily delete
> > > only the last encoded character when users canno even count them
> > > and may not have input them separately. or could expect them to
> > > have be typed in a different order.
> > >
> > > So yes, entering:
> > > <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> > > <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> > > <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> > > <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> > > should all result in keeping only the letter C in the backing
> > > store.
> >
> > > And with a IME supporint Compose key this will also be true;
> >
> > > <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> > > <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> > > <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> > > <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
> >
> > Your input methods suggest that there is something unitary about the
> > result - which makes sense if their output is U+1E08 LATIN CAPITAL
> > LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments
> > if 'C' were replaced with 'S'?  There is no character LATIN CAPITAL
> > LETTER S WITH CEDILLA AND ACUTE.
> 
> I have NOT said that there existed such character (look at the
> separating commas).

I looked at the names.  Dead keys are effectively modifiers applied
beforehand rather than simultaneously, so there is no more reason for
the dead key sequences to generate more than one character than there
is for an ordinary key to generate multiple characters.

The use of 'COMPOSE' indicates that one is not simply entering a
sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
an input process different to simply 'C, COMBINING CEDILLA, COMBINING
ACUTE'.

> This is a pragmatic consideration, that canonical equivalence should
> also be respected even when editing texts. The same key should produce
> canonically equivalent text when editing at the same logical position
> texts that are canonincally equivalent.

That raises an interesting question.  Which positions in the string <RO
RUA, SARA UU, MAI THO> (ccc = 0, 103, 107) are logically the same
positions as which positions in the canonically equivalent string <RO
RUA, MAI THO, SARA UU>?  Are you saying that some positions are not
'logical'?  I for one would prefer to be able to access any position
within the string.  It is a shame there has been so little uptake of
the SIL Graphite split cursor approach, which attempted to address the
issue of editing clusters.

As to pragmatics, we are discussing editing with feedback.  If we have
full feedback, we do not need canonical equivalence to be respected.

> If
> an advanced IME is used to allow editing the content of a cluster
> before the cursor position, it will require a specific dialog to
> decompose the characters and render in the IME the cluster as a
> sequence of characters rendered isolately in "view controls mode").

It is not a good idea to tamper with the normalisation in the first
place.  The sequence of characters used may say quite a bit about how
the user thinks of the cluster.  Pragmatically, normalisation may also
degrade rendering - recall the efforts Microsoft went to to discourage
the normalisation of Korean text!

> Most text editors do not support such separate IME panel and in fact
> users do not like seeing these IME popups appearing on top of the
> edited text. They want to be able to inpute text diretly in the
> WYSIWIG window. The IME panel is an advanced edit mode which requires
> specific support in the application (and an integration similar to
> the panels used by spell checkers).

A separate IME panel is not the only approach.  Another approach is to
use a modified font in the region of the cluster so that it displays
clusters suitably, and then renders the whole region in the
WYSIWYG region according to the usual rules except that it applies the
font modification in the relevant region.

Richard.


From verdy_p at wanadoo.fr  Sat Mar 22 21:32:06 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 23 Mar 2014 03:32:06 +0100
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140323001644.5dc25bd6@JRWUBU2>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
 <20140322195056.3e6d7c61@JRWUBU2>
 <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>
 <20140323001644.5dc25bd6@JRWUBU2>
Message-ID: <CAGa7JC0BQkhyn7eYCLXWZFRbij2VM_yUVfy7+d4Eq_31Q9pWig@mail.gmail.com>

2014-03-23 1:16 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sat, 22 Mar 2014 23:37:49 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > 2014-03-22 20:50 GMT+01:00 Richard Wordingham <
> > richard.wordingham at ntlworld.com>:
> >
> > > > But it won't apply to "diacritics" (combining characters or joiner
> > > > controls like CGJ, ZWK and ZWNJ, and possibly even some oher
> > > > format
> > > > controls) that have combining class 0 because their encoding
> > > > order is significant to you know where to stop the effect of
> > > > Backspace.
> > >
> > > Your approach recommends input methods that separate combining
> > > marks of different combining classes by CGJ for easier editing!
> > >
> >
> > NO. I certainly do not recommend it ! This is a false assertion.
>
> If one takes your approach to handling input, then one needs CGJ to ease
> the correction of diacritics.  I am not saying that you recommend the
> use of CGJ.
>
> > > I see absolutely no reason why Backspace would arbitrarily delete
> > > > only the last encoded character when users canno even count them
> > > > and may not have input them separately. or could expect them to
> > > > have be typed in a different order.
> > > >
> > > > So yes, entering:
> > > > <CEDILLA DEADKEY, ACUTE DEADKEY, C, BACKSPACE>, or
> > > > <ACUTE DEADKEY, CEDILLA DEADKEY, C, BACKSPACE>, or
> > > > <ACUTE DEADKEY, C WITH CEDILLA, BACKSPACE>, or
> > > > <CEDILLA DEADKEY, C WITH ACUTE, BACKSPACE>
> > > > should all result in keeping only the letter C in the backing
> > > > store.
> > >
> > > > And with a IME supporint Compose key this will also be true;
> > >
> > > > <COMPOSE, C, CEDILLA, ACUTE, BACKSPACE>, or
> > > > <COMPOSE, C, ACUTE, CEDILLA, BACKSPACE>, or
> > > > <COMPOSE, C WITH CEDILLA, ACUTE, BACKSPACE>, or
> > > > <COMPOSE, C WITH ACUTE, CEDILLA, BACKSPACE>
> > >
> > > Your input methods suggest that there is something unitary about the
> > > result - which makes sense if their output is U+1E08 LATIN CAPITAL
> > > LETTER C WITH CEDILLA AND ACUTE.  Would you make the same arguments
> > > if 'C' were replaced with 'S'?  There is no character LATIN CAPITAL
> > > LETTER S WITH CEDILLA AND ACUTE.
> >
> > I have NOT said that there existed such character (look at the
> > separating commas).
>
> I looked at the names.  Dead keys are effectively modifiers applied
> beforehand rather than simultaneously, so there is no more reason for
> the dead key sequences to generate more than one character than there
> is for an ordinary key to generate multiple characters.
>
> The use of 'COMPOSE' indicates that one is not simply entering a
> sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
> an input process different to simply 'C, COMBINING CEDILLA, COMBINING
> ACUTE'.
>

Here again you reinterpret what I did not say. When U used DEADKEY or
COMPOSE, I was evidently refering to keystrokes, not characters. So I did
not imply any encoding of characters (I was clear enough to say that
these sequences of keystrokes was allowed to generate any canonically
equivalent encoding), so instrad I described the input (on keyboard or IME)
and the expected output (an encoded text that should be canonically
equivalent).

I have NOWHERE intended to force the use of CGJ (you seem to imply that
these keys will generate separate combining diacritics/joiners, one or two,
for each key...

This is wrong, the IME or keyboard driver handles the state of keystrokes,
even if you use a COMPOSE key or a DEAD KEY, this does not matter, and so
it won't feed the encoded text with streams of characters as long as the
state is not complete enough:

In fact this input with a compose key does not work:
COMPOSE, C, CEDILLA, ACUTE
simply because the composed sequence is areaddy terminated after the
cedilla modifier key. So when you would type the acute modifier key it
would not be associated. That's another reson why dead keys are working:
the state is not complete as long as you have not *finally* input the base
letter. But let's suppose that the driver must generate something, then for
the ACUTE key it would need to output the combining character, possibly
with a preceding CGJ if the intent is to have the acute accent ordered
relatively with the cedilla (this is very unusual).

In most usages, by far, diacritics never need any preceding CGJ to preserve
their relative ordering: it is almost never the case for diacrititcs that
have distinct non-zero combining classes. The rare cases occur however in
classical pointed Hebrew.

For this reason the keyboard driver will likely include a separate key
mapping for the CGJ, either
- as a base key entered after the diacritic deadkey, to force the ouput of
CGJ+diacritic characters ; or
- as a sequence with COMPOSE+diacritic key, without any key for the
intermediate base letter, to produce the same ouput.
In the first case (driver with dead keys), you need a single keyboard
mapping for the CGJ working as a dead key. In the second case (driver with
compose key), you use the COMPOSE key mapping only, but you still need to
map positions for the second base key (in the 3-key compose sequence) meant
to represent diacritics.

The effect of Backspace entered just after it would delete simulatenously
CGJ and the diacritic characters. It does not need to depend on the input
state of the driver or the IME. In all cases, nothing in the keyboard
mapping or IME will generate a CGJ character isolately, ir will be always
followed by something.

But what would happen if you would type the compose sequence generating CGJ
with COMPOSE where you forget to press the initial base letter, or type
COMPOSE after the base letter ?
  C, COMPOSE, ACUTE
you get the characters <C,  CGJ, combining ACUTE> you cannot type another
CEDILLA after it without pressing COMPOSE again before it, to get <C,  CGJ,
combining ACUTE, CGJ, combining CEDILLA>.
The result is clearly abusing the use of CGJ when the input output should
just be canonically equivalent to
<C,  combining ACUTE, combining CEDILLA> (i.e. without any CGJ at all)

Your system would be even less meaningful, it would break in most renderers
and spell checkers. It would break in IDNA domain names. it would not match
in plain text search unless they are tuned so that ther collators discard
the CGJs to look for fuzzy matches (fuzzy matches would also look for
strings that are compatibility equivalent under NFKD, or could search at
collation levels 2, or at collation level 1 ignoring all diacritics and CGJ
wherever they are).

So compose keys cause more confusion to native users than dead keys that
are smarter as they can record more internal states and also allow
arbitrary order of input for unordered diacritics (like acute plus cedilla
: you can press their dead key in any order, the IME or driver handles the
case and generates them, preferably in canonical order with growing
combining classes; the drive or IME alos generates them in an input state
where it also knows the base letter to ouput, it can precombine the
diacritics and so it will output C WITH CEDILLA, followed by COMBINING
ACUTE, as expected, and still without needing any CGJ).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140323/651debb2/attachment.html>

From richard.wordingham at ntlworld.com  Sun Mar 23 06:51:09 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 23 Mar 2014 11:51:09 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAGa7JC0BQkhyn7eYCLXWZFRbij2VM_yUVfy7+d4Eq_31Q9pWig@mail.gmail.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
 <20140322195056.3e6d7c61@JRWUBU2>
 <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>
 <20140323001644.5dc25bd6@JRWUBU2>
 <CAGa7JC0BQkhyn7eYCLXWZFRbij2VM_yUVfy7+d4Eq_31Q9pWig@mail.gmail.com>
Message-ID: <20140323115109.54179d85@JRWUBU2>

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-23 1:16 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:

>> The use of 'COMPOSE' indicates that one is not simply entering a
>> sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
>> an input process different to simply 'C, COMBINING CEDILLA,
>> COMBINING ACUTE'.
 
> Here again you reinterpret what I did not say. When U used DEADKEY or
> COMPOSE, I was evidently refering to keystrokes, not characters.

That is how I understood it.

> So I
> did not imply any encoding of characters (I was clear enough to say
> that these sequences of keystrokes was allowed to generate any
> canonically equivalent encoding), so instrad I described the input
> (on keyboard or IME)

That is how I understood it.

> and the expected output (an encoded text that
> should be canonically equivalent).

I think you mean that you have only specified the generated character
output up to equivalence.  An actual implementation would have to chose
one specific sequence, though there might conceivably be a mechanism to
select this sequence.

> I have NOWHERE intended to force the use of CGJ (you seem to imply
> that these keys will generate separate combining diacritics/joiners,
> one or two, for each key...

The input method and the editing of backing store are generally done by
separate processes.  For IPA and Tai Tham input I have written my own
input methods.  If I frequently had to use a process editing backing
store as you recommend, I would be strongly tempted to write a variant
that protected marks with non-zero combining class by inserting CGJ.

> This is wrong, the IME or keyboard driver handles the state of
> keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does
> not matter, and so it won't feed the encoded text with streams of
> characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't
believe it is true of Tavultesoft Keyman for Windows either.  This
does require that the input method have a way of cancelling
previously provided input.  Now, if you use a method with a COMPOSE key
or a DEAD key, you are generally unlikely to get tentative entries.
However, one could write an input method that simulated a dead key but
actually generated an output for it so as to imitate a typewriter
differently.

> In fact this input with a compose key does not work:
> COMPOSE, C, CEDILLA, ACUTE
> simply because the composed sequence is areaddy terminated after the
> cedilla modifier key. So when you would type the acute modifier key it
> would not be associated.

I would not be at all surprised to find that someone has it working.

> That's another reson why dead keys are
> working: the state is not complete as long as you have not *finally*
> input the base letter. But let's suppose that the driver must
> generate something, then for the ACUTE key it would need to output
> the combining character, possibly with a preceding CGJ if the intent
> is to have the acute accent ordered relatively with the cedilla (this
> is very unusual).

Another method would be to generate, one character at a time, the
sequence <U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA, U+0008,
U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE>.  The NFD
decomposition of U+1E08 is <U+0043, U+0327, U+0301>.  The use of CGJ
would apply to 'COMPOSE, C, ACUTE, CEDILLA', for which I would again
expect to see the output U+1E08.

> The effect of Backspace entered just after it would delete
> simulatenously CGJ and the diacritic characters. It does not need to
> depend on the input state of the driver or the IME. In all cases,
> nothing in the keyboard mapping or IME will generate a CGJ character
> isolately, ir will be always followed by something.

If backspace is not modified by the input method - and Marc Durdin has
suggested that the input method should sometimes modify it - its
effect will depend on the process controlling the backing store, which
in general will work with multiple input methods, even during the
course of a single editing session.  You might not write an input
method that generates a single CGJ, but I do.  Do you insist on a soft
hyphen when writing 'Llangollen' so that it will collate after
'Llanberis' in Welsh?  (I typed the place names in English; the names
are spelt the same way in English and Welsh in hardcopy, though of
course the letter counts differ.)

> But what would happen if you would type the compose sequence
> generating CGJ with COMPOSE where you forget to press the initial
> base letter, or type COMPOSE after the base letter ?
>   C, COMPOSE, ACUTE
> you get the characters <C,  CGJ, combining ACUTE> you cannot type
> another CEDILLA after it without pressing COMPOSE again before it, to
> get <C,  CGJ, combining ACUTE, CGJ, combining CEDILLA>.
> The result is clearly abusing the use of CGJ when the input output
> should just be canonically equivalent to
> <C,  combining ACUTE, combining CEDILLA> (i.e. without any CGJ at all)

Lower case specimen? c????  (this was in NFD as I edited it)
Actually, I would prefer to avoid the first, unnecessary CGJ.
Lower case specimen: c???  (the was in NFD as I edited it)

> Your system would be even less meaningful, it would break in most
> renderers

Some, not all.  It renders fine in Firefox, though one can of course
set up input forms so that not even Thai renders properly.

> and spell checkers.

Most of the stuff I currently write with two combining marks of
non-zero ccc already fails with spell checkers.

> It would break in IDNA domain names.

No, it wouldn't.  If you consult Table B.1 in
http://tools.ietf.org/html/rfc3454#appendix-B 'Stringprep', you will
see that CGJ is stripped out.  For example, the URL
http://www.c???? .com, using the first specimen above, successfully
reached http://www.?.com/ when I used Firefox.

> would not match in plain text search unless they are tuned so that
> ther collators discard the CGJs to look for fuzzy matches (fuzzy
> matches would also look for strings that are compatibility equivalent
> under NFKD, or could search at collation levels 2, or at collation
> level 1 ignoring all diacritics and CGJ wherever they are).

Collation Level 3 searches would work for what I type.  Level 2 can
have a problem with diacritics frozen in the wrong order.

> So compose keys cause more confusion to native users than dead keys
> that are smarter as they can record more internal states and also
> allow arbitrary order of input for unordered diacritics (like acute
> plus cedilla : you can press their dead key in any order, the IME or
> driver handles the case and generates them, preferably in canonical
> order with growing combining classes; the drive or IME alos generates
> them in an input state where it also knows the base letter to ouput,
> it can precombine the diacritics and so it will output C WITH
> CEDILLA, followed by COMBINING ACUTE, as expected, and still without
> needing any CGJ).

A better easy solution is for backspace just to delete the previous
character, so the user will often find what he wants.  There is
then no need for the extra CGJ.  Commands to step into a cluster would
be helpful, but are more difficult.

One thing that bothers me is that no-one has come forward with the
conventions that an application must follow to work with Tavultesoft
Keyman and its derivatives and imitations.

Richard.


From richard.wordingham at ntlworld.com  Sun Mar 23 08:07:27 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 23 Mar 2014 13:07:27 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <CAGa7JC0BQkhyn7eYCLXWZFRbij2VM_yUVfy7+d4Eq_31Q9pWig@mail.gmail.com>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <ec337997ff2948e69133d1981368cc80@BL2PR03MB450.namprd03.prod.outlook.com>
 <CAGJ7U-W7a3cpwS7s62gB8Bm0Rw2taNZ-WNhcuP6fn1kpAED3hg@mail.gmail.com>
 <CAGa7JC1c+Z08fXK1UxiEMgUmSJAUSne83e4je3++p=L4s23uNg@mail.gmail.com>
 <20140322000439.4af309a5@JRWUBU2>
 <CAGa7JC1JOSWJo9rbEQZDgR+N67Fj8B5VW0CvO5J0t4nubuf_Ow@mail.gmail.com>
 <20140322195056.3e6d7c61@JRWUBU2>
 <CAGa7JC2uXcvoJiudfSJ+g2XkM48493OiL+Vs6JbHcb3gY=7iMA@mail.gmail.com>
 <20140323001644.5dc25bd6@JRWUBU2>
 <CAGa7JC0BQkhyn7eYCLXWZFRbij2VM_yUVfy7+d4Eq_31Q9pWig@mail.gmail.com>
Message-ID: <20140323130727.32346f80@JRWUBU2>

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-03-23 1:16 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:

>> The use of 'COMPOSE' indicates that one is not simply entering a
>> sequence of characters.  'COMPOSE, C, CEDILLA, ACUTE' should mean
>> an input process different to simply 'C, COMBINING CEDILLA,
>> COMBINING ACUTE'.
 
> Here again you reinterpret what I did not say. When U used DEADKEY or
> COMPOSE, I was evidently refering to keystrokes, not characters.

That is how I understood it.

> So I
> did not imply any encoding of characters (I was clear enough to say
> that these sequences of keystrokes was allowed to generate any
> canonically equivalent encoding), so instrad I described the input
> (on keyboard or IME)

That is how I understood it.

> and the expected output (an encoded text that
> should be canonically equivalent).

I think you mean that you have only specified the generated character
output up to equivalence.  An actual implementation would have to chose
one specific sequence, though there might conceivably be a mechanism to
select this sequence.

> I have NOWHERE intended to force the use of CGJ (you seem to imply
> that these keys will generate separate combining diacritics/joiners,
> one or two, for each key...

The input method and the editing of backing store are generally done by
separate processes.  For IPA and Tai Tham input I have written my own
input methods.  If I frequently had to use a process editing backing
store as you recommend, I would be strongly tempted to write a variant
that protected marks with non-zero combining class by inserting CGJ.

> This is wrong, the IME or keyboard driver handles the state of
> keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does
> not matter, and so it won't feed the encoded text with streams of
> characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't
believe it is true of Tavultesoft Keyman for Windows either.  This
does require that the input method have a way of cancelling
previously provided input.  Now, if you use a method with a COMPOSE key
or a DEAD key, you are generally unlikely to get tentative entries.
However, one could write an input method that simulated a dead key but
actually generated an output for it so as to imitate a typewriter
differently.

> In fact this input with a compose key does not work:
> COMPOSE, C, CEDILLA, ACUTE
> simply because the composed sequence is areaddy terminated after the
> cedilla modifier key. So when you would type the acute modifier key it
> would not be associated.

I would not be at all surprised to find that someone has it working.

> That's another reson why dead keys are
> working: the state is not complete as long as you have not *finally*
> input the base letter. But let's suppose that the driver must
> generate something, then for the ACUTE key it would need to output
> the combining character, possibly with a preceding CGJ if the intent
> is to have the acute accent ordered relatively with the cedilla (this
> is very unusual).

Another method would be to generate, one character at a time, the
sequence <U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA, U+0008,
U+1E08 LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE>.  The NFD
decomposition of U+1E08 is <U+0043, U+0327, U+0301>.  The use of CGJ
would apply to 'COMPOSE, C, ACUTE, CEDILLA', for which I would again
expect to see the output U+1E08.

> The effect of Backspace entered just after it would delete
> simulatenously CGJ and the diacritic characters. It does not need to
> depend on the input state of the driver or the IME. In all cases,
> nothing in the keyboard mapping or IME will generate a CGJ character
> isolately, ir will be always followed by something.

If backspace is not modified by the input method - and Marc Durdin has
suggested that the input method should sometimes modify it - its
effect will depend on the process controlling the backing store, which
in general will work with multiple input methods, even during the
course of a single editing session.  You might not write an input
method that generates a single CGJ, but I do.  Do you insist on a soft
hyphen when writing 'Llangollen' so that it will collate after
'Llanberis' in Welsh?  (I typed the place names in English; the names
are spelt the same way in English and Welsh in hardcopy, though of
course the letter counts differ.)

> But what would happen if you would type the compose sequence
> generating CGJ with COMPOSE where you forget to press the initial
> base letter, or type COMPOSE after the base letter ?
>   C, COMPOSE, ACUTE
> you get the characters <C,  CGJ, combining ACUTE> you cannot type
> another CEDILLA after it without pressing COMPOSE again before it, to
> get <C,  CGJ, combining ACUTE, CGJ, combining CEDILLA>.
> The result is clearly abusing the use of CGJ when the input output
> should just be canonically equivalent to
> <C,  combining ACUTE, combining CEDILLA> (i.e. without any CGJ at all)

Lower case specimen? c????  (this was in NFD as I edited it)
Actually, I would prefer to avoid the first, unnecessary CGJ.
Lower case specimen: c???  (the was in NFD as I edited it)

> Your system would be even less meaningful, it would break in most
> renderers

Some, not all.  It renders fine in Firefox, though one can of course
set up input forms so that not even Thai renders properly.

> and spell checkers.

Most of the stuff I currently write with two combining marks of
non-zero ccc already fails with spell checkers.

> It would break in IDNA domain names.

No, it wouldn't.  If you consult Table B.1 in
http://tools.ietf.org/html/rfc3454#appendix-B 'Stringprep', you will
see that CGJ is stripped out.  For example, the URL
http://www.c???? .com, using the first specimen above, successfully
reached http://www.?.com/ when I used Firefox.

> would not match in plain text search unless they are tuned so that
> ther collators discard the CGJs to look for fuzzy matches (fuzzy
> matches would also look for strings that are compatibility equivalent
> under NFKD, or could search at collation levels 2, or at collation
> level 1 ignoring all diacritics and CGJ wherever they are).

Collation Level 3 searches would work for what I type.  Level 2 can
have a problem with diacritics frozen in the wrong order.

> So compose keys cause more confusion to native users than dead keys
> that are smarter as they can record more internal states and also
> allow arbitrary order of input for unordered diacritics (like acute
> plus cedilla : you can press their dead key in any order, the IME or
> driver handles the case and generates them, preferably in canonical
> order with growing combining classes; the drive or IME alos generates
> them in an input state where it also knows the base letter to ouput,
> it can precombine the diacritics and so it will output C WITH
> CEDILLA, followed by COMBINING ACUTE, as expected, and still without
> needing any CGJ).

A better easy solution is for backspace just to delete the previous
character, so the user will often find what he wants.  There is
then no need for the extra CGJ.  Commands to step into a cluster would
be helpful, but are more difficult.

One thing that bothers me is that no-one has come forward with the
conventions that an application must follow to work with Tavultesoft
Keyman and its derivatives and imitations.

Richard.


From marc at keyman.com  Sun Mar 23 17:46:49 2014
From: marc at keyman.com (Marc Durdin)
Date: Sun, 23 Mar 2014 22:46:49 +0000
Subject: Editing Sinhala and Similar Scripts
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>

All the Keyman products -- on Windows, web, iOS and Android, as well as KMFL, which is a port of Keyman, work on the principle of modifying the text buffer directly.  There is no intermediate compose buffer.  For Indic and western scripts this works pretty well; the compose buffer which is a feature of IMEs does not fit these scripts cleanly in my experience.  It is often hard to know when a text entry is 'complete' for committing the compose buffer, and one effect is that the compose buffer tends to get very long, which makes accidental cancellation of input a common and frustrating issue.

The most obvious backspace intelligence I've seen in use is around handling NFC vs NFD text.  It is confusing to the end user if backspace sometimes deletes a whole character + diacritic, and sometimes just the diacritic mark.  For example, Vietnamese text has suffered from this issue with the varying composition schemes we've seen enforced by limited input methods.

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham
Sent: Monday, 24 March 2014 12:07 AM
To: unicode at unicode.org
Subject: Re: Editing Sinhala and Similar Scripts

On Sun, 23 Mar 2014 03:32:06 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> This is wrong, the IME or keyboard driver handles the state of 
> keystrokes, even if you use a COMPOSE key or a DEAD KEY, this does not 
> matter, and so it won't feed the encoded text with streams of 
> characters as long as the state is not complete enough:

This is certainly not true of Keyman for Linux (KMFL), and I don't believe it is true of Tavultesoft Keyman for Windows either.  This does require that the input method have a way of cancelling previously provided input.  Now, if you use a method with a COMPOSE key or a DEAD key, you are generally unlikely to get tentative entries.
However, one could write an input method that simulated a dead key but actually generated an output for it so as to imitate a typewriter differently.

> The effect of Backspace entered just after it would delete 
> simulatenously CGJ and the diacritic characters. It does not need to 
> depend on the input state of the driver or the IME. In all cases, 
> nothing in the keyboard mapping or IME will generate a CGJ character 
> isolately, ir will be always followed by something.

If backspace is not modified by the input method - and Marc Durdin has suggested that the input method should sometimes modify it - its effect will depend on the process controlling the backing store, which in general will work with multiple input methods, even during the course of a single editing session.  You might not write an input method that generates a single CGJ, but I do.  Do you insist on a soft hyphen when writing 'Llangollen' so that it will collate after 'Llanberis' in Welsh?  (I typed the place names in English; the names are spelt the same way in English and Welsh in hardcopy, though of course the letter counts differ.)


From duerst at it.aoyama.ac.jp  Mon Mar 24 04:54:22 2014
From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=)
Date: Mon, 24 Mar 2014 18:54:22 +0900
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <B6B31BB04593D64F8B3E169A5C3DC62F233F87B1@USPHLEMB12C.global.corp.sap>
References: <20140319093913.665a7a7059d7ee80bb4d670165c8327d.a167ee6adb.wbe@email03.secureserver.net>
 <B6B31BB04593D64F8B3E169A5C3DC62F233F87B1@USPHLEMB12C.global.corp.sap>
Message-ID: <533000CE.1050204@it.aoyama.ac.jp>

On 2014/03/20 02:28, Whistler, Ken wrote:

> It is really annoying, particularly to efficient typists, when
> a sequence of 4 keystrokes is *not* exactly undone by a
> sequence of 4 backspace strokes. When that occurs, the
> flow of text composition is suddenly interrupted by forcing
> the user out of "compose" mode and into a completely different
> "monitor and check what the state of the display is" mode that
> can be very annoying.

It is certainly very annoying to a typist who is used to one backspace 
stroke removing one original keystroke. But not all typists are used to 
this. If I e.g. type Japanese, then depending on the syllable, there are 
one or more keystrokes for each Kana character, and because I'm entering 
Kana and only my fingers type Romaji, removing one Kana per backspace 
stroke isn't necessarily less natural than a more straightforward 
correspondence.

Regards,   Martin.


From richard.wordingham at ntlworld.com  Mon Mar 24 17:06:02 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 24 Mar 2014 22:06:02 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>
Message-ID: <20140324220602.6b8d11f1@JRWUBU2>

On Sun, 23 Mar 2014 22:46:49 +0000
Marc Durdin <marc at keyman.com> wrote:

> All the Keyman products -- on Windows, web, iOS and Android, as well
> as KMFL, which is a port of Keyman, work on the principle of
> modifying the text buffer directly.

I had been going to remark that they couldn't do that directly, but
further research showed that Text Services Framework and GTK+ both
allow it to be done in fact as opposed to merely in principle.  (Does
this mean that the Keyman substitution rules follow the principle of
canonical equivalence?)
 
> The most obvious backspace intelligence I've seen in use is around
> handling NFC vs NFD text.  It is confusing to the end user if
> backspace sometimes deletes a whole character + diacritic, and
> sometimes just the diacritic mark.  For example, Vietnamese text has
> suffered from this issue with the varying composition schemes we've
> seen enforced by limited input methods.

However, with Keyman and KMFL there are fallbacks for when the text
buffer is not accessible. e.g. when using X and presumably when using a
Windows program that does not use the Text Services Framework. Version
1.07 of the interface between ibus and KMFL shows that a backspace
character generated by the input method (as opposed to simply passed on
from the keyboard) is intended to delete exactly one character.  It
seems to me that at least where fallbacks are used, the backing store
that KMFL wishes to delete must be in the state in which KMFL placed it
- intervening normalisation will corrupt the input.  Is there an
explicit statement of this anywhere?

When using X, it is possible to tell a backspace generated by the 'Input
Method' from one simply generated by the keyboard; the keycode is 0 in
the former case but not the latter.

Richard.


From marc at keyman.com  Mon Mar 24 17:37:59 2014
From: marc at keyman.com (Marc Durdin)
Date: Mon, 24 Mar 2014 22:37:59 +0000
Subject: Editing Sinhala and Similar Scripts
In-Reply-To: <20140324220602.6b8d11f1@JRWUBU2>
References: <1CEDD746887FFF4B834688E7AF5FDA5A6DD4758E@federation.tavultesoft.local>
 <20140324220602.6b8d11f1@JRWUBU2>
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A6DD5046F@federation.tavultesoft.local>

Richard Wordingham wrote:
>
> On Sun, 23 Mar 2014 22:46:49 +0000
> Marc Durdin <marc at keyman.com> wrote:
> 
> > All the Keyman products -- on Windows, web, iOS and Android, as well
> > as KMFL, which is a port of Keyman, work on the principle of modifying
> > the text buffer directly.
> 
> I had been going to remark that they couldn't do that directly, but further
> research showed that Text Services Framework and GTK+ both allow it to be
> done in fact as opposed to merely in principle.  (Does this mean that the
> Keyman substitution rules follow the principle of canonical equivalence?)

Currently, KMFL and Keyman do no normalization of text, treating it as a raw Unicode character stream -- catering for this is up to the input method.

Keyman works directly with the text store in the web, iOS and Android versions, and where possible on Windows -- which in practice means the very few applications that have enough support for Text Services Framework, including, for example, MS Word, SIL FLEx, and the RichEdit control.  Otherwise it works in a fallback mode where it retains the last sequence of characters typed in a cache until an intervening event causes it to flush its cache.  Not perfect, but covers the vast majority of cases without issue (as in, less than one support case per month...)

>It seems to me that at
> least where fallbacks are used, the backing store that KMFL wishes to delete
> must be in the state in which KMFL placed it
> - intervening normalisation will corrupt the input.  Is there an explicit
> statement of this anywhere?

Yes, this is true for both Keyman and KMFL when working in fallback modes.  In practice, it's rarely a problem.

Marc


From wjgo_10009 at btinternet.com  Thu Mar 27 03:13:52 2014
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 27 Mar 2014 08:13:52 +0000 (GMT)
Subject: Does regular Unicode have a character that looks like a space to a
 human yet is not treated as a space by software please?
Message-ID: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>

Does regular Unicode have a character that looks like a space to a human yet is not treated as a space by software please?

Please consider my use of U+E001 in the following thread.

https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books

Essentially, can that effect be achieved without using a Private Use Area character?

William Overington

27 March 2014


From jkorpela at cs.tut.fi  Thu Mar 27 03:42:20 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Thu, 27 Mar 2014 10:42:20 +0200
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <5333E46C.702@cs.tut.fi>

2014-03-27 10:13, William_J_G Overington wrote:

> Does regular Unicode have a character that looks like a space to a
> human yet is not treated as a space by software please?

It depends, among other things, on what you mean by ?space?.

There?s U+00A0 NO-BREAK SPACE, which surely isn?t the same as U+0020 
SPACE, but might be called a space. Programs can do different things to 
different characters.

> Please consider my use of U+E001 in the following thread.
>
> https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books

As far as I can see, the question is about indenting text in e-books. 
What I do in my e-books is a simple CSS setting, margin-left (or 
padding-left) with a suitable value. There are many other ways too.

Or you could even use a sequence of U+00A0 characters at the star of a 
line. There is no exact definition of what should happen, but in 
practice, HTML user agents, including e-book readers, treat U+00A0 as 
yet another graphic character, which just happens to have an empty 
glyph. Well, they may also seem to be honoring the non-breaking 
property, but this might be just incidental (they generally don?t break 
before or after graphic characters except whitespace characters, and 
U+00A0 is by HTML definition not whitespace).

There are also other characters that can be called ?spaces?, such as 
U+2002 EN SPACE. But they have properties similar to the properties of 
U+0020 SPACE, so we can expect some programs to handle them the same way 
as SPACE, in some respect. Sorry for this vagueness, but it reflects the 
vagueness of the question.

Yucca


From KalvesmakiJ at doaks.org  Thu Mar 27 08:10:30 2014
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Thu, 27 Mar 2014 13:10:30 +0000
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <3281b40c-bf32-4bd2-b83f-0b46a811474d@unicode.org>
Message-ID: <CF599AB2.2CEDF%KalvesmakiJ@doaks.org>

William, try the U+2000..U+200A glyphs under General Punctuation--I think
that's what you're looking for to manage precise widths of blank space.
And many (most?) software routines do not treat these as part of the class
of spacing characters (\s in regular expressions).

Best wishes,

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
1703 32nd St. NW
Washington, DC 20007
(202) 339-6435

On 3/27/14 4:13 AM, "William_J_G Overington" <wjgo_10009 at btinternet.com>
wrote:


>Does regular Unicode have a character that looks like a space to a human
>yet is not treated as a space by software please?
>
>Please consider my use of U+E001 in the following thread.
>
>https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-bo
>oks
>
>Essentially, can that effect be achieved without using a Private Use Area
>character?
>
>William Overington
>
>27 March 2014
>
>_______________________________________________
>Unicode mailing list
>Unicode at unicode.org
>http://unicode.org/mailman/listinfo/unicode


From sittipon at x10studio.com  Thu Mar 27 03:14:33 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Thu, 27 Mar 2014 15:14:33 +0700
Subject: Pali in Thai Script
Message-ID: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>

Hi,

I am a volunteer programmer working for Tipitaka studies foundation in Thailand.
We are working on a new project about Pali in Thai script with special emphasize on the pronunciation aspect. 

Since, Pali here is written using an everyday use Thai characters with a couple of extra symbols. Most people will read out using their normal Thai voices for all consonants (e.g. ? is read as ?kha? and not ?ga?), which make Thai spoken Pali differently from people not trained in Thailand.

In order to ease this situation, we have created an orthography font (slightly modified from the existed Thai font) and used them internally. I have to admit that, currently, we are changing the glyphs from time to time. But, we are looking forward to establish the studies nationwide in the near future once everything is in place.

I was wondering what is the unicode community opinion on these new characters.

Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated.
https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png

Thai consonants with Black dot for non-aspirated and White dot for aspirated.
https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png

These are all the characters we need beside the normal Thai characters. 
Is it possible for us to submit/add these new characters to unicode once everything is in place?
If it is possible, should we separate them into a new symbol for black dot and white dot, or simply call KO KAI with black dot as a new character?
We are open to suggestions.

Thanks a lot everyone!
Sittipon
  

From jkorpela at cs.tut.fi  Thu Mar 27 10:04:13 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Thu, 27 Mar 2014 17:04:13 +0200
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <CF599AB2.2CEDF%KalvesmakiJ@doaks.org>
References: <CF599AB2.2CEDF%KalvesmakiJ@doaks.org>
Message-ID: <53343DED.5040304@cs.tut.fi>

2014-03-27 15:10, Kalvesmaki, Joel wrote:

> William, try the U+2000..U+200A glyphs under General Punctuation--I think
> that's what you're looking for to manage precise widths of blank space.

That range contains some ?fixed-width spaces?, yes. Being ?fixed-width? 
is rather relative here, though, and many fonts do not contain these 
characters. Rendering software could of course display them by just 
leaving suitable spacing, but that?s not common.

The ?fixed-width spaces? are mostly just legacy characters, holdover 
from old typography. They may have their uses, though, in contexts where 
they work and other spacing methods don?t (for example, I recently 
noticed that they seem to be the only way to create a little spacing 
between an inline equation and normal character in MS Word).

But for the purposes of indenting text lines, I don?t think they are 
useful. In almost all cases, there are better tools for indentation.

> And many (most?) software routines do not treat these as part of the class
> of spacing characters (\s in regular expressions).

Well, most regexp implementations are very Ascii-oriented: notations 
like \s, \w, \d, etc. match Ascii characters only.

Yucca


From addison at lab126.com  Thu Mar 27 10:07:16 2014
From: addison at lab126.com (Phillips, Addison)
Date: Thu, 27 Mar 2014 15:07:16 +0000
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
References: <1395908032.13297.YahooMailNeo@web87803.mail.ir2.yahoo.com>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB517DF043C@ex10-mbx-36009.ant.amazon.com>

The thread on serif.com discusses formatting of poetry in a Kindle book. The problem is that the author would like to indent two lines.

You don't want to do that by using a character that "looks like a space" yet isn't seen by the software to be a space. This would break features like dictionary lookup on the first word on each of those lines. The actual solution is to style the text as indented. There are some guidelines on the KDP site.

   http://www.amazon.com/gp/feature.html?docId=1000729511 
   https://kdp.amazon.com/help?topicId=A17W8UM0MMSQX6#para

One way to achieve the desired goal is to use the 'margin' and 'text-align' CSS styles.

Addison

Addison Phillips
Globalization Architect (Amazon Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of
> William_J_G Overington
> Sent: Thursday, March 27, 2014 1:14 AM
> To: unicode at unicode.org
> Cc: wjgo_10009 at btinternet.com
> Subject: Does regular Unicode have a character that looks like a space to a
> human yet is not treated as a space by software please?
> 
> Does regular Unicode have a character that looks like a space to a human yet is
> not treated as a space by software please?
> 
> Please consider my use of U+E001 in the following thread.
> 
> https://community.serif.com/forum/pageplus/9646/formatting-poetry-for-e-books
> 
> Essentially, can that effect be achieved without using a Private Use Area
> character?
> 
> William Overington
> 
> 27 March 2014
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From KalvesmakiJ at doaks.org  Thu Mar 27 10:37:12 2014
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Thu, 27 Mar 2014 15:37:12 +0000
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <8fa01889-fb05-499e-a34b-57bb975769fc@unicode.org>
Message-ID: <CF59BC91.2CF48%KalvesmakiJ@doaks.org>

Points taken. I just note for the record that in academic publishing and
scholarly editions these spacing characters are actively used,
particularly in InDesign files and in diplomatic editions rendered in XML.

The legacy lives.

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
1703 32nd St. NW
Washington, DC 20007
(202) 339-6435


>
>The ?fixed-width spaces? are mostly just legacy characters, holdover
>from old typography. They may have their uses, though, in contexts where
>they work and other spacing methods don?t (for example, I recently
>noticed that they seem to be the only way to create a little spacing
>between an inline equation and normal character in MS Word).


From budelberger.richard at wanadoo.fr  Thu Mar 27 12:38:05 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 18:38:05 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
Message-ID: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>

> Message du 27/03/14 15:43
> De : Sittipon Simasanti 
> A : unicode at unicode.org
> Objet : Pali in Thai Script
> 
> Hi,

Beuar,

> I am a volunteer programmer working for Tipitaka studies foundation in Thailand. 
> We are working on a new project about Pali in Thai script with special emphasize 
> on the pronunciation aspect. Since, Pali here is written using an everyday use 
> Thai characters with a couple of extra symbols. Most people will read out using 
> their normal Thai voices for all consonants (e.g. ? is read as ?kha? and not ?ga?), 
> which make Thai spoken Pali differently from people not trained in Thailand. 
> In order to ease this situation, we have created an orthography font (slightly 
> modified from the existed Thai font) and used them internally. I have to admit 
> that, currently, we are changing the glyphs from time to time. But, we are looking 
> forward to establish the studies nationwide in the near future once everything is 
> in place. I was wondering what is the unicode community opinion on these new 
> characters. Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated. 
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png Thai consonants with 
> Black dot for non-aspirated and White dot for aspirated. 
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png These are 
> all the characters we need beside the normal Thai characters. Is it possible for us 
> to submit/add these new characters to unicode once everything is in place? If it is 
> possible, should we separate them into a new symbol for black dot and white dot, 
> or simply call KO KAI with black dot as a new character? 
>
> We are open to suggestions.

Very interesting?! we already have ?Garshuni?, that is, basically, Arabic written in Syriac script?(cf.?http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other 
languages, as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin?(cf.?http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque), 
Ancient?Greek?(cf.?http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque)? and even a kind of ?reverse-Garshuni?, that is Syriac in 
Modern?Greek?script?(cf.?http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque)?!? That? what George Kiraz called 
?garshunography??(cf.?http://en.wikipedia.org/wiki/Garshuni).

And now, Pali. Not Thai in Pali script, but Pali in Thai script?

Do you know how many languages are concerned by this ?Paligarshunography??? Since ho many centuries??

> Thanks a lot everyone! 
>
> Sittipon


From budelberger.richard at wanadoo.fr  Thu Mar 27 12:58:13 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 18:58:13 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
Message-ID: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11>

> Message du 27/03/14 15:43
> De : Sittipon Simasanti 
> A : unicode at unicode.org
> Objet : Pali in Thai Script
> 
> Hi,

Beuar,

> I am a volunteer programmer working for Tipitaka studies foundation in Thailand. 
> We are working on a new project about Pali in Thai script with special emphasize 
> on the pronunciation aspect. Since, Pali here is written using an everyday use 
> Thai characters with a couple of extra symbols. Most people will read out using 
> their normal Thai voices for all consonants (e.g. ? is read as ?kha? and not ?ga?), 
> which make Thai spoken Pali differently from people not trained in Thailand. 
> In order to ease this situation, we have created an orthography font (slightly 
> modified from the existed Thai font) and used them internally. I have to admit 
> that, currently, we are changing the glyphs from time to time. But, we are looking 
> forward to establish the studies nationwide in the near future once everything is 
> in place. I was wondering what is the unicode community opinion on these new 
> characters. Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated. 
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png Thai consonants with 
> Black dot for non-aspirated and White dot for aspirated. 
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png These are 
> all the characters we need beside the normal Thai characters. Is it possible for us 
> to submit/add these new characters to unicode once everything is in place? If it is 
> possible, should we separate them into a new symbol for black dot and white dot, 
> or simply call KO KAI with black dot as a new character? 
>
> We are open to suggestions.

I?m afraid to say that since PHO SAMPHAO with White?dot (for aspirated)?? https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l.?5 c.?2 ? may be (badly) drawn with 
U+0E20?THAI CHARACTER PHO?SAMPHAO and U+0325?COMBINING RING BELOW, ??????, you have to use your Internal Font?

> Thanks a lot everyone! 
>
> Sittipon


From richard.wordingham at ntlworld.com  Thu Mar 27 13:36:18 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Mar 2014 18:36:18 +0000
Subject: Pali in Thai Script
In-Reply-To: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
Message-ID: <20140327183618.0bfd3a2b@JRWUBU2>

On Thu, 27 Mar 2014 15:14:33 +0700
Sittipon Simasanti <sittipon at x10studio.com> wrote:

> Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated.
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png
> 
> Thai consonants with Black dot for non-aspirated and White dot for
> aspirated.
> https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png

Those descriptions confused me - the black dot means 'voiced and not
aspirated', and the white dot means 'voiced and aspirated'.

> These are all the characters we need beside the normal Thai
> characters. Is it possible for us to submit/add these new characters
> to unicode once everything is in place? If it is possible, should we
> separate them into a new symbol for black dot and white dot, or
> simply call KO KAI with black dot as a new character? We are open to
> suggestions.

If your scheme has sufficient success, each combination of base letter
and diacritic may well be encoded as a separate letter because the
position of the diacritic is not obvious.  I presume we're looking at no
more than about 12 new characters - DO CHADA WITH BLACK DOT is an
obvious competitor to THO NANGMONTHO WITH BLACK DOT.

I'm disappointed you found that simply adding a black dot for the
voiced consonants didn't work.  If it had worked, then we might
have argued that this was just a font variation.

Richard.


From richard.wordingham at ntlworld.com  Thu Mar 27 14:12:39 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 27 Mar 2014 19:12:39 +0000
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <20140327191239.4cd56d46@JRWUBU2>

On Thu, 27 Mar 2014 18:38:05 +0100 (CET)
Richard BUDELBERGER <budelberger.richard at wanadoo.fr> wrote:

> And now, Pali. Not Thai in Pali script, but Pali in Thai script?

There is no Pali script as such, though sometimes Pali is written in a
neighbour's script rather than one's own.  What's more surprising is
that Pali wasn't regularly written in the Thai script until Rama IV
ordered the change.  Instead, the Buddhist script in his domains was the
Khom script (a variety of the Khmer script, with several unencoded
characters for Thai) in the south and the Tai Tham script in the north.

Richard.


From chris.fynn at gmail.com  Thu Mar 27 14:50:38 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 01:50:38 +0600
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>

On 27/03/2014, Richard BUDELBERGER <budelberger.richard at wanadoo.fr> wrote:

> And now, Pali. Not Thai in Pali script, but Pali in Thai script?

There is no standard script for P??i - It is often written in
Devanagri, Sinhala, Myanmar, Thai, Lao, Khmer, Latin, and several
other scripts.

I do think there is quite a need for a utility to convert P??i written
in any one of these scripts to any of the others,

- Chris


From ed.trager at gmail.com  Thu Mar 27 15:08:29 2014
From: ed.trager at gmail.com (Ed Trager)
Date: Thu, 27 Mar 2014 16:08:29 -0400
Subject: Pali in Thai Script
In-Reply-To: <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
Message-ID: <CAP6tU+mGf=xWtneumSZ=qXsWctBfcGa98800Q6_rurq652GA3Q@mail.gmail.com>

Hi, Chris,

Besides the scripts you mention, there is also Tai Tham as Richard
mentioned.

In theory, writing a utility to convert Pali written in any of those
scripts to any one of the other scripts should not be too difficult but ...
:

* Modern phonetically-based Lao lacks some of the traditional letters that
are still preserved in Thai and other scripts.

* At least as far as Tai Tham goes, it seems that Tai Tham spelling is not
consistent with Central Thai spelling when it comes to Sanskrit and
Pali-derived words ... I don't really know much about this -- just my own
limited observations. Probably somebody else here like Richard Wordingham
or Martin Hosken knows a lot more about this than I do ...

... so maybe in reality it is not so simple to do?


On Thu, Mar 27, 2014 at 3:50 PM, Christopher Fynn <chris.fynn at gmail.com>wrote:

> On 27/03/2014, Richard BUDELBERGER <budelberger.richard at wanadoo.fr> wrote:
>
> > And now, Pali. Not Thai in Pali script, but Pali in Thai script?
>
> There is no standard script for P??i - It is often written in
> Devanagri, Sinhala, Myanmar, Thai, Lao, Khmer, Latin, and several
> other scripts.
>
> I do think there is quite a need for a utility to convert P??i written
> in any one of these scripts to any of the others,
>
> - Chris
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140327/35c7fc37/attachment.html>

From chris.fynn at gmail.com  Thu Mar 27 16:48:05 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 03:48:05 +0600
Subject: Pali in Thai Script
In-Reply-To: <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
Message-ID: <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>

On 28/03/2014, Ed Trager <ed.trager at gmail.com> wrote:

> Hi, Chris,

> Besides the scripts you mention, there is also Tai Tham as Richard
> mentioned

Several un-encoded Mon and Shan scripts too - as well as other Indic scripts.
>
> In theory, writing a utility to convert Pali written in any of those
> scripts to any one of the other scripts should not be too difficult but ...

> * Modern phonetically-based Lao lacks some of the traditional letters that
> are still preserved in Thai and other scripts.

Are there old Lao characters (once) used for writing P??i?

Even if there is not a  1 to 1 correspondence - as long as there is
consistency in the way P??i is written in each script - and you know
you are dealing with P??i and not another language written in that
script, it should be possible.

> * At least as far as Tai Tham goes, it seems that Tai Tham spelling is not
> consistent with Central Thai spelling when it comes to Sanskrit and
> Pali-derived words ... I don't really know much about this -- just my own
> limited observations. Probably somebody else here like Richard Wordingham
> or Martin Hosken knows a lot more about this than I do ...

A problem might be if scribal errors have crept in over the centuries
and some of these misspellings have become accepted in one script or
another.

I think there is work going on to make a very carefully edited
critical edition of the P??i Canon - it
would be useful to be able to convert and print this out in the
scripts used in the different countries where Therav?da Buddhism is
popular.

> ... so maybe in reality it is not so simple to do?
>
> - Ed
>
>
> On Thu, Mar 27, 2014 at 3:50 PM, Christopher Fynn
> <chris.fynn at gmail.com>wrote:
>
>> On 27/03/2014, Richard BUDELBERGER <budelberger.richard at wanadoo.fr>
>> wrote:
>>
>> > And now, Pali. Not Thai in Pali script, but Pali in Thai script?
>>
>> There is no standard script for P??i - It is often written in
>> Devanagri, Sinhala, Myanmar, Thai, Lao, Khmer, Latin, and several
>> other scripts.
>>
>> I do think there is quite a need for a utility to convert P??i written
>> in any one of these scripts to any of the others,
>>
>> - Chris


From rick at unicode.org  Thu Mar 27 17:06:36 2014
From: rick at unicode.org (Rick McGowan)
Date: Thu, 27 Mar 2014 15:06:36 -0700
Subject: Pali in Thai Script
In-Reply-To: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
Message-ID: <5334A0EC.6050500@unicode.org>

Hello,

This is an interesting discussion so far...

What is the current situation of Pali written in the Thai script? Is 
there a scholarly tradition already? Why are new symbols being used for 
this purpose in this project? Is it because nothing else exists at this 
time? Or some other reason? Has this never been done before?

I'm trying to understand the particular scholarly need that will be 
addressed by this project, and to know why some other existing symbols 
are not, or cannot, be used for this purpose. It would help to get a 
sense of the project scope, and how it relates to previous and current 
Pali scholarship in Thailand. And what alternative solutions have been 
discussed and/or used by the project participants.

(Also to be clear: I'm only asking these questions out of personal 
curiosity, not an official question on behalf of the UTC or anything 
like that.)

Thanks,
Rick

On 3/27/2014 1:14 AM, Sittipon Simasanti wrote:
> In order to ease this situation, we have created an orthography font (slightly modified from the existed Thai font) and used them internally. I have to admit that, currently, we are changing the glyphs from time to time. But, we are looking forward to establish the studies nationwide in the near future once everything is in place.


From budelberger.richard at wanadoo.fr  Thu Mar 27 17:21:17 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 23:21:17 +0100 (CET)
Subject: Pali in Thai Script
Message-ID: <22776183.17522.1395958877729.JavaMail.www@wwinf1m14>

> Message du 27/03/14 19:43
> De : Richard Wordingham 
> Copie ? : unicode at unicode.org
> Objet : Re: Pali in Thai Script
> 
> On Thu, 27 Mar 2014 15:14:33 +0700
> Sittipon Simasanti wrote:
> 
> > Normal KO KAI and KO KAI with black dot to make KO KAI non-aspirated.
> > https://dl.dropboxusercontent.com/u/824603/unicode/glyph.png
> > 
> > Thai consonants with Black dot for non-aspirated and White dot for
> > aspirated.
> > https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png
> 
> Those descriptions confused me - the black dot means 'voiced and not
> aspirated', and the white dot means 'voiced and aspirated'.

https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png : I think that ???????????? means ?unaspirated? and ????????? ?aspirated?? (But yes, Sittipon Simasanti?s message is not very clear. See http://twitpic.com/dzk1o3 : do you understand it?? I, no.)


(The True) Richard.

Note?: Tipitaka Studies Foundation Internal Font uses U+0325 ?? combining ring below ? the ?voiceless? diacritic?: https://en.wikipedia.org/wiki/Voice_(phonetics) ? for (un)aspiration ?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140327/c27af55f/attachment.html>

From budelberger.richard at wanadoo.fr  Thu Mar 27 17:31:08 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Thu, 27 Mar 2014 23:31:08 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To: <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
Message-ID: <494994255.17641.1395959468087.JavaMail.www@wwinf1m14>

> Message du 27/03/14 22:56
> De : Christopher Fynn 
> A : Ed Trager
> Copie ? : Unicode List
> Objet : Re: Pali in Thai Script
> 
> On 28/03/2014, Ed Trager wrote: 
>> Besides the scripts you mention, there is also Tai Tham as Richard
>> mentioned
> 
> Several un-encoded Mon and Shan scripts too - as well as other Indic scripts.
> 
>> In theory, writing a utility to convert Pali written in any of those
>> scripts to any one of the other scripts should not be too difficult but ...
> 
>> * Modern phonetically-based Lao lacks some of the traditional letters that
>> are still preserved in Thai and other scripts.
> 
> Are there old Lao characters (once) used for writing P??i?
> 
> Even if there is not a 1 to 1 correspondence - as long as there is
> consistency in the way P??i is written in each script - and you know
> you are dealing with P??i and not another language written in that
> script, it should be possible.

What can I say with my experience from Garshuni, is that the rule is that there is no (strict) rules, and that the only consistency I saw in writing two related languages (Arabic and Syriac) 
is inconsistency.

So, imagine with an Indic and Asiatic languages.


From richard.wordingham at ntlworld.com  Thu Mar 27 19:23:49 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 28 Mar 2014 00:23:49 +0000
Subject: Pali in Thai Script
In-Reply-To: <5334A0EC.6050500@unicode.org>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <5334A0EC.6050500@unicode.org>
Message-ID: <20140328002349.67a826f3@JRWUBU2>

On Thu, 27 Mar 2014 15:06:36 -0700
Rick McGowan <rick at unicode.org> wrote:

> What is the current situation of Pali written in the Thai script? Is 
> there a scholarly tradition already?

There was a scholarly tradition of writing Pali in the Khom
script (Bangkok,as successor to Ayutthaya) or Tai Tham script (north and
northeast).  For secular writing, there were versions of the Tai/Lao
script, which had additional letters whose purpose was to retain the
consonant distinctions made in the religious scripts.  Rama IV
commanded (whether as Patriarch or later as king - I don't remember)
instructed that religious writing be switched to the Thai script.  He
also promulgated a change in the writing system for Pali, whereby the
two vowel killers, YAMAKKAN and THANTHAKHAT, were 'simplified' to a
single diacritic, PHINTHU.  There was thus in principal an immediate
'tradition' of writing Pali in the Thai script.

Now, there is actually a problem in writing Pali in the Thai script.
When a preposed vowel phonetically follows a consonant cluster in the
middle of a word, does it proceed or follow the first vowel?  There
seems to be a lot of inconsistency, as Vinodh and I found out when
trying to work out the rules so that he could transliterate his
master copy of the Tipitaka into the Thai script.  I was working
from a Thai CD of the Tipitaka, and was quite startled by the
internal inconsistency in the spelling in the Thai script.

There are two other problems.  The system of phinthu and implicit
vowel and NIKHAHIT to write the anusvara is quite different to the way
that Thai is actually written.  For private recitation of Pali, a
tradition has grown up of using MAI HANAKAT and SARA A, which are not
used in traditional Pali spelling, to replace the implicit vowels,
thus creating a Thai script writing system for Thai that is actually an
alphabet rather than an abugida.

The second problem is the 'great consonant shift' whereby old voicing
contrasts were lost in much of East Asia, covering most Tai,
Mon-Khmer and Chinese dialects.  (The change is not complete - some
areas have escaped the change.)  Consequently, the more conservative
Sinhalese and Burmese pronunciations are quite different to the Thai
and Lao (and Mon and Khmer) pronunciations.  The Thai and Lao
pronunciations have replaced voiced stops by voiceless aspirates.

> Why are new symbols being used for this purpose in this project?

The ideas of the new symbols it to restore the ancient pronunciation.
Just as a Classical Latin pronunciation differs greatly from English
legal Latin or Roman Catholic Church Latin, and is very different to
how Latin loan words are pronounced in English, the modern Thai
consonant sounds are very different to the ancient Pali sounds.

> I'm trying to understand the particular scholarly need that will be 
> addressed by this project, and to know why some other existing
> symbols are not, or cannot, be used for this purpose.

The problem with the traditional symbols is that they are pronounced
quite differently in Thai.  ????? is /budd?a/ in the ancient
pronunciation, but ???? is /p?ut(t?a)/ in Thai pronunciation.  (Thai
doesn't use PHINTHU.)  An analogy is that 'Caesar' is
pronounced /si?z?/ in British English, but is approximated as /ka?sar/
in a Latin class in England.  Apart from the possible examples of Pali
and Sanskrit pronounced in the Indian way, most Thais are probably not
accustomed to Thai letters being pronounced differently in different
languages. 

Richard.


From richard.wordingham at ntlworld.com  Thu Mar 27 20:03:20 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 28 Mar 2014 01:03:20 +0000
Subject: Pali in Thai Script
In-Reply-To: <CAP6tU+mGf=xWtneumSZ=qXsWctBfcGa98800Q6_rurq652GA3Q@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+mGf=xWtneumSZ=qXsWctBfcGa98800Q6_rurq652GA3Q@mail.gmail.com>
Message-ID: <20140328010320.2fcf9c6c@JRWUBU2>

On Thu, 27 Mar 2014 16:08:29 -0400
Ed Trager <ed.trager at gmail.com> wrote:

> Hi, Chris,
> 
> Besides the scripts you mention, there is also Tai Tham as Richard
> mentioned.
> 
> In theory, writing a utility to convert Pali written in any of those
> scripts to any one of the other scripts should not be too difficult
> but ... :
> 
> * Modern phonetically-based Lao lacks some of the traditional letters
> that are still preserved in Thai and other scripts.
> 
> * At least as far as Tai Tham goes, it seems that Tai Tham spelling
> is not consistent with Central Thai spelling when it comes to
> Sanskrit and Pali-derived words ... I don't really know much about
> this -- just my own limited observations.

Modern Siamese spelling is highly Sanskritised.  It has also been
simplified by the elimination of final 'geminate' clusters.  There are
also quite a few differences in the reflexes of P/S /a/ in closed
syllables, and certainly the spelling of the Mae Fah Luang dictionary
reflects vowel changes that Siamese spelling simply ignores.

Having said that, some Tai Tham spelling has geminates where the
evidence of other varieties of Pali is that there should not be
geminates - what should etymologically be written <HIGH SA, consonant,
MAI SAM> is often written <GREAT SA, consonant>.  I have seen
remarks that the Pali of inland SE Asia is rather different from
that of Sri Lanka.

There are other issues, such as the merger of HIGH SA and HIGH CHA in
some varieties, so that what should be the cluster <HIGH CA, SAKOT,
HIGH CHA> actually appears to be <HIGH CA, SAKOT, HIGH SA>.  There is
also the tendency of <SAKOT, BA> to be used for other labials. 

Richard.


From sittipon at x10studio.com  Thu Mar 27 20:47:17 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Fri, 28 Mar 2014 08:47:17 +0700
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <E9C4C10D-EE5F-4DBA-B128-30EDA01AF19F@x10studio.com>

> 
> Very interesting ! we already have ?Garshuni?, that is, basically, Arabic written in Syriac script (cf. http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other 
> languages, as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin (cf. http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque), 
> Ancient Greek (cf. http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque)? and even a kind of ?reverse-Garshuni?, that is Syriac in 
> Modern Greek script (cf. http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque) !? That? what George Kiraz called 
> ?garshunography? (cf. http://en.wikipedia.org/wiki/Garshuni).
> 
> And now, Pali. Not Thai in Pali script, but Pali in Thai script?
> 

As far as I know, Pali doesn't have its own set of characters. It is often written using languages' characters where its users are familiar with.
The important thing is its voice should be the same no matter what character set you are using. We already have Pali written using Thai script.
This is not entirely a new one. Just a few changes to make Pali written in Thai sounds more like written in other languages.


> Do you know how many languages are concerned by this ?Paligarshunography? ? Since ho many centuries ?

I have no idea. But, should be a lot. Since our neighbors, Lao, Myanmar also have Pali written in their languages. 
And we also have Pali written in English alphabets in our database too.


Sittipon


From sittipon at x10studio.com  Thu Mar 27 21:07:52 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Fri, 28 Mar 2014 09:07:52 +0700
Subject: Pali in Thai Script
In-Reply-To: <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11>
Message-ID: <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com>

> 
> I?m afraid to say that since PHO SAMPHAO with White dot (for aspirated) ? https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l. 5 c. 2 ? may be (badly) drawn with 
> U+0E20 THAI CHARACTER PHO SAMPHAO and U+0325 COMBINING RING BELOW, ? ?? ?, you have to use your Internal Font?
> 

Arr, thanks. We have considered to put them below and above as well. But, white dot above the consonants just look too much like SARA AM (U+0E33) and if we put them below
black dot will look like PINTHU (U+0E3A). Both of them have already their functions in Thai language. So, it might be confusing rather than helping Pali in Thai script.

If possible we would like to keep them in the same place. That's why we put them there. 

Sittipon.


From mark at kli.org  Thu Mar 27 21:28:42 2014
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 27 Mar 2014 22:28:42 -0400
Subject: Pali in Thai Script
In-Reply-To: <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
Message-ID: <5334DE5A.9030405@kli.org>

On 03/27/2014 01:38 PM, Richard BUDELBERGER wrote:
> Very interesting ! we already have ?Garshuni?, that is, basically, Arabic written in Syriac script (cf. http://fr.wiktionary.org/wiki/Category:arabe_en_graphie_syriaque), extended to other
> languages, as Persian, Turkish, Azeri Turkish, Kurdish, Armenian, Malayalam, Latin (cf. http://fr.wiktionary.org/wiki/Category:latin_en_graphie_syriaque),
> Ancient Greek (cf. http://fr.wiktionary.org/wiki/Category:grec_ancien_en_graphie_syriaque)? and even a kind of ?reverse-Garshuni?, that is Syriac in
> Modern Greek script (cf. http://fr.wiktionary.org/wiki/Category:syriaque_en_graphie_grecque) !? That? what George Kiraz called
> ?garshunography? (cf. http://en.wikipedia.org/wiki/Garshuni).
>
> And now, Pali. Not Thai in Pali script, but Pali in Thai script?
>
It's not at all uncommon.  Consider Yiddish, which is essentially German 
written in Hebrew script.  Or various Judeo-Arabics written in Hebrew, 
and the Talmud, which is Aramaic written in Hebrew letters (in pretty 
much every printing and MS I've heard of).

~mark


From budelberger.richard at wanadoo.fr  Thu Mar 27 21:33:40 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Fri, 28 Mar 2014 03:33:40 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To: <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <1643411963.8582.1395943093617.JavaMail.www@wwinf1m11>
 <0856FE7F-CB89-496E-94B0-68CE792C72C5@x10studio.com>
Message-ID: <92226187.139.1395974020618.JavaMail.www@wwinf1m14>

> Message du 28/03/14 03:08
> De : Sittipon Simasanti
> A : Richard BUDELBERGER
> Copie ? : unicode at unicode.org
> Objet : Re: Pali in Thai Script
> 
> > I?m afraid to say that since PHO SAMPHAO with White dot (for aspirated) ? https://dl.dropboxusercontent.com/u/824603/unicode/glyph2.png, l. 5 c. 2 ? may be (badly) drawn with 
> > U+0E20 THAI CHARACTER PHO SAMPHAO and U+0325 COMBINING RING BELOW, ? ?? ?, you have to use your Internal Font?
> > 
> 
> Arr, thanks. We have considered to put them below and above as well. But, white dot above the consonants just look too much like SARA AM (U+0E33) and if we put them below
> black dot will look like PINTHU (U+0E3A). Both of them have already their functions in Thai language. So, it might be confusing rather than helping Pali in Thai script.
> 
> If possible we would like to keep them in the same place. That's why we put them there. 

The tip is to say that dots are not above or below the characters, but inside them.


From sittipon at x10studio.com  Thu Mar 27 21:34:51 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Fri, 28 Mar 2014 09:34:51 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140327183618.0bfd3a2b@JRWUBU2>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <20140327183618.0bfd3a2b@JRWUBU2>
Message-ID: <D367AD6F-4E06-45B6-A0A1-E0A918D8B9A1@x10studio.com>

> 
> Those descriptions confused me - the black dot means 'voiced and not
> aspirated', and the white dot means 'voiced and aspirated'.
> 

Sorry for that, you are right, the picture doesn't represent the entire table so it might be confusing. 
 
Here's the entire one:
https://dl.dropboxusercontent.com/u/824603/unicode/glyph3.png

Only red and green columns have extra black and white dots.
The rest of them are normal Thai characters.

> 
> If your scheme has sufficient success, each combination of base letter
> and diacritic may well be encoded as a separate letter because the
> position of the diacritic is not obvious.  I presume we're looking at no
> more than about 12 new characters - DO CHADA WITH BLACK DOT is an
> obvious competitor to THO NANGMONTHO WITH BLACK DOT.
> 

Yes, 10 characters.


> I'm disappointed you found that simply adding a black dot for the
> voiced consonants didn't work.  If it had worked, then we might
> have argued that this was just a font variation.

Please, see the previous email.
Thanks!

Sittipon


From budelberger.richard at wanadoo.fr  Thu Mar 27 21:59:29 2014
From: budelberger.richard at wanadoo.fr (Richard BUDELBERGER)
Date: Fri, 28 Mar 2014 03:59:29 +0100 (CET)
Subject: Pali in Thai Script
In-Reply-To: <5334DE5A.9030405@kli.org>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <5334DE5A.9030405@kli.org>
Message-ID: <1226242372.188.1395975570026.JavaMail.www@wwinf1m14>

> Message du 28/03/14 03:34
> De : Mark E. Shoulson
> A : unicode at unicode.org
> Objet : Re: Pali in Thai Script
> 
> It's not at all uncommon. Consider Yiddish, which is essentially German 
> written in Hebrew script. Or various Judeo-Arabics written in Hebrew, 
> and the Talmud, which is Aramaic written in Hebrew letters (in pretty 
> much every printing and MS I've heard of). 

(What you call ??Hebrew letters ? are Aramaic letters of the alphabet adopted by Hebrew in Vth?c. BC.)

Or Byelorussian written in Latin script in a Polish way? (More than Ukrainian.)


From mark at kli.org  Thu Mar 27 22:15:50 2014
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 27 Mar 2014 23:15:50 -0400
Subject: Pali in Thai Script
In-Reply-To: <1226242372.188.1395975570026.JavaMail.www@wwinf1m14>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <5334DE5A.9030405@kli.org>
 <1226242372.188.1395975570026.JavaMail.www@wwinf1m14>
Message-ID: <5334E966.4030700@kli.org>

On 03/27/2014 10:59 PM, Richard BUDELBERGER wrote:
>> Message du 28/03/14 03:34
>> De : Mark E. Shoulson
>> A : unicode at unicode.org
>> Objet : Re: Pali in Thai Script
>>
>> It's not at all uncommon. Consider Yiddish, which is essentially German
>> written in Hebrew script. Or various Judeo-Arabics written in Hebrew,
>> and the Talmud, which is Aramaic written in Hebrew letters (in pretty
>> much every printing and MS I've heard of).
> (What you call ? Hebrew letters ? are Aramaic letters of the alphabet adopted by Hebrew in Vth c. BC.)

Of course.  And the Samaritans still write both Hebrew and Aramaic as 
well using truly _Hebrew_ characters (ktav ivri, though of course 
developed by them through history), not the Aramaic-derived ones. But 
Aramaic is more associated with various Syriac alphabets. Still, I was 
reading Aramaic for a long time before I even knew there were Syriac 
alphabets that people wrote Aramaic in, and I still can't particularly 
read those.

I think I've seen colloquial Arabic in Hebrew letters (aimed at teaching 
Hebrew-speakers, to be sure; maybe mostly to avoid having to teach a new 
alphabet).  Someone once sent me a proposal for writing Esperanto in 
Hebrew letters (yes, Aramaic, of course: square Hebrew, ktav ashuri.  
What Unicode calls "HEBREW"), to what purpose I don't know (it was more 
or less the same as Yiddish writing). Sanskrit is also often seen in 
various scripts, I believe.

I don't think it's unusual to find one language written in a script 
generally associated with another, especially if the first language 
doesn't have a well-established script for itself (not all the above are 
examples of that).

~mark


From theppitak at gmail.com  Thu Mar 27 22:49:13 2014
From: theppitak at gmail.com (Theppitak Karoonboonyanan)
Date: Fri, 28 Mar 2014 10:49:13 +0700
Subject: Pali in Thai Script
In-Reply-To: <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
Message-ID: <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>

On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn <chris.fynn at gmail.com> wrote:
> On 28/03/2014, Ed Trager <ed.trager at gmail.com> wrote:
>
>> * Modern phonetically-based Lao lacks some of the traditional letters that
>> are still preserved in Thai and other scripts.
>
> Are there old Lao characters (once) used for writing P??i?

Historically no. But there was once an attempt to devise such characters
by Lao Royal Institute before being dismissed by the communist
revolution later. The writing principle was to use PHINTU in the same
manner as Thai script, and the missing characters were borrowed
from Tham script.

See some sample text here:

  http://ic.pics.livejournal.com/saixelamphao/16569530/7323/7323_original.jpg

( Source: http://saixelamphao.livejournal.com/1326.html )

The upper part is written in Tham script, and the lower part is in the
extended Lao script.

The writing system was in use during 1932-1948. And some
North-Eastern Thai scholars are trying to revive it at present.

The full character chart, demonstrated by a font created by a Thai
scholar (Facebook login is needed, sorry):

http://www.facebook.com/photo.php?fbid=10201049297248857

Regards,
-- 
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/


From chris.fynn at gmail.com  Fri Mar 28 02:09:10 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 13:09:10 +0600
Subject: Pali in Thai Script
In-Reply-To: <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
Message-ID: <CAA_CYcJYn4bcDPf3zJgBT0dVAeVf_577WK44ES_AXkOBVyNOYw@mail.gmail.com>

On 28/03/2014, Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:

> The full character chart, demonstrated by a font created by a Thai
> scholar (Facebook login is needed, sorry):
>
> http://www.facebook.com/photo.php?fbid=10201049297248857

Even after  logging into Facebook I only get the message:
 "This content is currently unavailable"
"The page you requested cannot be displayed at the moment. It may be
temporarily unavailable, the link you clicked on may have expired, or
you may not have permission to view this page."

- C


From chris.fynn at gmail.com  Fri Mar 28 02:29:30 2014
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 28 Mar 2014 13:29:30 +0600
Subject: Pali in Thai Script
In-Reply-To: <5334E966.4030700@kli.org>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <5334DE5A.9030405@kli.org>
 <1226242372.188.1395975570026.JavaMail.www@wwinf1m14>
 <5334E966.4030700@kli.org>
Message-ID: <CAA_CYcKuqkEjL=WXgKQM5Gj03mxxJboig4mXN=-TdJNxvSOaQg@mail.gmail.com>

Here the case is a little different as there is no particular script
associated with P??i. People in different Buddhist countries just use
their own script for writing P??i..

A conversion utility, or simple way of letting users choose the script
in which P??i. is displayed, would be useful so that there would be no
need to type the same texts in each script.

Sanskrit is strongly associated with the Devan?gar?  script - but it
is sometimes written in nearly all of the widely used scripts of India
and some others such as Tibetan and Latin.

- C


From richard.wordingham at ntlworld.com  Fri Mar 28 04:15:47 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 28 Mar 2014 09:15:47 +0000
Subject: Pali in Thai Script
In-Reply-To: <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
Message-ID: <20140328091547.64b97a4f@JRWUBU2>

On Fri, 28 Mar 2014 10:49:13 +0700
Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:

> On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn
> <chris.fynn at gmail.com> wrote:
> > On 28/03/2014, Ed Trager <ed.trager at gmail.com> wrote:
> >
> >> * Modern phonetically-based Lao lacks some of the traditional
> >> letters that are still preserved in Thai and other scripts.
> >
> > Are there old Lao characters (once) used for writing P??i?
> 
> Historically no. But there was once an attempt to devise such
> characters by Lao Royal Institute before being dismissed by the
> communist revolution later. The writing principle was to use PHINTU
> in the same manner as Thai script, and the missing characters were
> borrowed from Tham script.

An older form of the Lao script is called the Thai Noi script.  That
script has many of the characters needed.  It has the characters, to
give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and
Vocalic R.  The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to
their rarity, as with the lack of Vocalic L.

Richard.


From duerst at it.aoyama.ac.jp  Fri Mar 28 05:40:37 2014
From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=)
Date: Fri, 28 Mar 2014 19:40:37 +0900
Subject: Fwd: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala
 got great reception in Sri Lanka)
In-Reply-To: <53266CBF.6060209@it.aoyama.ac.jp>
References: <53266CBF.6060209@it.aoyama.ac.jp>
Message-ID: <533551A5.6070800@it.aoyama.ac.jp>

I got informed today by your IT Dept. that the mail below never went 
out. Resent herewith.    Martin.

-------- Original Message --------
Subject: Updated Japanese Legacy Standard? (was: Re: Romanized Singhala 
got great reception in Sri Lanka)
Date: Mon, 17 Mar 2014 12:32:15 +0900
From: "Martin J. D?rst" <duerst at it.aoyama.ac.jp>

On 2014/03/16 14:36, Philippe Verdy wrote:

> You may still want to promote it at some government or education
> institution, in order to promote it as a national standard, except that
> there's little change it will ever happen when all countries in ISO have
> stopoed working on standardization of new 8-bit encodings (only a few ones
> are maintained; but these are the most complex ones used in China and Japan.
>
> Well in fact only Japan now seens to be actively updating its legacy JIS
> standard; but only with the focus of converging it to use the UCS and solve
> ambiguities or solve some technical problems (e.g. with emojis used by
> mobile phone operators). Even China stopped updating its national standard
> by publishing a final mapping table to/from the full UCS (including for
> characters still not encoded in the UCS): this simplified the work because
> only one standard needs to be maintained instead of 2.

I'm not aware of any activity in Japan regarding the update of legacy
character encodings. Can you tell me what you mean by "actively updating"?

Regards,   Martin.


From duerst at it.aoyama.ac.jp  Fri Mar 28 05:41:55 2014
From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=)
Date: Fri, 28 Mar 2014 19:41:55 +0900
Subject: Fwd: Re: Romanized Singhala got great reception in Sri Lanka
In-Reply-To: <532689FC.7010606@it.aoyama.ac.jp>
References: <532689FC.7010606@it.aoyama.ac.jp>
Message-ID: <533551F3.2090409@it.aoyama.ac.jp>

I got informed today by your IT Dept. that the mail below never went 
out. Resent herewith.    Martin.


-------- Original Message --------
Subject: Re: Romanized Singhala got great reception in Sri Lanka
Date: Mon, 17 Mar 2014 14:37:00 +0900
From: "Martin J. D?rst" <duerst at it.aoyama.ac.jp>

On 2014/03/17 13:16, Jean-Fran?ois Colson wrote:
>
>> As for Japanese (and also for Indic) I have read the warnings in RFC
>> 1815:
>> http://tools.ietf.org/rfc/rfc1815.txt
>>
>>
>
> RFC 1815       Character Sets ISO-10646 and ISO-10646-J-1      July 1995
>
> July 1995? Is that document up-to-date?

No, it's not. Not at all. It was outdated when it was published, and
expresses only the opinions of the author (who was well know for not
liking, and not very well understanding, Unicode).

It's labeled as "Informational", which means it is not in any way part
of an IETF Standard/specification. Even April 1st RFCs are classified as
"Informational".

The "charset" label "ISO-10646-J-1" it defines is listed at
http://www.iana.org/assignments/character-sets/character-sets.xhtml, but
I don't think that there is any major conversion library that supports
this. Similar for what RFC 1815 labels as "ISO-10646", which appears as
"ISO-10646-Unicode-Latin1" in the IANA registry (because simply using
"ISO-10646" for this would be strongly misleading).

Regards,   Martin.


From richard.wordingham at ntlworld.com  Fri Mar 28 14:29:05 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 28 Mar 2014 19:29:05 +0000
Subject: Pali in Thai Script
In-Reply-To: <5334A0EC.6050500@unicode.org>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <5334A0EC.6050500@unicode.org>
Message-ID: <20140328192905.60600cb8@JRWUBU2>

On Thu, 27 Mar 2014 15:06:36 -0700
Rick McGowan <rick at unicode.org> wrote:

> I'm trying to understand the particular scholarly need that will be 
> addressed by this project, and to know why some other existing
> symbols are not, or cannot, be used for this purpose.

I didn't completely answer this question.  There are existing symbols
that would be adequate.

I can see a font-based solution that might not violate the principle of
character identity.  For the five voiced consonants, one could use the
encodings:

/g/ <U+0E01 THAI CHARACTER KO KAI, U+0331 COMBINING MACRON BELOW> (??)
/?/ <U+0E08 THAI CHARACTER CHO CHAN, U+0331 COMBINING MACRON BELOW> (??)
/?/ <U+0E0E THAI CHARACTER DO CHADA> (?)
/d/ <U+0E14 THAI CHARACTER DO DEK> (?)
/b/ <U+0E1A THAI CHARACTER BO BAIMAI> (?)

These would be unambiguous for Pali (in this convention) whatever the
font used, and thus almost immediately ready for general use.  (There
may be problems with the rendering of U+0331 - isn't there a minority
orthography that use it as a diacritic?) A special font could be used
for didactic purposes to add the black and white circles to
emphasise that the normal Thai pronunciation is not to be used.
One could also do that with the conventional letters for Pali
voiced stops, namely ?????, which to me would be a superior
solution.

Richard.


From sittipon at x10studio.com  Fri Mar 28 21:57:50 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Sat, 29 Mar 2014 09:57:50 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140328192905.60600cb8@JRWUBU2>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <5334A0EC.6050500@unicode.org> <20140328192905.60600cb8@JRWUBU2>
Message-ID: <2177D6DB-E215-4F8D-8D01-731054EFD763@x10studio.com>

Thanks for pointing out.
I will bring this to the team's attention today.

Sittipon 


On Mar 29, 2557 BE, at 2:29 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Thu, 27 Mar 2014 15:06:36 -0700
> Rick McGowan <rick at unicode.org> wrote:
> 
>> I'm trying to understand the particular scholarly need that will be 
>> addressed by this project, and to know why some other existing
>> symbols are not, or cannot, be used for this purpose.
> 
> I didn't completely answer this question.  There are existing symbols
> that would be adequate.
> 
> I can see a font-based solution that might not violate the principle of
> character identity.  For the five voiced consonants, one could use the
> encodings:
> 
> /g/ <U+0E01 THAI CHARACTER KO KAI, U+0331 COMBINING MACRON BELOW> (??)
> /?/ <U+0E08 THAI CHARACTER CHO CHAN, U+0331 COMBINING MACRON BELOW> (??)
> /?/ <U+0E0E THAI CHARACTER DO CHADA> (?)
> /d/ <U+0E14 THAI CHARACTER DO DEK> (?)
> /b/ <U+0E1A THAI CHARACTER BO BAIMAI> (?)
> 
> These would be unambiguous for Pali (in this convention) whatever the
> font used, and thus almost immediately ready for general use.  (There
> may be problems with the rendering of U+0331 - isn't there a minority
> orthography that use it as a diacritic?) A special font could be used
> for didactic purposes to add the black and white circles to
> emphasise that the normal Thai pronunciation is not to be used.
> One could also do that with the conventional letters for Pali
> voiced stops, namely ?????, which to me would be a superior
> solution.
> 
> Richard.
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From theppitak at gmail.com  Fri Mar 28 22:59:09 2014
From: theppitak at gmail.com (Theppitak Karoonboonyanan)
Date: Sat, 29 Mar 2014 10:59:09 +0700
Subject: Pali in Thai Script
In-Reply-To: <CAA_CYcJYn4bcDPf3zJgBT0dVAeVf_577WK44ES_AXkOBVyNOYw@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <CAA_CYcJYn4bcDPf3zJgBT0dVAeVf_577WK44ES_AXkOBVyNOYw@mail.gmail.com>
Message-ID: <CACvhRisrN=qTDtsZJgn1HL5Bi4E3iOWeYjPG5uhWk+X02tRgCQ@mail.gmail.com>

On Fri, Mar 28, 2014 at 2:09 PM, Christopher Fynn <chris.fynn at gmail.com> wrote:
> On 28/03/2014, Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:
>
>> The full character chart, demonstrated by a font created by a Thai
>> scholar (Facebook login is needed, sorry):
>>
>> http://www.facebook.com/photo.php?fbid=10201049297248857
>
> Even after  logging into Facebook I only get the message:
>  "This content is currently unavailable"
> "The page you requested cannot be displayed at the moment. It may be
> temporarily unavailable, the link you clicked on may have expired, or
> you may not have permission to view this page."

Sorry again. It seems to be shared privately. And I don't think it's
appropriate to share it here against the author's will, then.

There is a better image here:
  http://saixelamphao.livejournal.com/pics/catalog/493/8465

( Source: http://saixelamphao.livejournal.com/1620.html )

Regards,
-- 
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/


From theppitak at gmail.com  Fri Mar 28 23:10:52 2014
From: theppitak at gmail.com (Theppitak Karoonboonyanan)
Date: Sat, 29 Mar 2014 11:10:52 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140328091547.64b97a4f@JRWUBU2>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <20140328091547.64b97a4f@JRWUBU2>
Message-ID: <CACvhRivLytdMrGyyNQMR5eNYHZV0K0_Tt9qdwGRQ2mz-dK_gWA@mail.gmail.com>

On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham
<richard.wordingham at ntlworld.com> wrote:
> On Fri, 28 Mar 2014 10:49:13 +0700
> Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:
>
>> On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn
>> <chris.fynn at gmail.com> wrote:
>> > On 28/03/2014, Ed Trager <ed.trager at gmail.com> wrote:
>> >
>> >> * Modern phonetically-based Lao lacks some of the traditional
>> >> letters that are still preserved in Thai and other scripts.
>> >
>> > Are there old Lao characters (once) used for writing P??i?
>>
>> Historically no. But there was once an attempt to devise such
>> characters by Lao Royal Institute before being dismissed by the
>> communist revolution later. The writing principle was to use PHINTU
>> in the same manner as Thai script, and the missing characters were
>> borrowed from Tham script.
>
> An older form of the Lao script is called the Thai Noi script.  That
> script has many of the characters needed.  It has the characters, to
> give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
> DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and
> Vocalic R.  The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to
> their rarity, as with the lack of Vocalic L.

I don't think so. From my studies so far, Tai Noi script (aka. Lao Buhan)
writing system was not so different from that of contemporary Lao script.
Some characters are just obsolete.

In fact, I have been drafting a summarized proposal to encode Tai Noi
script here:

  http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html

There is also a project to revive the script in North-Eastern Thailand,
which may urge the need for contemporary usage in computers:

  http://icmrpthailand.org/

The Tai Noi version with web font hack, which should be converted
to Unicode instead if it were supported:

  http://icmrpthailand.org/is

Regards,
-- 
Theppitak Karoonboonyanan
http://linux.thai.net/~thep/


From sittipon at x10studio.com  Sat Mar 29 00:17:12 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Sat, 29 Mar 2014 12:17:12 +0700
Subject: Pali in Thai Script
In-Reply-To: <CACvhRivLytdMrGyyNQMR5eNYHZV0K0_Tt9qdwGRQ2mz-dK_gWA@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <20140328091547.64b97a4f@JRWUBU2>
 <CACvhRivLytdMrGyyNQMR5eNYHZV0K0_Tt9qdwGRQ2mz-dK_gWA@mail.gmail.com>
Message-ID: <AFE7606D-0C67-4633-ADD3-B921F9CD96A1@x10studio.com>

In my opinion (not my team), I think small underline like Richard said would be better for wider Thais audience. Since, Tai Noi is very different from modern Thai script we are using nowadays. 

My aim is to make subtle changes on how we already write Pali in Thai. And, if we have to change the language to cover all the pronunciation we needed. I would recommend changing to the language everyone else is using to studies Pali instead.

Best,
Sittipon

> On Mar 29, 2014, at 11:10 AM, Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:
> 
> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:
>> On Fri, 28 Mar 2014 10:49:13 +0700
>> Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:
>> 
>>> On Fri, Mar 28, 2014 at 4:48 AM, Christopher Fynn
>>> <chris.fynn at gmail.com> wrote:
>>>> On 28/03/2014, Ed Trager <ed.trager at gmail.com> wrote:
>>>> 
>>>>> * Modern phonetically-based Lao lacks some of the traditional
>>>>> letters that are still preserved in Thai and other scripts.
>>>> 
>>>> Are there old Lao characters (once) used for writing P??i?
>>> 
>>> Historically no. But there was once an attempt to devise such
>>> characters by Lao Royal Institute before being dismissed by the
>>> communist revolution later. The writing principle was to use PHINTU
>>> in the same manner as Thai script, and the missing characters were
>>> borrowed from Tham script.
>> 
>> An older form of the Lao script is called the Thai Noi script.  That
>> script has many of the characters needed.  It has the characters, to
>> give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
>> DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA and
>> Vocalic R.  The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be due to
>> their rarity, as with the lack of Vocalic L.
> 
> I don't think so. From my studies so far, Tai Noi script (aka. Lao Buhan)
> writing system was not so different from that of contemporary Lao script.
> Some characters are just obsolete.
> 
> In fact, I have been drafting a summarized proposal to encode Tai Noi
> script here:
> 
>  http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html
> 
> There is also a project to revive the script in North-Eastern Thailand,
> which may urge the need for contemporary usage in computers:
> 
>  http://icmrpthailand.org/
> 
> The Tai Noi version with web font hack, which should be converted
> to Unicode instead if it were supported:
> 
>  http://icmrpthailand.org/is
> 
> Regards,
> -- 
> Theppitak Karoonboonyanan
> http://linux.thai.net/~thep/
> 
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode


From asmusf at ix.netcom.com  Sat Mar 29 06:01:43 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 29 Mar 2014 04:01:43 -0700
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <53343DED.5040304@cs.tut.fi>
References: <CF599AB2.2CEDF%KalvesmakiJ@doaks.org> <53343DED.5040304@cs.tut.fi>
Message-ID: <5336A817.1010205@ix.netcom.com>

On managing some types of spacing between elements in running text:

On 3/27/2014 8:04 AM, Jukka K. Korpela wrote:
> 2014-03-27 15:10, Kalvesmaki, Joel wrote:
>
>> William, try the U+2000..U+200A glyphs under General Punctuation--I 
>> think
>> that's what you're looking for to manage precise widths of blank space.
>
> That range contains some ?fixed-width spaces?, yes. Being 
> ?fixed-width? is rather relative here, though, and many fonts do not 
> contain these characters. Rendering software could of course display 
> them by just leaving suitable spacing, but that?s not common.
>
> The ?fixed-width spaces? are mostly just legacy characters, holdover 
> from old typography. They may have their uses, though, in contexts 
> where they work and other spacing methods don?t (for example, I 
> recently noticed that they seem to be the only way to create a little 
> spacing between an inline equation and normal character in MS Word).
>
They are useful when the object is to create fixed offsets between 
elements in running text. Unless these elements have a special nature 
that is widely recognized, there usually isn't any styling or markup 
available to create the same effect.

As noted ..
> But for the purposes of indenting text lines, I don?t think they are 
> useful. In almost all cases, there are better tools for indentation.
>

.. they are usually not needed for indentation and they are also not 
normally used for justification -- it seems somewhat of an unsettled 
question whether they do or do not partake in expansion / contraction 
based on justification and similar adjustments to the width of the 
variable spaces.

It's the fact that indentation and justification do not need specific 
width for spaces that lead to the (incorrect) statement, oft repeated, 
that they are not needed in digital typography -- which is nonsense, of 
course, but unfortunately, by now, well-entrenched nonsense.
>> And many (most?) software routines do not treat these as part of the 
>> class
>> of spacing characters (\s in regular expressions).
>
> Well, most regexp implementations are very Ascii-oriented: notations 
> like \s, \w, \d, etc. match Ascii characters only.
Which is an entirely different issue.

A./
>
> Yucca
>
>
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>


From richard.wordingham at ntlworld.com  Sat Mar 29 17:35:59 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 29 Mar 2014 22:35:59 +0000
Subject: Unencoded Lao Characters
In-Reply-To: <CACvhRivLytdMrGyyNQMR5eNYHZV0K0_Tt9qdwGRQ2mz-dK_gWA@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <20140328091547.64b97a4f@JRWUBU2>
 <CACvhRivLytdMrGyyNQMR5eNYHZV0K0_Tt9qdwGRQ2mz-dK_gWA@mail.gmail.com>
Message-ID: <20140329223559.0965a007@JRWUBU2>

On Sat, 29 Mar 2014 11:10:52 +0700
Theppitak Karoonboonyanan <theppitak at gmail.com> wrote,
under topic 'Pali in Thai Script':

> On Fri, Mar 28, 2014 at 4:15 PM, Richard Wordingham
> <richard.wordingham at ntlworld.com> wrote:

> > An older form of the Lao script is called the Thai Noi script.  That
> > script has many of the characters needed.  It has the characters, to
> > give them their 'standard' Unicode Indic names, GHA, NYA, TTHA, NNA,
> > DHA, BHA, and even has the Sanskrit-supporting characters SHA, SSA
> > and Vocalic R.  The lack of CHA, JHA, TTA, DDA, DDHA and LLA may be
> > due to their rarity, as with the lack of Vocalic L.
> 
> I don't think so. From my studies so far, Tai Noi script (aka. Lao
> Buhan) writing system was not so different from that of contemporary
> Lao script. Some characters are just obsolete.
> 
> In fact, I have been drafting a summarized proposal to encode Tai Noi
> script here:
> 
>   http://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html

That seems to be based on the analysis that the Tai Noi script is a
form of the Lao script.  In that case, it ought to address GHA, NYA,
TTHA, NNA, DHA and BHA as seen in inscriptions, recorded for example in
the 1979 MA thesis of Thawaj Poonotoke (???? ???????) at
http://www.khamkoo.com/uploads/9/0/0/4/9004485/thai_noi_palaeography.pdf .
The Buddhist Institute 'additions' should also be handled.  There are
several fonts around that make presumptions about their encoding in
Unicode.  I'm not convinced that the old Tai Noi and Buddhist Institute
forms of each of NYA and NNA are the same character - I suspect we may
have four characters here.  The two versions of NYA are particularly
difficult to reconcile.

My though on the subscript consonants are:

1) The Lao block already has two subscript consonants, U+0EBC LAO
SEMIVOWEL SIGN LO and U+0EBD LAO SEMIVOWEL SIGN NYO, though perhaps
the various forms of the latter need to disunified.  How does the
latter's J-shaped glyph kern?

2) If we allow the Lao script to be split between planes, subscript
forms could be accommodated in an 'Archaic Lao' block in the SMP.  This
would have the advantages that:

(a) In UTF-8, a subscript consonant would only take 4 bytes, whereas
using a coeng in the BMP would require 6 bytes, 3 for the coeng and and
3 for the consonant identity.  The memory requirement is 4 bytes for
both schemes in UTF-16.

(b) Distinct subscripts for the same letter can easily be encoded
distinctly.  For example, the Lao letters LO, DO and NO can easily be
taken to have two distinct subscript forms, and in the related Thai
Nithet script (?????????????), formerly used in Northern Thailand, one
can argue for four forms of the cluster HO MO - the ligature HO MO (as
LAO HO MO), and HO plus (i) a purely subscript MO (gc=Mn), (ii)
subscript MO with an ascender (gc=Mc), and (iii) a borrowing of Tai
Tham <SAKOT, MA> (gc=Mn if treated as a single character).

Richard.


From richard.wordingham at ntlworld.com  Sun Mar 30 04:50:40 2014
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 30 Mar 2014 10:50:40 +0100
Subject: Pali in Thai Script
In-Reply-To: <CACvhRisrN=qTDtsZJgn1HL5Bi4E3iOWeYjPG5uhWk+X02tRgCQ@mail.gmail.com>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <CAA_CYcJYn4bcDPf3zJgBT0dVAeVf_577WK44ES_AXkOBVyNOYw@mail.gmail.com>
 <CACvhRisrN=qTDtsZJgn1HL5Bi4E3iOWeYjPG5uhWk+X02tRgCQ@mail.gmail.com>
Message-ID: <20140330105040.6eebbfdd@JRWUBU2>

On Sat, 29 Mar 2014 10:59:09 +0700
Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:

> There is a better image here:
>   http://saixelamphao.livejournal.com/pics/catalog/493/8465
> 
> ( Source: http://saixelamphao.livejournal.com/1620.html )

Several letters in that image of the Buddhis Institute's system are in
the wrong place, not to mention two correctly labelled vargas being in
the wrong order. Properly labelled additions can be found at
http://th.wikipedia.org/wiki/???????? .  There are also additional
characters for the two extra Sanskrit sibilants.

Richard.


From sittipon at x10studio.com  Sun Mar 30 07:43:46 2014
From: sittipon at x10studio.com (Sittipon Simasanti)
Date: Sun, 30 Mar 2014 19:43:46 +0700
Subject: Pali in Thai Script
In-Reply-To: <20140330105040.6eebbfdd@JRWUBU2>
References: <BBF408C0-A1F6-4F89-896D-799F74646B0C@x10studio.com>
 <433790424.7679.1395941885598.JavaMail.www@wwinf1m11>
 <CAA_CYcKEy9dmY=TTHZYhoL476FrUtF7WDHmDkgRkDbY5WkmajQ@mail.gmail.com>
 <CAP6tU+nj=J-oaBDR8WuAhCceDd0yegnZK82O9ibUsjMjs-_y9w@mail.gmail.com>
 <CAA_CYc+sGbYHzgSgnB6MNhLhOLZtgaih4javtTJx+q0K+C5ryw@mail.gmail.com>
 <CACvhRis3Z2W65nH-Rgt-dQfjgvuA3iHxp9NZDN08J+5X19PAHw@mail.gmail.com>
 <CAA_CYcJYn4bcDPf3zJgBT0dVAeVf_577WK44ES_AXkOBVyNOYw@mail.gmail.com>
 <CACvhRisrN=qTDtsZJgn1HL5Bi4E3iOWeYjPG5uhWk+X02tRgCQ@mail.gmail.com>
 <20140330105040.6eebbfdd@JRWUBU2>
Message-ID: <CADLnAj1jgR=t6EPWm5xvvKeQv1vTvSPF2HY5DL41PROUG5mObQ@mail.gmail.com>

I would like to thanks everyone for this lively and interesting discussion.
Our project is still very early, and we are still changing the glyphs.
After the last meeting, I found myself still have a lot of things to catch
up.
But, we are going to go with unicode private area for now. They are
actually more than sufficient than we need at the moment.
I will keep you informed on any further development.

Thanks a lot, everyone!
Sittipon

On Sun, Mar 30, 2014 at 4:50 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Sat, 29 Mar 2014 10:59:09 +0700
> Theppitak Karoonboonyanan <theppitak at gmail.com> wrote:
>
> > There is a better image here:
> >   http://saixelamphao.livejournal.com/pics/catalog/493/8465
> >
> > ( Source: http://saixelamphao.livejournal.com/1620.html )
>
> Several letters in that image of the Buddhis Institute's system are in
> the wrong place, not to mention two correctly labelled vargas being in
> the wrong order. Properly labelled additions can be found at
> http://th.wikipedia.org/wiki/???????? .  There are also additional
> characters for the two extra Sanskrit sibilants.
>
> Richard.
>
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
>


-- 
Sittipon Simasanti
Extend Interactive Co.,Ltd.
668-6880-8490
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140330/6e0242eb/attachment.html>

From jkorpela at cs.tut.fi  Mon Mar 31 09:05:42 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Mon, 31 Mar 2014 17:05:42 +0300
Subject: Does regular Unicode have a character that looks like a space
 to a human yet is not treated as a space by software please?
In-Reply-To: <5336A817.1010205@ix.netcom.com>
References: <CF599AB2.2CEDF%KalvesmakiJ@doaks.org>
 <53343DED.5040304@cs.tut.fi> <5336A817.1010205@ix.netcom.com>
Message-ID: <53397636.6060007@cs.tut.fi>

2014-03-29 13:01, Asmus Freytag wrote:

> On managing some types of spacing between elements in running text:
>
> On 3/27/2014 8:04 AM, Jukka K. Korpela wrote:
[?]
>> The ?fixed-width spaces? are mostly just legacy characters, holdover
>> from old typography. They may have their uses, though, in contexts
>> where they work and other spacing methods don?t (for example, I
>> recently noticed that they seem to be the only way to create a little
>> spacing between an inline equation and normal character in MS Word).
>>
> They are useful when the object is to create fixed offsets between
> elements in running text.

In special cases, I would say. Normally, other tools are used. E.g., 
typesetting programs may have commands with, say, ?thin space? in their 
name, but they don?t really insert THIN SPACE characters but some 
internal representation, and the effect (width of spacing) may be 
settable in the program, possibly with a default that differs from the 
description ?a fifth of an em (or sometimes a sixth)?.

> Unless these elements have a special nature
> that is widely recognized, there usually isn't any styling or markup
> available to create the same effect.

For example, in HTML or XML, you can wrap either of the two elements in 
an inline element and set padding-right or padding-left on it. While 
this may look clumsier than using, &thinsp; or &#x2009; or THIN SPACE 
itself, it?s much more flexible?you can set any amount of spacing. 
Besides, quite often one of the elements is already an element in the 
markup, as in <i>f</i>(0), to take a typical example of a construct that 
really needs special spacing.

In word processors, you would typically select a character and set 
spacing on it in Font settings. This is clumsy, but using styles, it is 
reasonably manageable.

On the other hand, tuning of spacing is rather rare outside professional 
and ambitious typesetting. It?s really one of the things that 
distinguishes quality typesetting. Typesetters that do such things might 
be quite unaware of fixed-width spaces as characters (and might even 
regard it as odd to call spacing things characters).

> It's the fact that indentation and justification do not need specific
> width for spaces that lead to the (incorrect) statement, oft repeated,
> that they are not needed in digital typography -- which is nonsense, of
> course, but unfortunately, by now, well-entrenched nonsense.

I would rather say that the problem is in not understanding the 
importance of spacing, at a more refined level than just SPACE versus no 
space. When the problem has been understood, the solution is usually 
something else than fixed-width spaces.

Yucca


From mpsuzuki at hiroshima-u.ac.jp  Mon Mar 31 19:28:26 2014
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Tue, 01 Apr 2014 09:28:26 +0900
Subject: Call for the experts of U+3013
Message-ID: <533A082A.7030808@hiroshima-u.ac.jp>

Dear all,

Today I submitted a preliminary proposal to standardize
Variation Selectors for U+3013, so-called "GETA" mark.

ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/n4572.pdf

The geta mark was introduced from JIS X 0208:1990 and
GB 2312-1980. When I check the original documents
including the geta mark, some of the representative glyphs
in these regional standards are different from original
geta mark. I investigated theoretically possible visual
shapes of the geta mark, and concluded the registry-based
standardization of the geta mark is a considerable option.

Unfortunately, the officially printed matters including
the geta mark is not popular (I found only a few books
in Japanese national diet library), so I want to hear the
comments from the geta expert for the official proposal.

Regards,
mpsuzuki