From unicode at unicode.org Sat Dec 2 17:49:03 2017 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Sun, 3 Dec 2017 05:19:03 +0530 Subject: \b and Indic word boundaries? Message-ID: Hello. Yesterday I reported https://bugs.python.org/issue32198 but then was pointed to already existing https://bugs.python.org/issue1693050 and friends. From reading these I came to find \b under https://unicode.org/reports/tr18/#Compatibility_Properties. I confess I don't entirely grok all the intricacies. So my question: isn't \b the Unicode-recommended way of identifying full Unicode-aware word boundaries in regexes? If not, what is? -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Mon Dec 4 07:30:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 4 Dec 2017 13:30:22 +0000 Subject: Minimal Implementation of Unicode Collation Algorithm Message-ID: <20171204133022.07571022@JRWUBU2> May a collation algorithm that always compares all strings as equal be a compliant implementation of the Unicode Collation Algorithm (UTS #10)? If not, by which clause is it not compliant? Formally, this algorithm would require that all weights be zero. Would an implementation that supported no characters be compliant? It used to be that for an implementation to be claimed as compliant, it also had to pass a specific conformance test. This requirement has now been abandoned, perhaps because the Default Unicode Collation Element Table (DUCET) is incompatible with the CLDR Collation Algorithm. The compatibility issues are that the DUCET weighting of U+FFFE is incompatible with the CLDR Collation algorithm, and it seems that the ICU implementation will not work if well-formedness condition WF5 is not met.
Meeting WF5 without changing the collation would require about a thousand extra entries in the table - the CLDR root collation just adds the six changes (plus a consequent four entries for FCD closure) desirable for natural language, and accepts the consequent changes for unlikely strings. Richard. From unicode at unicode.org Mon Dec 4 14:48:11 2017 From: unicode at unicode.org (Markus Scherer via Unicode) Date: Mon, 4 Dec 2017 12:48:11 -0800 Subject: Minimal Implementation of Unicode Collation Algorithm In-Reply-To: <20171204133022.07571022@JRWUBU2> References: <20171204133022.07571022@JRWUBU2> Message-ID: On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > May a collation algorithm that always compares all strings as equal be a > compliant implementation of the Unicode Collation Algorithm (UTS #10)? > If not, by which clause is it not compliant? Formally, this algorithm > would require that all weights be zero. > I think so. The algorithm would be equivalent to an implementation of the UCA with a degenerate CET that maps every character to a Completely Ignorable Collation Element. Would an implementation that supported no characters be compliant? > I guess so. I assume that would mean that the CET maps nothing, and that the implementation does implement the implicit weighting of Han characters and unassigned (here: unmapped) code points. It would also have to do NFD first. It used to be that for an implementation to be claimed as compliant, it > also had to pass a specific conformance test. This requirement has now > been abandoned, perhaps because the Default Unicode Collation Element > Table (DUCET) is incompatible with the CLDR Collation Algorithm. > The DUCET is missing some things that are needed by the CLDR Collation Algorithm, but that has nothing to do with UCA compliance. The simple fact is that tailorings are common, and it has to be possible to conform to the algorithm without forbidding tailorings. 
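[Editorial note: Markus's reading can be made concrete. Under a degenerate collation element table in which every character maps to a Completely Ignorable Collation Element (all weights zero), zero weights are omitted from sort keys, so every key is empty and all strings compare equal. A minimal sketch; the table, the key format, and the function names are invented for this example, not taken from any implementation:]

```python
# Sketch of the degenerate UCA implementation under discussion: every
# character maps to a Completely Ignorable Collation Element (all weights
# zero).  Zero weights are dropped during sort-key construction, so every
# sort key is empty and all strings compare equal.
import unicodedata

def collation_elements(ch):
    # Degenerate CET: [.0000.0000.0000] for every character.
    return [(0, 0, 0)]

def sort_key(s):
    s = unicodedata.normalize("NFD", s)   # the algorithm normalizes first
    key = []
    for level in range(3):                # levels L1..L3
        for ch in s:
            for ce in collation_elements(ch):
                if ce[level] != 0:        # drop zero weights
                    key.append(ce[level])
        key.append(0)                     # level separator
    return tuple(key)

# Any two strings compare equal under this table.
assert sort_key("apple") == sort_key("Zebra") == sort_key("")
```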
markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 4 19:02:22 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 5 Dec 2017 01:02:22 +0000 Subject: Minimal Implementation of Unicode Collation Algorithm In-Reply-To: References: <20171204133022.07571022@JRWUBU2> Message-ID: <20171205010222.408a2e96@JRWUBU2> On Mon, 4 Dec 2017 12:48:11 -0800 Markus Scherer via Unicode wrote: > On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > Would an implementation that supported no characters be compliant? > I guess so. I assume that would mean that the CET maps nothing, and > that the implementation does implement the implicit weighting of Han > characters and unassigned (here: unmapped) code points. It would also > have to do NFD first. I am extrapolating from the comment on UTS10-C1 in UTS#10, "In particular, a conformant implementation must be able to compare any two canonical-equivalent strings as being equal, for all Unicode characters supported by that implementation." There is now nothing that forces the implementation to support any Unicode characters! Possibly this results from an attempt to allow an implementation to conform to Version x.y.z of the UCA while supporting normalisation for some other set of characters, or choosing not to support characters with non-zero canonical combining class, which, while not eliminating the need to address canonical equivalence, goes a long way towards doing so. I am not aware of any general requirement that a CET be a tailoring of DUCET or of the CLDR root collation, so the implicit weights would be irrelevant in this case. The implicit weights are part of DUCET. If no characters are supported, performing NFD will be a rather obvious trivial transformation of the null string to itself.
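[Editorial note: for reference, the implicit weights Richard and Markus mention are computed arithmetically from the code point (UTS #10, "Implicit Weights"). A sketch of that computation; the block tests below are deliberately simplified and approximate, since a real implementation must consult the Unified_Ideograph property and the exact block ranges:]

```python
# Implicit weight computation, following UTS #10's "Implicit Weights".
# Simplified: only the core Han block and an approximate extension range
# are special-cased here; everything else falls into the FBC0 bucket that
# also covers unassigned code points.
def implicit_weights(cp):
    if 0x4E00 <= cp <= 0x9FFF:                                 # core CJK block (simplified test)
        base = 0xFB40
    elif 0x3400 <= cp <= 0x4DBF or 0x20000 <= cp <= 0x2EBEF:   # extensions (approximate ranges)
        base = 0xFB80
    else:                                                      # unassigned and other code points
        base = 0xFBC0
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    # Two collation elements: [.AAAA.0020.0002][.BBBB.0000.0000]
    return [(aaaa, 0x20, 0x02), (bbbb, 0, 0)]

# U+4E00 gets primary weights FB40 and CE00.
assert implicit_weights(0x4E00) == [(0xFB40, 0x20, 0x02), (0xCE00, 0, 0)]
```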
> > It used to be that for an implementation to be claimed as compliant, > it > > also had to pass a specific conformance test. This requirement has > > now been abandoned, perhaps because the Default Unicode Collation > > Element Table (DUCET) is incompatible with the CLDR Collation > > Algorithm. > > The DUCET is missing some things that are needed by the CLDR Collation > Algorithm, but that has nothing to do with UCA compliance. An implementation that only implements the CLDR collation algorithm cannot be tailored to support DUCET, because DUCET (at Version 10.0.0) has the ordering U+FFF8 < U+FFFE < U+1004E, which is incompatible with UTS#35 Part 5 Section 1.1.1 - "U+FFFE maps to a CE with a minimal, unique primary weight". Therefore one could only apply the published UCA conformance test if it deliberately avoided strings containing U+FFFE. > The simple fact is that tailorings are common, and it has to be > possible to conform to the algorithm without forbidding tailorings. It's the CLDR collation algorithm that prohibits DUCET. Thankfully, the CLDR root collation can be interpreted to be compatible with the UCA. (Tailorings may be incompatible, or at least, incompatible with the concept of a finite CET.) Richard. From unicode at unicode.org Tue Dec 5 11:44:05 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 18:44:05 +0100 Subject: Armenian Mijaket (Armenian colon) Message-ID: The Armenian script has its own distinctive punctuation (vertsaket) for the standard full stop at the end of a sentence (whose glyph looks very much like the Basic Latin/ASCII colon, though slightly bolder and more slanted, and whose dots are rectangular). It is encoded at U+0589, and is used in traditional texts instead of the "modern" full stop. But Armenian also has its own distinctive punctuation (mijaket) for the introductory colon between two phrases of the same sentence (whose glyph looks very much like the Basic Latin/ASCII full stop).
It is not encoded, and I don't like using the ASCII full stop where it causes confusion. Where is the Armenian distinctive mijaket? Shouldn't it be encoded at U+0588? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 13:28:10 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 20:28:10 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <20171205185925.w52f4rijsld7m6cy@number19> References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: U+2024 is not supported in any fonts I have loaded. A websearch of mijaket gives nothing. U+2024 is used as a "leader dot", and does not match the expected metrics (it is certainly not a mijaket; it should be more like U+0589, i.e. a bold parallelogram, not a thin leader dot). Leader dots are NOT used as real punctuation; they are presentational, for example in a TOC (table of contents), where they are aligned in arbitrarily long rows. The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not normative, and in fact it is wrong in my opinion. The mijaket (Armenian colon) should be encoded (preferably at U+0588 in the Armenian block) as it also has to be distinguished from leader dots in an Armenian TOC, exactly like the vertsaket was distinguished at U+0589. 2017-12-05 19:59 GMT+01:00 S. Gilles : > On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: > > The Armenian script has its own distinctive punctuation (vertsaket) for > the > > standard full stop at end of sentence (whose glyph looks very much like > the > > Basic Latin/ASCII colon, however slightly more bold and slanted and whose > > dots are rectangular). It is encoded at U+0589. And used in traditional > > texts instead of the "modern" full stop.
> > > > But Armenian also has its own distinctive punctuation (mijaket) for the > > introductory colon between two phrases of the same sentence (whose glyph > > looks very much like the Basic Latin/ASCII full stop). It is not encoded > > and I don't like using the ASCII full stop where it causes confusion. > > > > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at > > U+0588? > > Off-list because I generally don't know what I'm talking about, but > grepping NamesList.txt for "mijaket" gives U+2024. If this isn't > what you're looking for, my apologies. > > -- > S. Gilles > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 13:46:22 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 20:46:22 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: Note that "Noto Sans Armenian" does not even map U+2024 (I doubt it is accepted as a real replacement for the missing Armenian mijaket, which plays a role similar to a Latin semicolon or colon), though it does map the hyphen at U+2010. But U+0589 (Armenian "vertsaket", the Armenian full stop that looks like the Latin ":" colon) is mapped.
My opinion is that the one dot leader has only been used by some sources that don't need to render tabular data or TOCs: the sources needing these traditional distinctions are probably religious texts, and clearly they don't even look like what is in the Unicode PDF for the representative glyph. "Noto Sans Armenian" is designed for modern use on displays, and even there we would need a better distinction and better metrics than going with the possible "Noto Sans" mapping of the leader dot at U+2024 (which still does not exist). In fact, leaders are better represented another way than by repeating this character: leaders are essentially parsed in arbitrary lengths like tabulation whitespace, and so the leader dot is not semantically suitable at all as a mijaket. It is just as if we wanted to replace ASCII full stops or colons and semicolons in English by SPACE or TAB: in Armenian this just causes havoc. 2017-12-05 20:28 GMT+01:00 Philippe Verdy : > U+2024 is not supported in any fonts I have loaded. A websearch of mijaket > gives nothing. > U+2024 is used as a "leader dot", and does not match the expected metrics > (it is certainly not a mijaket, it should be more like U+0589, i.e. as a > bold parallelogram, and not a thin leader dot). > > Leader dots are NOT used as real punctuation, they are presentational, for > example in TOC (table of contents), where they are aligned in arbitrarily > long rows. > > The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not > normative and in fact it is wrong in my opinion. > > The mijaket (Armenian colon) should be encoded (preferably at U+0588 in > the Armenian block) as it also has to be distinguished from leader dots in > Armenian TOC, exactly like the vertsaket was distinguished at U+0589. > > > 2017-12-05 19:59 GMT+01:00 S.
Gilles : > >> On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: >> > The Armenian script has its own distinctive punctuation (vertsaket) for >> the >> > standard full stop at end of sentence (whose glyph looks very much like >> the >> > Basic Latin/ASCII colon, however slighly more bold and slanted and whose >> > dots are rectangular). It is encoded at U+0589. And used in traditional >> > texts instead of the "modern" full stop. >> > >> > But Armenian also has its own distinctive puctuation (mijaket) for the >> > introductory colon between two phrases of the same sentence (whose glyph >> > looks very much like the Basic Latin/ASCII full stop). It is not encoded >> > and I don't like using the ASCII full stop where it causes confusion. >> > >> > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at >> > U+0588? >> >> Off-list because I generally don't know what I'm talking about, but >> grepping NamesList.txt for ?mijaket? gives U+2024. If this isn't >> what you're looking for, my apologies. >> >> -- >> S. Gilles >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 14:35:14 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 5 Dec 2017 12:35:14 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: References: <20171205185925.w52f4rijsld7m6cy@number19> Message-ID: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Dec 5 15:08:39 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 5 Dec 2017 22:08:39 +0100 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> Message-ID: In fact I would also remove the suggested misleading (non-normative) note in NamesList.txt about the use of the ONE DOT LEADER. It is just one of the possible fallbacks, but it has the wrong properties for encoding plain text; it is only useful as a rendering fallback, and not even useful for that, because almost no font maps this character, as leader dots are preferably rendered another way, by drawing a dotted line. Some text renderers may use the leader dot only when they need to transform a leader space into a dotted line and need a glyph for that, but note that they will also need to control the spacing and margins, and will probably always put it on the baseline like regular full stops. A better fallback is the middle dot (but with additional thin space around it). Still, for the semantics, we should not have to use such rendering fallbacks for composing plain texts (imagine what we want to enter in a database of texts, or in translation engines that don't know and should not have to worry about fonts, font styles or metrics, when here we need a clear semantic distinction for the mijaket: the colon or semicolon articulating two phrases in the same sentence, or at the end of an introductory sentence followed by one value or a list of Armenian words, itself terminated by an Armenian full stop U+0589). You'll note that on Wikipedia, the ArmSCII table at the top of the page was composed and rendered (in LaTeX) with the middle dot, which is clearly distinguished from the ASCII full stop and the Armenian full stop. You will find no mention there of the ONE DOT LEADER.
This is especially important because today Armenian will be written using either "modern" (ASCII) punctuation (like in English, with colons, semicolons, and full stops) or traditional punctuation. And it cannot be predicted in which context the translated texts will be used (modern/ASCII or traditional), so we have an ambiguity about how to translate and represent colons/semicolons and full stops. The Armenian full stop is clearly encoded. The Armenian [semi]colon is not, and we only have fallbacks. So we need the "mijaket", and U+0588 (unallocated, just before the distinctive U+0589 Armenian full stop) is the best place. Even in the Unicode representative chart, you'll note that the characters are slanted, including the punctuation, and the dots become ovals. Various Armenian texts use square dots (apparently drawn as a small, nearly vertical stroke with a pencil or quill). This will leave the renderers choosing how to render the two Armenian punctuation marks (either traditional or modern) and will preserve the semantics of the text without conflicting with other rendering options (for the leaders in TOCs or tabular data, which may eventually use U+2024 with some rare fonts specific to the rendering engine and its own typographical engine, if it ever needs a font for its needed glyphs; but even in that case these internal fonts will not need to be Unicode-encoded, they will just be a collection of glyphs for the intended rendering effect and styles it wants to support). For now the immediate real need is for fully translating interfaces in applications and allowing them to support either a "modern" style (English/ASCII punctuation) or a "traditional" style. No fallback characters should be encoded in these texts, so that no confusion will arise if one ever uses both the real Armenian full stop (two dots) and a fallback for the distinctive missing mijaket (a single dot, to be distinguished also from leaders and from decimal separators in numbers or abbreviation dots).
The newly encoded mijaket may include a note suggesting the use of the MIDDLE DOT as a preferable fallback. 2017-12-05 21:35 GMT+01:00 Asmus Freytag via Unicode : > On 12/5/2017 11:28 AM, Philippe Verdy via Unicode wrote: > > U+2024 is not supported in any fonts I have loaded. A websearch of mijaket > gives nothing. > U+2024 is used as a "leader dot", and does not match the expected metrics > (it is certainly not a mijaket, it should be more like U+0589, i.e. as a > bold parallelogram, and not a thin leader dot). > > Leader dots are NOT used as real punctuation, they are presentational, for > example in TOC (table of contents), where they are aligned in arbitrarily > long rows. > > The note in http://www.unicode.org/charts/PDF/U2000.pdf is absolutely not > normative and in fact it is wrong in my opinion. > > The mijaket (Armenian colon) should be encoded (preferably at U+0588 in > the Armenian block) as it also has to be distinguished from leader dots in > Armenian TOC, exactly like the vertsaket was distinguished at U+0589. > > > Well, unless someone (you?) writes a proposal to that effect.... > > (I don't know the history of this particular "unification" but on the face > of it would share your concern that unifying something with a very specific > functionality and metrics, leader dots, with ordinary script-specific > punctuation is not helpful - unless it can be shown that this unification > is widely supported in practice. However, if your claim that 2024 is > unsupported is correct, that would strengthen the case for reconsidering > this; however the case would have to be made in a formal proposal first). > > A./ > > > > 2017-12-05 19:59 GMT+01:00 S.
Gilles : > >> On 2017-12-05T18:44:05+0100, Philippe Verdy via Unicode wrote: >> > The Armenian script has its own distinctive punctuation (vertsaket) for >> the >> > standard full stop at end of sentence (whose glyph looks very much like >> the >> > Basic Latin/ASCII colon, however slighly more bold and slanted and whose >> > dots are rectangular). It is encoded at U+0589. And used in traditional >> > texts instead of the "modern" full stop. >> > >> > But Armenian also has its own distinctive puctuation (mijaket) for the >> > introductory colon between two phrases of the same sentence (whose glyph >> > looks very much like the Basic Latin/ASCII full stop). It is not encoded >> > and I don't like using the ASCII full stop where it causes confusion. >> > >> > Where is the Armenian distinctive mijaket? Shouldn't it be encoded at >> > U+0588? >> >> Off-list because I generally don't know what I'm talking about, but >> grepping NamesList.txt for ?mijaket? gives U+2024. If this isn't >> what you're looking for, my apologies. >> >> -- >> S. Gilles >> > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Dec 5 15:32:56 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 5 Dec 2017 13:32:56 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> Message-ID: <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> Asmus, On 12/5/2017 12:35 PM, Asmus Freytag via Unicode wrote: > I don't know the history of this particular "unification" Here are some clues to guide further research on the history. The annotation in question was added to a draft of the NamesList.txt file for Unicode 4.1 on October 7, 2003. The annotation was not yet in the Unicode 4.0 charts, published in April, 2003. That should narrow down the search for everybody. 
I can't find specific mention of this in the UTC minutes from the relevant 2003 window. But I strongly suspect that the catalyst for the change was the discussion that took place regarding PRI #12 re terminal punctuation: http://www.unicode.org/review/pr-12.html That document, at least, does mention "Armenian" and U+2024, although not in the same breath. That PRI was discussed and closed at UTC #96, on August 25, 2003: http://www.unicode.org/L2/L2003/03240.htm I don't find any particular mention of U+2024 in my own notes from that meeting, so I suspect the proximal cause for the change to the annotation for U+2024 on October 7 will have to be dug out of an email archive at some point. --Ken From unicode at unicode.org Tue Dec 5 16:26:51 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 5 Dec 2017 14:26:51 -0800 Subject: Armenian Mijaket (Armenian colon) In-Reply-To: <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> References: <20171205185925.w52f4rijsld7m6cy@number19> <0037e6a8-4af9-fd7d-50b3-a4a378827f0a@ix.netcom.com> <0b5fcb11-b7de-c497-adac-3494656c8fde@att.net> Message-ID: <7bfa3bf5-7c92-7a99-3902-3a7ed9accb59@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Dec 8 12:13:11 2017 From: unicode at unicode.org (Dreiheller, Albrecht via Unicode) Date: Fri, 8 Dec 2017 18:13:11 +0000 Subject: Typo in FAQ-Indic ? Message-ID: <3E10480FE4510343914E4312AB46E74212D180E2@DEFTHW99EH5MSX.ww902.siemens.net> Is this a typo? >> Q: Is the keyboard arrangement in a Unicode system different form that of the regular "TTF" fonts? Maybe it should read "... different FROM that ..." Regards, Albrecht -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Dec 8 16:06:19 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 8 Dec 2017 22:06:19 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues Message-ID: <20171208220619.3eb2fcbe@JRWUBU2> Apart from the likely but unmandated consequence of making editing Indic text more difficult (possibly contrary to the UK's Equality Act 2010), there is another difficulty that will follow directly from the currently proposed expansion of grapheme clusters (https://www.unicode.org/reports/tr29/proposed.html). Unless I am missing something, text boundaries have hitherto been cunningly crafted so that they are not changed by normalisation. Have I missed something, or has there been a change in policy? For extended grapheme clusters, the relevant rules are proposed as: GB9: ? (Extend | ZWJ | Virama) GB9c: (Virama | ZWJ ) ? LinkingConsonant Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). This would lead canonically equivalent text to have strikingly different divisions: (no break) but There are other variations on this theme. In Tai Tham, we have the following conflict: natural order, no break: but normalised, there would be a break: >From reading the text, it seems that it is expected that the presence or absence of a break should be fine-tuned by CLDR language-specific rules. How is this expected to work, e.g. for Saurashtra in Tamil script? (There's no Saurashtra data in Version 32 of CLDR.) Would the root locale now specify the default segmentation rule, rather than UAX#29 plus the Unicode Character Database? Richard. 
From unicode at unicode.org Sat Dec 9 08:28:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 14:28:31 +0000 Subject: Aquaφοβία Message-ID: <20171209142831.772d1f49@JRWUBU2> Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 implies that it might be considered desirable to have a word boundary in 'aquaφοβία' or a grapheme cluster break in a coding such as <006C, U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l̐), which should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the principle of script separation. Why are such breaks desirable? I can understand an argument that these should be tolerated, as an application could have been designed on the basis that script boundaries imply word boundaries (not true for Japanese) and that word boundaries imply grapheme cluster boundaries (not true for Sanskrit, where they don't even imply character boundaries.) There are some who claim that the Laotian consonant placeholder is the letter 'x' rather than the multiplication sign, U+00D7, which does have Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is suggesting that there should be a grapheme cluster boundary between U+00D7 with script=common and a non-spacing Lao vowel any more than there would be with a Lao consonant.) Richard. From unicode at unicode.org Sat Dec 9 09:08:22 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 9 Dec 2017 16:08:22 +0100 Subject: Re: Aquaφοβία In-Reply-To: <20171209142831.772d1f49@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> Message-ID: 2017-12-09 15:28 GMT+01:00 Richard Wordingham via Unicode < unicode at unicode.org>: > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > implies that it might be considered desirable to have a word boundary > in 'aquaφοβία'
or a grapheme cluster break in a coding such as <006C, > U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l̐), which > should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the > principle of script separation. Why are such breaks desirable? > I don't understand why one would encode a DEVANAGARI SIGN in the middle of a Greek word to mean it implies a word boundary in Greek !?! > There are some who > claim that the Laotian consonant place holder is the letter 'x' rather > than the multiplication sign, U+00D7, which does have > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > suggesting that there should be grapheme cluster boundary between > U+00D7 with script=common and a non-spacing Lao vowel any more than > there would be with a Lao consonant.) > Here again the multiplication sign has nothing to do with an Indic consonant. Maybe it has been used like this in some texts, but this looks more like a tweak. If one needs a consonant holder, propose encoding an "empty" letter (as in Hangul or in Arabic), possibly with variant forms (e.g. changing between a circle, dotted circle, cross, or horizontal joiner on the hanging baseline for Devanagari and similar scripts). The usual base letter placeholder for combining diacritics is a whitespace (preferably NBSP, not SPACE) or the dotted circle symbol, but not a mathematical symbol, which is also used within math formulas with variable names using common letters or even words. The multiplication sign used in the UTS standard was chosen because it normally does not occur within words, and only for defining the breaking rules (to indicate that NO break is allowed here, i.e.
the opposite of what you describe): it is notational only and is clearly not meant to combine with what follows: if you encode the multiplication sign then an Indic diacritic, we expect to see the separate multiplication sign (with break opportunities on both sides) then a dotted circle glyph used for defective grapheme clusters to hold the diacritic. So for me Indic_syllabic_category=Consonant_Placeholder is wrong: for such use of the cross, an Indic (or generic) consonant placeholder should better be encoded and used, and that property may be added to it and removed from the multiplication sign. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Dec 9 09:16:44 2017 From: unicode at unicode.org (Mark Davis ☕️ via Unicode) Date: Sat, 9 Dec 2017 16:16:44 +0100 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171208220619.3eb2fcbe@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: 1. You make a good point about the GB9c. It should probably instead be something like: GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant Extend is broader than necessary, and there are a few items that have ccc!=0 but not gcb=extend. But all of those look to be degenerate cases. https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory Mark On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Apart from the likely but unmandated consequence of making editing > Indic text more difficult (possibly contrary to the UK's Equality Act > 2010), there is another difficulty that will follow directly from the > currently proposed expansion of grapheme clusters > (https://www.unicode.org/reports/tr29/proposed.html).
> > Unless I am missing something, text boundaries have hitherto been > cunningly crafted so that they are not changed by normalisation. > Have I missed something, or has there been a change in policy? > > For extended grapheme clusters, the relevant rules are proposed as: > > GB9: × (Extend | ZWJ | Virama) > > GB9c: (Virama | ZWJ ) × LinkingConsonant > > Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). > This would lead canonically equivalent text to have strikingly > different divisions: > > (no break) > > but > > > > There are other variations on this theme. In Tai Tham, we have the > following conflict: > > natural order, no break: > > > > but normalised, there would be a break: > > > > From reading the text, it seems that it is expected that the presence > or absence of a break should be fine-tuned by CLDR language-specific > rules. How is this expected to work, e.g. for Saurashtra in Tamil > script? (There's no Saurashtra data in Version 32 of CLDR.) Would the > root locale now specify the default segmentation rule, rather than > UAX#29 plus the Unicode Character Database? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sat Dec 9 09:31:06 2017 From: unicode at unicode.org (Mark Davis ☕️ via Unicode) Date: Sat, 9 Dec 2017 16:31:06 +0100 Subject: Re: Aquaφοβία In-Reply-To: <20171209142831.772d1f49@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> Message-ID: Some people have been confused by the previous wording, and thought that it wouldn't be legitimate to break on script boundaries. So we wanted to make it clear that that was possible, since: 1. Many implementations of rendering break text into script-runs before further processing, and 2.
There are certainly cases where users' expectations are better met with breaks on script boundaries* We thus wanted to make it clear to people that it *is* a legitimate customization to break on script boundaries. * Clearly such an approach can't be hard-nosed: an implementation would need at the very least to handle Common and Inherited specially: not impose a boundary *because of script* where the SCX value is one of those, either before or after a break point. Any suggestions for clarifying language are appreciated. Mark On Sat, Dec 9, 2017 at 3:28 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > implies that it might be considered desirable to have a word boundary > in 'aquaφοβία' or a grapheme cluster break in a coding such as <006C, > U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l?), which > should be <006C, U+0310 COMBINING CANDRABINDU> in accordance with the > principle of script separation. Why are such breaks desirable? > > I can understand an argument that these should be tolerated, as an > application could have been designed on the basis that script > boundaries imply word boundaries (not true for Japanese) and that word > boundaries imply grapheme cluster boundaries (not true for Sanskrit, > where they don't even imply character boundaries.) There are some who > claim that the Laotian consonant place holder is the letter 'x' rather > than the multiplication sign, U+00D7, which does have > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > suggesting that there should be a grapheme cluster boundary between > U+00D7 with script=common and a non-spacing Lao vowel any more than > there would be with a Lao consonant.) > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Sat Dec 9 10:22:47 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 16:22:47 +0000 Subject: =?UTF-8?B?QXF1Yc+Gzr/Oss6vzrE=?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> Message-ID: <20171209162247.58c60e3c@JRWUBU2> On Sat, 9 Dec 2017 16:08:22 +0100 Philippe Verdy wrote: > 2017-12-09 15:28 GMT+01:00 Richard Wordingham via Unicode < > unicode at unicode.org>: > > > Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0 > > implies that it might be considered desirable to have a word > > boundary in 'aquaφοβία' or a grapheme cluster break in a coding > > such as <006C, U+0901 DEVANAGARI SIGN CANDRABINDU> for el > > candrabindu (l?), which should be <006C, U+0310 COMBINING > > CANDRABINDU> in accordance with the principle of script > > separation. Why are such breaks desirable? > > > > I don't understand why one would encode a DEVANAGARI SIGN in the > middle of a Greek word to mean it implies a word boundary in Greek!?! The two examples given are "aquaφοβία" and "A?". The first switches from Latin to Greek and the second is a Latin letter with a Devanagari mark. However, there is a pre-Unicode tradition of using el with candrabindu when writing Sanskrit in the Roman alphabet, which is why there is U+0310. > > There are some who > > claim that the Laotian consonant place holder is the letter 'x' > > rather than the multiplication sign, U+00D7, which does have > > Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is > > suggesting that there should be a grapheme cluster boundary between > > U+00D7 with script=common and a non-spacing Lao vowel any more than > > there would be with a Lao consonant.) > > > > Here again the multiplication sign has nothing to do with an Indic > consonant. Maybe it has been used like this in some texts but this > looks more like a tweak. 
Whatever its origin, it seems well established in Laos, and I've seen it used for the Tai Tham script as well as for the Lao script. Try searching for images of Lao vowels in French. Googling in English found plenty of examples, and the teaching book shown at http://www.bigbrothermouse.com/books/antknife16size-book.html supports the case nicely. I've also seen it used for Khmer, but not to the extent that I can argue that it is well-established in Cambodia. The Khmer example was produced using a typewriter and apparently a felt-tipped pen, so unsurprisingly the vowel bearer was clearly a typewritten letter 'x'. > If one needs a consonant holder, propose to > encode an "empty" letter (like in Hangul or in Arabic), possibly with > variant forms (e.g. changing between a circle, dotted circle, cross, > or horizontal joiner on the hanging baseline for Devanagari and > similar scripts). Propose a disunification if you like. The competing tradition is to use LAO LETTER KO, and a Lao-English dictionary from Thailand uses a grey LAO LETTER O, following the Thai tradition of using the Thai letter for /?/, which serves as the 'empty' letter for Pali and Sanskrit. Remember that a proposal for an invisible letter for Indic was rejected. > The usual base letter placeholder for combining diacritics is usually > a whitespace (preferably NBSP, not SPACE) or the dotted circle > symbol, but not a mathematical symbol which is used also within math > formulas with variable names using common letters or even words. > The multiplication sign used in the UTS standard was chosen because it > normally does not occur within words,... and has nothing to do with the usage of U+00D7 as a consonant placeholder. Richard. 
From unicode at unicode.org Sat Dec 9 14:30:17 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 9 Dec 2017 20:30:17 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: <20171209203017.77dbcbf9@JRWUBU2> On Sat, 9 Dec 2017 16:16:44 +0100 Mark Davis ?? via Unicode wrote: > 1. You make a good point about the GB9c. It should probably instead be > something like: > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant > > > Extend is broader than necessary, and there are a few items that > have ccc!=0 but not gcb=extend. But all of those look to be > degenerate cases. Something *like*. Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I believe these both prevent a preceding candrakkala from extending an akshara - see TUS Section 12.9 about Table 12-33. I think Extend will have to be split between starters and non-starters. I believe there is a problem with the first two examples in Table 12-33. If one suffixed <U+0D3E MALAYALAM VOWEL SIGN AA> to the first two examples, yielding *??????? and *????????, one would have three Malayalam aksharas, not two extended grapheme clusters as the proposed rules would say. This is different to Tai Tham, where there would indeed just be two aksharas in each word, albeit odd-looking - ??????? and ????????. Who's checking the impact of these changes on Malayalam? Richard. From unicode at unicode.org Sat Dec 9 14:56:19 2017 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Sat, 9 Dec 2017 20:56:19 +0000 Subject: =?utf-8?B?UkU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: <20171209162247.58c60e3c@JRWUBU2> References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: There exist several Judeo-Arabic texts, Arabic written in Hebrew script with Arabic vowels and other marks. One well-known example is The Guide to the Perplexed. 
See a modern transcript at https://he.wikisource.org/wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7%99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_MANUSCRIPTS000043324-1#|FL36876376 Best Regards, Jonathan Rosenne From unicode at unicode.org Sun Dec 10 23:14:18 2017 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Sun, 10 Dec 2017 21:14:18 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant You can also explicitly request ligatureification with a ZWJ, so perhaps this rule should be something like (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant -Manish On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ?? via Unicode < unicode at unicode.org> wrote: > 1. You make a good point about the GB9c. It should probably instead be > something like: > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > Extend is a broader than necessary, and there are a few items that have > ccc!=0 but not gcb=extend. But all of those look to be degenerate cases. > > https://unicode.org/cldr/utility/list-unicodeset.jsp?a= > [\p{ccc!=0}-\p{gcb=extend}]&g=ccc+indicsyllabiccategory > > > > > Mark > > On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > >> Apart from the likely but unmandated consequence of making editing >> Indic text more difficult (possibly contrary to the UK's Equality Act >> 2010), there is another difficulty that will follow directly from the >> currently proposed expansion of grapheme clusters >> (https://www.unicode.org/reports/tr29/proposed.html). >> >> Unless I am missing something, text boundaries have hitherto been >> cunningly crafted so that they are not changed by normalisation. 
>> Have I missed something, or has there been a change in policy? >> >> For extended grapheme clusters, the relevant rules are proposed as: >> >> GB9: ? (Extend | ZWJ | Virama) >> >> GB9c: (Virama | ZWJ ) ? LinkingConsonant >> >> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9). >> This would lead canonically equivalent text to have strikingly >> different divisions: >> >> (no break) >> >> but >> >> >> >> There are other variations on this theme. In Tai Tham, we have the >> following conflict: >> >> natural order, no break: >> >> >> >> but normalised, there would be a break: >> >> >> >> From reading the text, it seems that it is expected that the presence >> or absence of a break should be fine-tuned by CLDR language-specific >> rules. How is this expected to work, e.g. for Saurashtra in Tamil >> script? (There's no Saurashtra data in Version 32 of CLDR.) Would the >> root locale now specify the default segmentation rule, rather than >> UAX#29 plus the Unicode Character Database? >> >> Richard. >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 01:59:20 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 11 Dec 2017 08:59:20 +0100 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171209203017.77dbcbf9@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: The proposed rules do not distinguish the different visual forms that a sequence of characters surrounding a virama can have, such as 1. an explicit virama, or 2. a half-form is visible, or 3. a ligature is created. That is following the requested structure in http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. So with these rules a ZWNJ (see Figure 12-3. 
Preventing Conjunct Forms in Devanagari ) doesn't break a GC, nor do instances where a particular script always shows an explicit virama between two particular consonants. All the lines on Figure 12-7. Consonant Forms in Devanagari and Oriya having a virama would have single GCs (that is, all but the first line). [That, after correcting the rules as per Manish Goregaokar's feedback, thanks!] The examples in "Annexure B" of 17200-text-seg-rec.pdf clearly include #2 and #3, but don't have any examples of #1 (as far as I can tell from a quick scan). It would be very useful to have explicit examples that included #1, and included scripts other than Devanagari (+swaran, others). While the online tool at http://unicode.org/cldr/utility/breaks.jsp can't yet be used until the Unicode 11 UCD is further along, I have an implementation of the new rules such that I can take any particular list of words and generate the breaks. So if someone can supply examples from different scripts or with different combinations of virama, zwj, zwnj, etc..... I can push out the result to this list. And yes, we do need review of these for Malayalam (+cibu, others). If there are scripts for which the rules really don't work (or need more research before #29 is finalized in May), it is fairly straightforward to restrict the rule changes by modifying http://www.unicode.org/reports/tr29/proposed.html#Virama to either exclude particular scripts or include only particular scripts. Mark On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Sat, 9 Dec 2017 16:16:44 +0100 > Mark Davis ?? via Unicode wrote: > > > 1. You make a good point about the GB9c. It should probably instead be > > something like: > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > > > > Extend is a broader than necessary, and there are a few items that > > have ccc!=0 but not gcb=extend. But all of those look to be > > degenerate cases. > > Something *like*. 
> > Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I believe > these both prevent a preceding candrakkala from extending an akshara - > see TUS Section 12.9 about Table 12-33. I think Extend will have to be > split between starters and non-starters. > > I believe there is a problem with the first two examples in Table > 12-33. If one suffixed <U+0D3E MALAYALAM VOWEL SIGN AA> to the first two examples, yielding *??????? and > *????????, one would have three Malayalam aksharas, not two extended > grapheme clusters as the proposed rules would say. This is different to > Tai Tham, where there would indeed just be two aksharas in each word, > albeit odd-looking - ??????? and ????????. Who's checking the impact of > these changes on Malayalam? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 04:16:31 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 10:16:31 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> Message-ID: <20171211101631.44155a27@JRWUBU2> On Sun, 10 Dec 2017 21:14:18 -0800 Manish Goregaokar via Unicode wrote: > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant > > You can also explicitly request ligatureification with a ZWJ, so > perhaps this rule should be something like > > (Virama ZWJ? | ZWJ) × Extend* LinkingConsonant > > -Manish > > On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ?? via Unicode < > unicode at unicode.org> wrote: > > > 1. You make a good point about the GB9c. It should probably instead > > be something like: > > > > GB9c: (Virama | ZWJ ) × Extend* LinkingConsonant This change is unnecessary. If we start from Draft 1 where there are: GB9: × (Extend | ZWJ | Virama) GB9c: (Virama | ZWJ ) × 
LinkingConsonant If the classes used in the rules are to be disjoint, we then have to split Extend into something like ViramaExtend and OtherExtend to allow normalised (NFC/NFD) text, at which point we may as well continue to have rules that work without any normalisation. Informally, ViramaExtend = Extend and ccc ≠ 0. OtherExtend = Extend and ccc = 0. (We might need to put additional characters in ViramaExtend.) This gives us rules: GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama) GB9c': (Virama | ZWJ ) ViramaExtend* × LinkingConsonant So, for a sequence , GB9' gives us virama × ZWJ × nukta LinkingConsonant and GB9c' gives us virama × ZWJ × nukta × LinkingConsonant --- In Rule GB9c, what examples justify including ZWJ? Are they just the C1 half-forms? My knowledge suggests that GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant might be more appropriate. Richard. From unicode at unicode.org Mon Dec 11 05:56:51 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 11:56:51 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171211115651.58ee7ad9@JRWUBU2> On Mon, 11 Dec 2017 08:59:20 +0100 Mark Davis ?? via Unicode wrote: > The proposed rules do not distinguish the different visual forms that > a sequence of characters surrounding a virama can have, such as > > 1. an explicit virama, or > 2. a half-form is visible, or > 3. a ligature is created. Do you mean 'visible virama' by an 'explicit virama'? In the context of the Indic syllabic category of virama (which is what I think of as the Unicode virama), I would expect 'explicit virama' to refer to the sequence . (In several scripts, this is encoded as a separate character, and usually classified as a 'pure killer'.) > That is following the requested structure in > http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. 
> > So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct > Forms in Devanagari > ) > doesn't break a GC, nor do instances where a particular script always > shows an explicit virama between two particular consonants. Actually, I don't see ZWJ or ZWNJ in this document. A literal reading of the document would see a syllable break after an explicit half-form! > All the > lines on Figure 12-7. Consonant Forms in Devanagari and Oriya > > having a virama would have single GCs (that is, all but the first > line). [That, after correcting the rules as per Manish Goregaokar's > feedback, thanks!] That looks like a change of intent. For NFD text in Indian Indic blocks plus control characters, in Version 11.0 Draft 1, ZWNJ does stop a gcb virama from including the next consonant in an extended grapheme cluster. > The examples in "Annexure B" of 17200-text-seg-rec.pdf > clearly > include #2 and #3, but don't have any examples of #1 (as far as I can > tell from a quick scan). It would be very useful to have explicit > examples that included #1, and included scripts other than Devanagari > (+swaran, others). There aren't any examples of explicitly encoded half-forms (C1 or C2) or explicitly encoded viramas, either. It would be good to have examples of visible viramas in conjunction with preposed vowels, such as U+093F DEVANAGARI VOWEL SIGN I. From Paul Nelson's remarks many years ago, I gather there are language-dependent variations in their placement when the halant appears. A bit of Sanskrit would be nice to see as well. Hindi and Sanskrit have different preferred shapes for several consonant clusters. Some Tamil script Sanskrit shlokas would be good, as well. Richard. 
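The normalisation hazard behind the Extend* amendment to GB9c - nukta (ccc=7) sorting before virama (ccc=9) under canonical reordering - can be demonstrated with a few lines of Python. This is a minimal illustrative sketch using only the standard library, with Devanagari KA as a convenient base consonant:

```python
import unicodedata

# Devanagari KA, SIGN VIRAMA (ccc=9) and SIGN NUKTA (ccc=7).
KA, VIRAMA, NUKTA = "\u0915", "\u094D", "\u093C"

assert unicodedata.combining(VIRAMA) == 9
assert unicodedata.combining(NUKTA) == 7

# Typed order: <KA, VIRAMA, NUKTA>.
typed = KA + VIRAMA + NUKTA

# Canonical reordering (applied by NFD, and by NFC before recomposition)
# sorts adjacent non-starters by combining class, so the nukta moves in
# front of the virama.
reordered = unicodedata.normalize("NFD", typed)
assert reordered == KA + NUKTA + VIRAMA
```

A GB9c that only inspects the character immediately before a LinkingConsonant therefore sees a virama in one canonically equivalent spelling and a nukta in the other; letting the rule skip intervening marks (Virama Extend* × LinkingConsonant) makes the two spellings segment alike.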
From unicode at unicode.org Mon Dec 11 10:07:05 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Mon, 11 Dec 2017 16:07:05 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171211160705.78828972@JRWUBU2> On Mon, 11 Dec 2017 08:59:20 +0100 Mark Davis ?? via Unicode wrote: > The proposed rules do not distinguish the different visual forms that > a sequence of characters surrounding a virama can have, such as > > 1. an explicit virama, or > 2. a half-form is visible, or > 3. a ligature is created. > > That is following the requested structure in > http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf. > > So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct > Forms in Devanagari > ) > doesn't break a GC, nor do instances where a particular script always > shows an explicit virama between two particular consonants. All the > lines on Figure 12-7. Consonant Forms in Devanagari and Oriya > > having a virama would have single GCs (that is, all but the first > line). [That, after correcting the rules as per Manish Goregaokar's > feedback, thanks!] > > The examples in "Annexure B" of 17200-text-seg-rec.pdf > clearly > include #2 and #3, but don't have any examples of #1 (as far as I can > tell from a quick scan). It would be very useful to have explicit > examples that included #1, and included scripts other than Devanagari > (+swaran, others). While > the online tool at http://unicode.org/cldr/utility/breaks.jsp can't > yet be used until the Unicode 11 UCD is further along, I have an > implementation of the new rules such that I can take any particular > list of words and generate the breaks. So if someone can supply > examples from different scripts or with different combinations of > virama, zwj, zwnj, etc..... I can push out the result to this list. 
Tai Tham oddities, which could cause issues with advanced typography: ??????? (Currently C-VN-C-VH-C, becoming C-VN-C-VHC) ????? (Currently C-CH-C-V, becoming C-CHC-V) More obvious versions of the above, with consonants other than U+1A36 TAI THAM LETTER NA: ???? (Currently C-VH-C, becoming C-VHC) ????? (Currently CMS-C-V, becoming CMSC-V) A clear case for tailoring is Pali ????? (CM-CVV, but in Laos and in much Northern Thai usage, U+1A58 TAI THAM SIGN MAI KANG LAI merits gcb=prepend. Northeastern Thailand has the same style as Laos, so pi_TH would be far too vague as a locale.) Compare with Myanmar script ??????? (currently C-CHH-CVV, becoming C-CHHCVV), with a pure killer followed by an invisible stacker. ??????? (currently CVV-CMH-C, becoming CVV-CHHC) will be a case of adjacent pure killer and invisible stacker that commute (to use the terminology of traces). The more typical commutation problem from Tai Tham is exemplified by ????? (currently CVTH-C, becoming CVTHC), where the tone mark and invisible stacker commute. I'd like to add the example of Northern Thai Tai Tham ????????? /n?ai/ 'to ache all over'. At present that akshara is split into three grapheme clusters, composed of 2, 6 and 1 characters. (Thai teaching splits it into four logically contiguous groups of 3, 3, 1 and 2 characters for onset, vowel, tone and final consonant. I find ??? in native abecedaries, and the other three all have names, namely mai kuea, mai yak and hang ya.) When the change goes through, this will be just one extended grapheme cluster of nine characters. Moving back to India, I suggest the Tamil example from https://github.com/w3c/ilreq/issues/31#issuecomment-349589752, namely ??????????? (yāvaṟṟaiyum), which currently has an extended grapheme cluster for each consonant. At a minimum, we need the Malayalam examples from the TUS. Finally, I would recommend the Nepali example from L2/11-370, ???????????, that I brought to the UTC's attention in L2/17-122. 
I hope someone else can deal with the other Devanagari issues. (Yep, even Devanagari needs more research!) > > And yes, we do need review of these for Malayalam (+cibu, others). > > If there are scripts for which the rules really don't work (or need > more research before #29 is finalized in May), it is fairly > straightforward to restrict the rule changes by modifying > http://www.unicode.org/reports/tr29/proposed.html#Virama to either > exclude particular scripts or include only particular scripts. > > Mark > > On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode < > unicode at unicode.org> wrote: > > > On Sat, 9 Dec 2017 16:16:44 +0100 > > Mark Davis ?? via Unicode wrote: > > > > > 1. You make a good point about the GB9c. It should probably > > > instead be something like: > > > > > > GB9c: (Virama | ZWJ ) ? Extend* LinkingConsonant > > > > > > > > > Extend is a broader than necessary, and there are a few items that > > > have ccc!=0 but not gcb=extend. But all of those look to be > > > degenerate cases. > > > > Something *like*. > > > > Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA. I > > believe these both prevent a preceding candrakkala from extending > > an akshara - see TUS Section 12.9 about Table 12-33. I think > > Extend will have to be split between starters and non-starters. > > > > I believe there is a problem with the first two examples in Table > > 12-33. If one suffixed > MALAYALAM VOWEL SIGN AA> to the first two examples, yielding > > *??????? and *????????, one would have three Malayalam aksharas, > > not two extended grapheme clusters as the proposed rules would say. > > This is different to Tai Tham, where there would indeed just be two > > aksharas in each word, albit odd-looking - ??????? and ????????. > > Who's checking the impact of these changes on Malayalam? > > > > Richard. 
> > > > From unicode at unicode.org Mon Dec 11 13:07:18 2017 From: unicode at unicode.org (Roozbeh Pournader via Unicode) Date: Mon, 11 Dec 2017 11:07:18 -0800 Subject: =?UTF-8?B?UmU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: Jonathan, I've been trying to gather a list of the Arabic marks that actually happen in Hebrew for a while now, but don't have sources. I want to add them to ScriptExtensions data in Unicode. Do you know of a source that lists them? On Sat, Dec 9, 2017 at 12:56 PM, Jonathan Rosenne via Unicode < unicode at unicode.org> wrote: > There exist several Judeo-Arabic texts, Arabic written in Hebrew script > with Arabic vowels and other marks. One well known is The Guide to the > Perplexed. > > See a modern transcript at https://he.wikisource.org/ > wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7% > 99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. > > A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/ > pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_ > MANUSCRIPTS000043324-1#|FL36876376 > > Best Regards, > > Jonathan Rosenne > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 16:13:48 2017 From: unicode at unicode.org (Jonathan Rosenne via Unicode) Date: Mon, 11 Dec 2017 22:13:48 +0000 Subject: =?utf-8?B?UkU6IEFxdWHPhs6/zrLOr86x?= In-Reply-To: References: <20171209142831.772d1f49@JRWUBU2> <20171209162247.58c60e3c@JRWUBU2> Message-ID: Roozbeh, You could look at the second link, but I am not at all sure they are new characters. One can easily see the three vowels and the shadda, and the i'jam dot which in Arabic is considered to be a part of the letter. I think that browsing the NLI will get you further manuscripts. 
And of course there is https://he.wikipedia.org/wiki/%D7%A2%D7%A8%D7%91%D7%99%D7%AA_%D7%99%D7%94%D7%95%D7%93%D7%99%D7%AA Best Regards, Jonathan Rosenne From: roozbeh at google.com [mailto:roozbeh at google.com] On Behalf Of Roozbeh Pournader Sent: Monday, December 11, 2017 9:07 PM To: Jonathan Rosenne Cc: unicode at unicode.org Subject: Re: Aqua????? Jonathan, I've been trying to gather a list of the Arabic marks that actually happen in Hebrew for a while now, but don't have sources. I want to add them to ScriptExtensions data in Unicode. Do you know of a source that lists them? On Sat, Dec 9, 2017 at 12:56 PM, Jonathan Rosenne via Unicode > wrote: There exist several Judeo-Arabic texts, Arabic written in Hebrew script with Arabic vowels and other marks. One well known is The Guide to the Perplexed. See a modern transcript at https://he.wikisource.org/wiki/%D7%9E%D7%95%D7%A8%D7%94_%D7%A0%D7%91%D7%95%D7%9B%D7%99%D7%9D_(%D7%9E%D7%A7%D7%95%D7%A8)/%D7%9E%D7%91%D7%95%D7%90. A manuscript: http://web.nli.org.il/sites/NLI/Hebrew/digitallibrary/pages/viewer.aspx?presentorid=MANUSCRIPTS&docid=PNX_MANUSCRIPTS000043324-1#|FL36876376 Best Regards, Jonathan Rosenne -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 11 19:25:04 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 12 Dec 2017 01:25:04 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171212012504.4fbe9d10@JRWUBU2> On Mon, 11 Dec 2017 21:45:23 +0000 Cibu Johny (????) wrote: > I am assuming the purpose of the grapheme cluster definition is to be > used line spacing, vertical writing or cursor movement. Without > defining the purpose, it is hard for me to say if a ruleset is valid > or not. That is a very fair point. 
Take the example of Thai, an Indic script which isn't affected by the proposal. There, the spacing vowel signs, whether before or after, may undergo greater separation when text is stretched to fill a space. I've seen great separation on hoardings. The spacing vowel signs are given gc=Lo. Vertical writing examples are fairly rare, but I've seen 'Yamaha' written vertically in three horizontal stretches - ?? ?? ??. Also, 'video' may be written vertically in three horizontal stretches, as V D O or as ?? ?? ??. I'm not absolutely sure I've seen the latter in Thai script, but Glenn Slayden reports it at http://www.thai-language.com/phpbb/viewtopic.php?f=11&t=2568&start=0. The striking thing is that four of these syllables have spacing vowels, which would be written on their own in writing stretched horizontally, but associate with the consonant in vertical writing. I haven't checked on the software-free behaviour of U+0E33 THAI CHARACTER SARA AM, which is historically a combination of a mark above and a mark to the right. The Royal Institute Dictionary of 1999 resolves it into NIKHAHIT and SARA AA for what is a very slight horizontal spacing (e.g. the entry for ??????), but I have seen the NIKHAHIT component still attached to the SARA AA component. However, I don't know how much control the RID had over the typesetting of the dictionary. I think making the proposed change and still saying that cursor motion should follow the extended grapheme cluster boundaries is contrary to the Equality Act 2010. It would be knowingly making text editing harder for the users of most Indic scripts. Those writing a Tai language in the Tai Tham script would be hit hardest, even if one mapped compound vowels to simple key stroke sequences. > Assuming that purpose driven definition, we probably need > language specific definitions - a pan-indic algorithm may not work. There is the intermediate level of script-specific definitions. 
We already have them - following spacing marks are generally excluded from the grapheme clusters in the Burmic scripts. > For instance, the proposed ruleset, may not hold good for Tamil. For > example, see the title in the following image: ??????? broken as > [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed > algorithm it would be: [ta-u, ka-virama-lla, ka-virama] > > [image: image.png] > http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg Thank you for the example. I think the rule for the Tamil script should be that pulli attaches a following consonant to its grapheme cluster only in the case of the sequences ??? and ????, but as I typed the latter, I was surprised to see the sequence ??? adopt a conjunct shape, so I don't know whether I'm seeing variation or a font error. > Malayalam could be a similar story. In case of Malayalam, it can be > font specific because of the existence of traditional and reformed > writing styles. A conjunct might be a ligature in traditional; and it > might get displayed with explicit virama in the reformed style. For > example see the poster with word ??????? broken as [u, sa-virama, > ta-aa, da-virama] - as it is written in the reformed style. As per > the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. > These breaks would be used by the traditional style of writing. It seems that the caveats of UAX#29 have been forgotten - "So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful". The big problem is that virama leaves too much up to the font. Richard. 
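The effect of the draft rules on such sequences can be made concrete with a toy segmenter. The sketch below hard-codes a classification for a handful of Devanagari characters and implements only GB9 and the amended GB9c (with break-everywhere-else as the fallback); it is emphatically not a conformant UAX #29 implementation, just an illustration of how the virama rule links clusters:

```python
# Toy illustration of the draft grapheme-cluster rules under discussion.
# Property assignments are hard-coded for a few Devanagari characters;
# a real implementation needs the full UCD data and the remaining rules.
VIRAMA = {"\u094D"}                # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"
EXTEND = {"\u093C"}                # NUKTA stands in for the Extend class
CONSONANT = {"\u0915", "\u0932"}   # KA, LA stand in for LinkingConsonant

def no_break_before(s, i):
    """True if the draft rules forbid a break before s[i]."""
    # GB9: do not break before Extend, ZWJ or Virama.
    if s[i] in EXTEND or s[i] in VIRAMA or s[i] == ZWJ:
        return True
    # Amended GB9c: (Virama | ZWJ) x Extend* LinkingConsonant.
    if s[i] in CONSONANT:
        j = i - 1
        while j >= 0 and s[j] in EXTEND:
            j -= 1
        return j >= 0 and (s[j] in VIRAMA or s[j] == ZWJ)
    return False  # otherwise, break

def clusters(s):
    if not s:
        return []
    out, start = [], 0
    for i in range(1, len(s)):
        if not no_break_before(s, i):
            out.append(s[start:i])
            start = i
    out.append(s[start:])
    return out

# <KA, VIRAMA, KA>: one cluster under the draft (two under Unicode 10).
assert clusters("\u0915\u094D\u0915") == ["\u0915\u094D\u0915"]
# <KA, KA>: no virama, so two clusters.
assert clusters("\u0915\u0915") == ["\u0915", "\u0915"]
# <KA, VIRAMA, NUKTA, KA>: Extend* lets the rule see through the nukta.
assert clusters("\u0915\u094D\u093C\u0915") == ["\u0915\u094D\u093C\u0915"]
```

Dropping the Extend* from the rule makes the third example split after the nukta, which is exactly the normalisation instability discussed upthread; swapping in other scripts' viramas and pure killers gives a quick way to probe the tailoring questions raised here.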
From unicode at unicode.org Wed Dec 13 12:36:35 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 13 Dec 2017 18:36:35 +0000 Subject: Atomicity of Grapheme Clusters Message-ID: <20171213183635.6faf88e6@JRWUBU2> I have been reviewing UAX#29 Unicode Text Segmentation because I have a feeling we will be trying to do too much with the concept of grapheme clusters, even with tailoring, when we extend it to include whole aksharas. What is the meaning of "Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries"? In particular, whom is it directed to? Now, once quadrate support is added and we are able to write Ancient Egyptian in Unicode, we will probably have two very significant languages that regularly breach parts of that rule. (At least, I assume a whole Egyptian quadrate would be included in a dropped capital.) Sanskrit word boundaries frequently occur within *legacy* grapheme clusters, and sentence boundaries may occur within quadrates. I presume UAX#29 does not intend that we should use means other than Unicode to write samhita Sanskrit and Ancient Egyptian. Richard. From unicode at unicode.org Thu Dec 14 02:09:57 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 08:09:57 +0000 Subject: Word_Break for Hieroglyphs Message-ID: <20171214080957.419a5668@JRWUBU2> Is there any valid reason for Egyptian hieroglyphs to have Word_Break=ALetter rather than Complex_Context? So far as I am aware, hieroglyphs lack visible word breaks in both inscriptions and in modern transcriptions. Richard. 
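Word_Break is not exposed by Python's standard library, but the General_Category of the Egyptian Hieroglyphs block can be inspected with unicodedata, which at least confirms the characters are classed as ordinary letters (a quick sketch; the code point chosen is simply the first of the block):

```python
import unicodedata

ch = "\U00013000"  # first code point of the Egyptian Hieroglyphs block
print(unicodedata.name(ch))       # EGYPTIAN HIEROGLYPH A001
print(unicodedata.category(ch))   # Lo (Other_Letter)

# Word_Break itself lives in an auxiliary data file (WordBreakProperty.txt),
# not in unicodedata; querying it directly needs a third-party library
# (e.g. the 'regex' module or ICU bindings).
```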
From unicode at unicode.org Thu Dec 14 07:12:10 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 14 Dec 2017 13:12:10 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171214080957.419a5668@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> Message-ID: <6D368253-EA4C-42E3-A938-9FD3EC324E83@evertype.com> On 14 Dec 2017, at 08:09, Richard Wordingham via Unicode wrote: > > Is there any valid reason for Egyptian hieroglyphs to have > Word_Break=ALetter rather than Complex_Context? So far as I am aware, > hieroglyphs lack visible word breaks in both inscriptions and in modern > transcriptions. Why should visibility matter here? Michael Everson From unicode at unicode.org Thu Dec 14 08:14:31 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 14 Dec 2017 15:14:31 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171214080957.419a5668@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> Message-ID: The Word_Break property doesn't have a value Complex_Context, but I think that was just a typo in your message. The word break and line break properties for 1,057 [:Script=Egyp:] characters are currently Word_Break=ALetter Line_Break=Alphabetic Off the top of my head, I think the best course would be to make them both the same as for most of [:Script=Hani:] Word_Break=Other Line_Break=Ideographic We would only need to use Complex_Context [:lb=SA:] for scripts that keep some letters together and break others apart (typically needing dictionary lookup). I would suspect for modern use of Egyp, that is not the case; most people would expect the characters to just flow like ideographs, breaking between any pair: you wouldn't need to disallow breaks between a and a , for example. Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic have a linebreak and general category properties that seem odd and inconsistent to me. 
Line_Break=Close_Punctuation General_Category=Other_Letter items: 8 Egyptian Hieroglyphs ? *O. Buildings, parts of buildings, etc. * items: 6 ?? U+1325B EGYPTIAN HIEROGLYPH O006D ?? U+1325C EGYPTIAN HIEROGLYPH O006E ?? U+1325D EGYPTIAN HIEROGLYPH O006F ?? U+13282 EGYPTIAN HIEROGLYPH O033A ?? U+13287 EGYPTIAN HIEROGLYPH O036B ?? U+13289 EGYPTIAN HIEROGLYPH O036D Egyptian Hieroglyphs ? *V. Rope, fiber, baskets, bags, etc. * items: 2 ?? U+1337A EGYPTIAN HIEROGLYPH V011B ?? U+1337B EGYPTIAN HIEROGLYPH V011C Line_Break=Open_Punctuation General_Category=Other_Letter items: 6 Egyptian Hieroglyphs ? *O. Buildings, parts of buildings, etc. * items: 5 ?? U+13258 EGYPTIAN HIEROGLYPH O006A ?? U+13259 EGYPTIAN HIEROGLYPH O006B ?? U+1325A EGYPTIAN HIEROGLYPH O006C ?? U+13286 EGYPTIAN HIEROGLYPH O036A ?? U+13288 EGYPTIAN HIEROGLYPH O036C Egyptian Hieroglyphs ? *V. Rope, fiber, baskets, bags, etc. * items: 1 ?? U+13379 EGYPTIAN HIEROGLYPH V011A Mark On Thu, Dec 14, 2017 at 9:09 AM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > Is there any valid reason for Egyptian hieroglyphs to have > Word_Break=ALetter rather than Complex_Context? So far as I am aware, > hieroglyphs lack visible word breaks in both inscriptions and in modern > transcriptions. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 08:22:54 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Thu, 14 Dec 2017 14:22:54 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> Message-ID: <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> On 14 Dec 2017, at 14:14, Mark Davis ?? via Unicode wrote: > The Word_Break property doesn't have a value Complex_Context, but I think that was just a typo in your message. 
> > The word break and line break properties for 1,057 [:Script=Egyp:] characters are currently > > Word_Break=ALetter > Line_Break=Alphabetic > > Off the top of my head, I think the best course would be to make them both the same as for most of [:Script=Hani:] > > Word_Break=Other > Line_Break=Ideographic Egyptian is not ideographic and is certainly not fixed-width. CJK does not cluster. Why should you want to make them the same? Moreover, these properties were defined at the beginning, were they not? Bob Richmond and others will certainly have a view on this. > We would only need to use Complex_Context [:lb=SA:] for scripts that keep some letters together and break others apart (typically needing dictionary lookup). I would suspect for modern use of Egyp, that is not the case; Please do not "suspect". It is not hard to ask experts. > most people would expect the characters to just flow like ideographs, breaking between any pair: NO. Clusters cannot be broken up just anywhere. > you wouldn't need to disallow breaks between a and a , for example. > > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic have a linebreak and general category properties that seem odd and inconsistent to me. > > Line_Break=Close_Punctuation > General_Category=Other_Letter items: 8 > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 6 > > ?? U+1325B EGYPTIAN HIEROGLYPH O006D > ?? U+1325C EGYPTIAN HIEROGLYPH O006E > ?? U+1325D EGYPTIAN HIEROGLYPH O006F > ?? U+13282 EGYPTIAN HIEROGLYPH O033A > ?? U+13287 EGYPTIAN HIEROGLYPH O036B > ?? U+13289 EGYPTIAN HIEROGLYPH O036D > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 2 > > ?? U+1337A EGYPTIAN HIEROGLYPH V011B > ?? U+1337B EGYPTIAN HIEROGLYPH V011C > Line_Break=Open_Punctuation > General_Category=Other_Letter items: 6 > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 5 > > ?? U+13258 EGYPTIAN HIEROGLYPH O006A > ?? U+13259 EGYPTIAN HIEROGLYPH O006B > ?? 
U+1325A EGYPTIAN HIEROGLYPH O006C > ?? U+13286 EGYPTIAN HIEROGLYPH O036A > ?? U+13288 EGYPTIAN HIEROGLYPH O036C > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 1 > > ?? U+13379 EGYPTIAN HIEROGLYPH V011A These properties were chosen explicitly when Egyptian was first defined. Those are enclosing punctuation characters. Michael Everson. From unicode at unicode.org Thu Dec 14 08:53:13 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Thu, 14 Dec 2017 15:53:13 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: Mark On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson wrote: > On 14 Dec 2017, at 14:14, Mark Davis ?? via Unicode > wrote: > > > The Word_Break property doesn't have a value Complex_Context, but I > think that was just a typo in your message. > > > > The word break and line break properties for 1,057 [:Script=Egyp:] > characters are currently > > > > Word_Break=ALetter > > Line_Break=Alphabetic > > > > Off the top of my head, I think the best course would be to make them > both the same as for most of [:Script=Hani:] > > > > Word_Break=Other > > Line_Break=Ideographic > > Egyptian is not ideographic and is certainly not fixed-width. CJK does not > cluster. Why should you want to make them the same? Fixed-width has *nothing* to do with these properties. The issue is whether spaces are required between words. The impact of *these* properties with their current values is that - you would never break a word within a string of hieroglyphs (eg double-click) and - you would only break within a string of hieroglyphs if there are no spaces, etc. on the line. 
For example, if you have a string of 300 hieroglyphs in a paragraph, double clicking on one of them would select the entire string, because as far as Word_Break is concerned, the entire 300 characters form one word. For linebreak, you would only break when forced. So in a paragraph of passages of English + hieroglyphs (represented here by CAPS), you would only break at the spaces and when forced. For example, suppose we have: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER is constructed from 15 words with... It would not line break (with the current properties) as: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLAREN KQLNRKEWLQNFNNAKDFNFNQKLER is constructed from 15 words with... but rather as: ... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEW LQNFNNAKDFNFNQKLER is constructed from 15 words with... > Moreover, these properties were defined at the beginning, were they not? > Bob Richmond and others will certainly have a view on this. > If there is defined clustering behavior that affects line break, then the line break property value would need to be Complex_Context. But the *current* value is Alphabetic, which makes any length of hieroglyphs function as one (possibly very long) word. That appears clearly wrong, even if it was "defined at the beginning". Properties are not carved in stone (so to speak); we sometimes find out later, especially for seldom used scripts, that property values can be improved. > > We would only need to use Complex_Context [:lb=SA:] for scripts that > keep some letters together and break others apart (typically needing > dictionary lookup). I would suspect for modern use of Egyp, that is not the > case; > > Please do not "suspect". It is not hard to ask experts. > You misunderstand. When I say "I suspect" that means I'm not certain. Thus I would like people who are both knowledgeable about hieroglyphs *and* Unicode properties to weigh in. 
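The wrapping behaviour described above can be imitated with a naive greedy wrapper that, like the current Line_Break=Alphabetic values, treats any unbroken run of letters as one word (a simplified sketch, not a real UAX #14 implementation; the function name and width are mine):

```python
def wrap_at_spaces(text, width):
    """Greedy wrap that breaks only at spaces: a run containing no
    spaces lands on a line by itself, however far it overshoots."""
    lines, line, length = [], [], 0
    for word in text.split():
        extra = len(word) + (1 if line else 0)  # +1 for joining space
        if line and length + extra > width:
            lines.append(" ".join(line))
            line, length = [], 0
            extra = len(word)
        line.append(word)
        length += extra
    if line:
        lines.append(" ".join(line))
    return lines

# Stand-in for a 58-character hieroglyph run, as in Mark's example.
glyphs = "ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER"
for ln in wrap_at_spaces(f"the passage {glyphs} is constructed", 30):
    print(ln)  # the unbroken run overflows its 30-column line
```

With these property values, the whole run is one word, so a renderer's only alternatives are overflow or a forced break exactly at the margin.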
I know that people like Andrew Glass are on this list, who satisfy both criteria. > > most people would expect the characters to just flow like > ideographs, breaking between any pair: > > NO. Clusters cannot be broken up just anywhere. > A simple assertion without more information is useless. Does that mean that ancient inscriptions would leave gaps at the end of lines in order to not break a cluster, or that modern users would expect software to leave gaps at the end of lines in order to not break a cluster? And what constitutes a cluster? Is that semantically determined (eg like Thai), or is it based on algorithmic features of the hieroglyphs? > > you wouldn't need to disallow breaks between a with an axe> and a , for example. > > > > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic > have a linebreak and general category properties that seem odd and > inconsistent to me. > > > > Line_Break=Close_Punctuation > > General_Category=Other_Letter items: 8 > > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 6 > > > > ?? U+1325B EGYPTIAN HIEROGLYPH O006D > > ?? U+1325C EGYPTIAN HIEROGLYPH O006E > > ?? U+1325D EGYPTIAN HIEROGLYPH O006F > > ?? U+13282 EGYPTIAN HIEROGLYPH O033A > > ?? U+13287 EGYPTIAN HIEROGLYPH O036B > > ?? U+13289 EGYPTIAN HIEROGLYPH O036D > > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 2 > > > > ?? U+1337A EGYPTIAN HIEROGLYPH V011B > > ?? U+1337B EGYPTIAN HIEROGLYPH V011C > > Line_Break=Open_Punctuation > > General_Category=Other_Letter items: 6 > > Egyptian Hieroglyphs ? O. Buildings, parts of buildings, etc. items: 5 > > > > ?? U+13258 EGYPTIAN HIEROGLYPH O006A > > ?? U+13259 EGYPTIAN HIEROGLYPH O006B > > ?? U+1325A EGYPTIAN HIEROGLYPH O006C > > ?? U+13286 EGYPTIAN HIEROGLYPH O036A > > ?? U+13288 EGYPTIAN HIEROGLYPH O036C > > Egyptian Hieroglyphs ? V. Rope, fiber, baskets, bags, etc. items: 1 > > > > ?? 
U+13379 EGYPTIAN HIEROGLYPH V011A > > These properties were chosen explicitly when Egyptian was first defined. > Those are enclosing punctuation characters. > The issue is that the general category property values are *not* punctuation characters, so there appears to be an inconsistency (as I said). > > Michael Everson. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 10:27:00 2017 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Thu, 14 Dec 2017 08:27:00 -0800 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Gentlemen, On 12/14/2017 6:53 AM, Mark Davis ?? via Unicode wrote: > Thus I would like people who are both knowledgeable about hieroglyphs > /and/ Unicode properties to weigh in. I know that people like Andrew > Glass are on this list, who satisfy both criteria. > > And what constitutes a cluster? This entire discussion is premature. The model for Egyptian is in flux right now. What constitutes a "quadrat", which is significantly relevant to any determination of how other segmentation properties should work for Egyptian hieroglyphics, will depend on the details of the model and how quadrat formation interacts with the exact set of format controls eventually agreed upon. See: http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf (And please note that that has a reference list of 13 *other* documents. This is not simple stuff.) When we get closure on the Egyptian model, *then* will be the time to make suggestions for how Egyptian values for GCB, WB, and LB might be adjusted for possible better default behavior. --Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Dec 14 12:11:33 2017 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Thu, 14 Dec 2017 18:11:33 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Message-ID: We've made a lot of progress on Hieroglyphs this year with the addition of the quadrat forming controls (thanks again to everyone involved in that effort and in the preceding 13 documents). I like to think that that part of the model is no longer in flux. Certainly, there is more work to be done on correct breaking. At this point we know that quadrat breaks != word breaks, but quadrat boundaries must align with line breaks. We had some discussion on the sidelines of the August UTC meeting at which time it became clear that more work is needed as current property values are not entirely correct. Currently, my Hieroglyphic energies are focused on completing font documentation and a reference font. I think it will be most helpful to understand the properties when we have a font that fully supports the quadrat controls so we have specific examples we can look at and confer on with specialists. So I'm happy to take Ken's suggestion that we don't rush in here. Cheers, Andrew From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler via Unicode Sent: Thursday, December 14, 2017 8:27 AM To: mark Cc: unicode at unicode.org Subject: Re: Word_Break for Hieroglyphs Gentlemen, On 12/14/2017 6:53 AM, Mark Davis ?? via Unicode wrote: Thus I would like people who are both knowledgeable about hieroglyphs and Unicode properties to weigh in. I know that people like Andrew Glass are on this list, who satisfy both criteria. And what constitutes a cluster? This entire discussion is premature. The model for Egyptian is in flux right now. 
What constitutes a "quadrat", which is significantly relevant to any determination of how other segmentation properties should work for Egyptian hieroglyphics, will depend on the details of the model and how quadrat formation interacts with the exact set of format controls eventually agreed upon. See: http://www.unicode.org/L2/L2017/17112r-quadrat-encoding.pdf (And please note that that has a reference list of 13 *other* documents. This is not simple stuff.) When we get closure on the Egyptian model, *then* will be the time to make suggestions for how Egyptian values for GCB, WB, and LB might be adjusted for possible better default behavior. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Dec 14 14:13:21 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 20:13:21 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <44ddcbd0-a9d1-0ada-d8e6-85f268f7d7e7@att.net> Message-ID: <20171214201321.30706b7b@JRWUBU2> On Thu, 14 Dec 2017 18:11:33 +0000 Andrew Glass via Unicode wrote: > We had some discussion on the sidelines of > the August UTC meeting at which time it became clear that more work > is needed as current property values are not entirely correct. > Currently, my Hieroglyphic energies are focused on completing font > documentation and a reference font. I think it will be most helpful > to understand the properties when we have a font that fully supports > the quadrat controls so we have specific examples we can look at and > confer on with specialists. So I'm happy to take Ken's suggestion > that we don't rush in here. I'll read that as saying there is no need to report a problem; that we already know that there will be a problem with real text of more than a few characters. 
(The current encoding was justified as primarily defining short strings marshalled by a layout language.) I was approaching hieroglyphs as a system where grapheme cluster breaks, line break opportunities and sentence boundaries have little connection, unlike the hierarchy seen in most writing systems. At least, quadrats seem to be strong candidates for the status of grapheme clusters. Richard. From unicode at unicode.org Thu Dec 14 15:13:22 2017 From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode) Date: Thu, 14 Dec 2017 22:13:22 +0100 (CET) Subject: Include emoticons in CLDR character annotation? Message-ID: <310424535.7599.1513286002504@ox.hosteurope.de> The CLDR Survey Tool is currently open to, among other things, collect improvements to (emoji) character names and keywords. I don't see it being done for any language yet, but wouldn't it make sense to add classic emoticons (like :-) for various smiling emojis), kaomoji (like o/ for Person Raising Hand) and ASCII art (like ><)))?> and similar for Fish) to the keywords of Face emoji? From unicode at unicode.org Thu Dec 14 16:40:23 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 14 Dec 2017 22:40:23 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> Message-ID: <20171214224023.23e5723f@JRWUBU2> On Mon, 11 Dec 2017 21:45:23 +0000 Cibu Johny (????) wrote: > I am assuming the purpose of the grapheme cluster definition is to be > used for line spacing, vertical writing or cursor movement. Without > defining the purpose, it is hard for me to say if a ruleset is valid > or not. Assuming that purpose driven definition, we probably need > language specific definitions - a pan-indic algorithm may not work. > For instance, the proposed ruleset, may not hold good for Tamil. For > example, see the title in the following image: ??????? 
broken as > [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed > algorithm it would be: [ta-u, ka-virama-lla, ka-virama] > > http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg I think Tamil is actually rather straightforward. For native intuition, I would cite the Tamil letter-counting account at https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf. What the author counts is not spacing glyphs, but vowel letters and consonant characters, with two significant modifications. Firstly, K.SSA counts as just one consonant, and SH.R.II is also counted as containing a single consonant. In other words, the Tamil virama character works as a pure killer except in those two environments. This is also the story the TUNE protagonists tell us. It will be an inelegant rule for UAX#29, but, unfortunately, reality is messy. > Malayalam could be a similar story. In case of Malayalam, it can be > font specific because of the existence of traditional and reformed > writing styles. A conjunct might be a ligature in traditional; and it > might get displayed with explicit virama in the reformed style. For > example see the poster with word ??????? broken as [u, sa-virama, > ta-aa, da-virama] - as it is written in the reformed style. As per > the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. > These breaks would be used by the traditional style of writing. Working round that seems to be tricky. The best I can think of is to have two different locales, traditional and reformed, and hope that the right font is selected. It doesn't seem at all straightforward to work out what the font is doing even from a character to glyph map without knowing what the glyphs are. I'm not sure how one should have the difference designated - language variants, or two scripts? 
> > [image: image.png] > https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg > BTW, there is an example with explicit virama in the proposal under > the Sanskrit section: The alleged grapheme cluster is the last cluster of the second word in the Sanskrit section of L2/17-200 Recommendations to UTC #152 on Text segmentation in Indian languages (https://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf). The rendering seems odd if there is no ZWNJ in the word. I read the word as ???????????? pprpadya with two pitch accents. However, I can't explain the visible virama under the DA - even a Hindi font should have a conjunct for D.YA. Richard. From unicode at unicode.org Sat Dec 16 16:06:03 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 16 Dec 2017 22:06:03 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> Message-ID: <20171216220603.7137899c@JRWUBU2> On Thu, 14 Dec 2017 15:53:13 +0100 Mark Davis ?? via Unicode wrote: > On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson > wrote: > > NO. Clusters cannot be broken up just anywhere. > Does that mean that ancient inscriptions would leave gaps at the end > of lines in order to not break a cluster, or that modern users would > expect software to leave gaps at the end of lines in order ?to not > break a cluster? And what constitutes a cluster? Is that semantically > determined (eg like Thai), or is it based on algorithmic features of > the hieroglyphs? An absence of gaps in ancient inscriptions would not be revealing. One justification trick available to the engravers was variable spelling - spacing phonetic complements were optional. Original letters would offer the best evidence in this respect. We're going to have some algorithmic clusters - it will make no sense to break quadrats between lines. Also, it would be perverse to line-break a graphic transposition. 
Phonetic elements normally occur in phonetic order, but bird plus tall thin character is usually replaced by tall thin character plus bird. Thus splitting ?????? /wd?/ 'order' i.e. into wD on one line and w-Y1A on the next would be perverse. Unfortunately, I don't know whether it happens or not. Preventing this particular example ought to require a semantic analysis, but I couldn't find an example of word final V024 in the free, 2006 edition of Paul Dickson's "Dictionary of Middle Egyptian in Gardiner Classification Order", so perhaps a sequence wD-w will always be word-internal. Richard. From unicode at unicode.org Sun Dec 17 09:16:20 2017 From: unicode at unicode.org (David P. Kendal via Unicode) Date: Sun, 17 Dec 2017 16:16:20 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster Message-ID: Hi, It's possible I'm missing something, but the formal grammar/regular expression given for extended grapheme clusters appears to have a bug in it. The bug is here: RI-Sequence := Regional_Indicator+ If the formal grammar is intended to exactly match the rules given in the "Grapheme Cluster Boundary Rules" section below it as-is, then this should be RI-Sequence := Regional_Indicator Regional_Indicator since as given it would cause any number of RI characters to coalesce into a single grapheme cluster, instead of pairs of characters. That is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one grapheme cluster instead of the correct two. -- dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ we do these things not because they are easy, +49 159 03847809 but because we thought they were going to be easy ? ?The Programmers? 
Credo?, Maciej Ceg?owski From unicode at unicode.org Sun Dec 17 11:17:57 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 17 Dec 2017 18:17:57 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: Thanks for the feedback. You're correct about this; that is a holdover from an earlier version of the document when there was a more basic treatment of RI sequences. There is already an action to modify these. There is a placeholder review note about that just above http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters (scroll up just a bit). Mark Mark On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode < unicode at unicode.org> wrote: > Hi, > > It?s possible I?m missing something, but the formal grammar/regular > expression given for extended grapheme clusters appears to have a bug > in it. > Sequences_and_Grapheme_Clusters> > > The bug is here: > > RI-Sequence := Regional_Indicator+ > > If the formal grammar is intended to exactly match the rules given the > the ?Grapheme Cluster Boundary Rules? section below it as-is, then > this should be > > RI-Sequence := Regional_Indicator Regional_Indicator > > since as given it would cause any number of RI characters to coalesce > into a single grapheme cluster, instead of pairs of characters. That > is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one > grapheme cluster instead of the correct two. > > -- > dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ > we do these things not because they are easy, +49 159 03847809 > but because we thought they were going to be easy > ? ?The Programmers? Credo?, Maciej Ceg?owski > > > -------------- next part -------------- An HTML attachment was scrubbed... 
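The corrected pairing behaviour (rules GB12/GB13 in current UAX #29) can be illustrated with a small segmenter for runs of Regional Indicator symbols (a sketch for RI-only input; a real implementation must handle the full grapheme cluster rule set, and the function name is mine):

```python
RI = range(0x1F1E6, 0x1F200)  # REGIONAL INDICATOR SYMBOL LETTER A..Z

def ri_clusters(text):
    """Group Regional Indicators pairwise, as GB12/GB13 require:
    four RIs form two grapheme clusters, not one."""
    clusters, i = [], 0
    while i < len(text):
        if ord(text[i]) in RI and i + 1 < len(text) and ord(text[i + 1]) in RI:
            clusters.append(text[i:i + 2])
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters

# U+1F1EC U+1F1E7 U+1F1EA U+1F1FA: two flags (GB, EU), so two clusters.
flags = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
print(len(ri_clusters(flags)))  # 2
```

Under the older `Regional_Indicator+` grammar the same four characters would have coalesced into a single cluster, which is exactly the bug reported above.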
URL: From unicode at unicode.org Mon Dec 18 03:59:06 2017 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Mon, 18 Dec 2017 09:59:06 +0000 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: Ah! That explains why pcre2grep -u '^\X{1}$' matches with ???? ???????? ???????????? ???????????????????? ...etc... Andr? Schappo On 17 Dec 2017, at 17:17, Mark Davis ?? via Unicode > wrote: Thanks for the feedback. You're correct about this; that is a holdover from an earlier version of the document when there was a more basic treatment of RI sequences. There is already an action to modify these. There is a placeholder review note about that just above http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters (scroll up just a bit). Mark Mark On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode > wrote: Hi, It?s possible I?m missing something, but the formal grammar/regular expression given for extended grapheme clusters appears to have a bug in it. The bug is here: RI-Sequence := Regional_Indicator+ If the formal grammar is intended to exactly match the rules given the the ?Grapheme Cluster Boundary Rules? section below it as-is, then this should be RI-Sequence := Regional_Indicator Regional_Indicator since as given it would cause any number of RI characters to coalesce into a single grapheme cluster, instead of pairs of characters. That is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one grapheme cluster instead of the correct two. -- dpk (David P. Kendal) ? Nassauische Str. 36, 10717 DE ? http://dpk.io/ we do these things not because they are easy, +49 159 03847809 but because we thought they were going to be easy ? ?The Programmers? Credo?, Maciej Ceg?owski ?? ?? ?? Andr? 
Schappo https://schappo.blogspot.co.uk https://twitter.com/andreschappo https://weibo.com/andreschappo https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Dec 18 08:15:11 2017 From: unicode at unicode.org (Serge Rosmorduc via Unicode) Date: Mon, 18 Dec 2017 15:15:11 +0100 Subject: Word_Break for Hieroglyphs In-Reply-To: <20171216220603.7137899c@JRWUBU2> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> Message-ID: <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> Hello, Concerning word separation and clusters, there was a variety of different practices. At best, one could say that statistically, there is a positive correlation between word cuts and cluster limits. This being said, it depends widely on the era, the quality of the inscription, and the available space. At some times (for instance, XIIth dynasty), the scribes would work hard to avoid cutting a word between two lines. At other times, and in other circumstances (limited available space), word cutting could be extreme. For instance, in Stela Cairo CGC 34025 (AKA Israel Stela), Merenptah's text, reusing a stela by Amenophis III, lacks room. Hence, you have things like (lines 5-6): the word ?sy "small" is cut between the two lines. The phonetic part is on line 5, and the bird determinative is alone on line 6, above the preposition "m", which is itself above the consonant "m" which is the first consonant of the following word. I have written the three words in different colours to show how they intertwine. Best regards, Serge Rosmorduc -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Xsy.png Type: image/png Size: 2790 bytes Desc: not available URL: From unicode at unicode.org Mon Dec 18 08:17:04 2017 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 18 Dec 2017 15:17:04 +0100 Subject: Possible bug in formal grammar for extended grapheme cluster In-Reply-To: References: Message-ID: If you look back at http://www.unicode.org/reports/tr29/tr29-27.html#GB8a (2015), the rule was simply not to break sequences of RI characters. We changed that in http://www.unicode.org/reports/tr29/tr29-29.html#GB12 (2016) to only group pairs. Unfortunately, the (informative) table http://www.unicode.org/reports/tr29/tr29-31.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters was not updated after 2015 to keep pace with the changes in rules. So that is still to do.... Mark On Mon, Dec 18, 2017 at 10:59 AM, Andre Schappo via Unicode < unicode at unicode.org> wrote: > Ah! That explains why > > pcre2grep -u '^\X{1}$' > > matches with > > ???? > ???????? > ???????????? > ???????????????????? > > ...etc... > > Andr? Schappo > > On 17 Dec 2017, at 17:17, Mark Davis ?? via Unicode > wrote: > > Thanks for the feedback. You're correct about this; that is a holdover > from an earlier version of the document when there was a more basic > treatment of RI sequences. > > There is already an action to modify these. There is a placeholder review > note about that just above > > http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_ > Sequences_and_Grapheme_Clusters > > (scroll up just a bit). > > Mark > > Mark > > On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode < > unicode at unicode.org> wrote: > >> Hi, >> >> It?s possible I?m missing something, but the formal grammar/regular >> expression given for extended grapheme clusters appears to have a bug >> in it. 
>> > ences_and_Grapheme_Clusters> >> >> The bug is here: >> >> RI-Sequence := Regional_Indicator+ >> >> If the formal grammar is intended to exactly match the rules given the >> the ?Grapheme Cluster Boundary Rules? section below it as-is, then >> this should be >> >> RI-Sequence := Regional_Indicator Regional_Indicator >> >> since as given it would cause any number of RI characters to coalesce >> into a single grapheme cluster, instead of pairs of characters. That >> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one >> grapheme cluster instead of the correct two. >> >> -- >> dpk (David P. Kendal) ? Nassauische Str. 36, 10717 >> >> DE ? http://dpk.io/ >> we do these things not because they are easy, +49 159 03847809 >> but because we thought they were going to be easy >> ? ?The Programmers? Credo?, Maciej Ceg?owski >> >> >> > > ?? ?? ?? > Andr? Schappo > https://schappo.blogspot.co.uk > https://twitter.com/andreschappo > https://weibo.com/andreschappo > https://groups.google.com/forum/#!forum/computer-science-curriculum- > internationalization > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Dec 20 02:46:33 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Wed, 20 Dec 2017 08:46:33 +0000 Subject: Word_Break for Hieroglyphs In-Reply-To: <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> Message-ID: <20171220084633.56725ae1@JRWUBU2> On Mon, 18 Dec 2017 15:15:11 +0100 Serge Rosmorduc via Unicode wrote: > Hence, you have things like (like 5-6) : : the word ?sy ? small ?, > is cut between the two lines. The phonetic part is line 5, and the > bird determinative is alone on line 5, above the preposition ? m ?, > which is itself above the consonnant ? m ? 
which is the first
> consonant of the following word. I have written the three words in
> different colours to show how they are intertwined.

In an implementation that offered genuine whole-word selection, and thus tackled the challenges of Chinese, Japanese, Korean and Vietnamese (both scripts, not just CJKV) as well as Thai, I would expect the selections to be bounded by word boundaries. Thus, if the cited line break (labelled by '6') were not in the text, I would expect double-clicking on the quadrat G37:Aa13:Aa13 to select all three words.

Looking at the rendering in https://mjn.host.cs.st-andrews.ac.uk/egyptian/texts/corpus/pdf/Merneptah.pdf, it is worth noting that the cartouche in Line 4 of the inscription is not broken between lines. I don't know whether this is to avoid breaking the cartouche or to avoid separating the facing figure therein.

Richard.

From unicode at unicode.org Wed Dec 20 03:06:28 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Wed, 20 Dec 2017 18:06:28 +0900
Subject: Word_Break for Hieroglyphs
In-Reply-To: <20171220084633.56725ae1@JRWUBU2>
References: <20171214080957.419a5668@JRWUBU2> <0D1254C0-16B6-4F89-B8F0-06342D945B9E@evertype.com> <20171216220603.7137899c@JRWUBU2> <02383052-7697-41D9-A197-CC65C3DDD48E@iut.univ-paris8.fr> <20171220084633.56725ae1@JRWUBU2>
Message-ID: <0c4fb320-f1c5-4d02-b8d2-4c1ddd8531ca@it.aoyama.ac.jp>

On 2017/12/20 17:46, Richard Wordingham via Unicode wrote:
> In an implementation that offered genuine whole-word selection, and
> thus tackled the challenges of Chinese, Japanese, Korean and
> Vietnamese (both scripts, not just CJKV) as well as Thai, I would
> expect the selections to be bounded by word boundaries. Thus, if the
> cited line break (labelled by '6') were not in the text, I would expect
> double-clicking on the quadrat G37:Aa13:Aa13 to select all three words.
This may be common knowledge to some, but I just had a Japanese document open in MS Word, and tried double-clicking to see what happens. What it does is select same-script runs. This means that a run of kanji, a run of hiragana, or a run of katakana is selected (interestingly, the (kata)kana length mark is treated as a fourth script). This is of course not the same as words, but it can match, and it comes close in terms of offering something for editorial convenience while being easy to implement.

Regards, Martin.

From unicode at unicode.org Thu Dec 21 02:55:33 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Thu, 21 Dec 2017 17:55:33 +0900
Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues
In-Reply-To: <20171214224023.23e5723f@JRWUBU2>
References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2>
Message-ID: <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp>

On 2017/12/15 07:40, Richard Wordingham via Unicode wrote:
> On Mon, 11 Dec 2017 21:45:23 +0000
> Cibu Johny (????) wrote:
>> Malayalam could be a similar story. In case of Malayalam, it can be
>> font specific because of the existence of traditional and reformed
>> writing styles. A conjunct might be a ligature in traditional; and it
>> might get displayed with explicit virama in the reformed style. For
>> example see the poster with word ??????? broken as [u, sa-virama,
>> ta-aa, da-virama] - as it is written in the reformed style. As per
>> the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama].
>> These breaks would be used by the traditional style of writing.
>
> Working round that seems to be tricky. The best I can think of is to
> have two different locales, traditional and reformed, and hope that the
> right font is selected.
It doesn't seem at all straightforward to > work out what the font is doing even from a character to glyph map > without knowing what the glyphs are. I'm not sure how one should have > the difference designated - language variants, or two scripts? I'm not at all familiar with Malayalam, but from my experience with typing Japanese (where the average kana character requires two keystrokes for input, but only one for deleting) would lead to different advice. When typing, it is very helpful to know how many times one has to hit backspace when making an error. This kind of knowledge is usually assimilated into what one calls muscle memory, i.e. it is done without thinking about it. I would guess that would be very difficult to maintain two different kinds of muscle memory for typing Malayalam. (My assumption is that the populations typing traditional and reformed writing styles are not disjoint.) Regards, Martin. From unicode at unicode.org Thu Dec 21 15:44:49 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 21 Dec 2017 21:44:49 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> Message-ID: <20171221214449.7567d0ac@JRWUBU2> On Thu, 21 Dec 2017 17:55:33 +0900 "Martin J. D?rst via Unicode" wrote: > On 2017/12/15 07:40, Richard Wordingham via Unicode wrote: > > On Mon, 11 Dec 2017 21:45:23 +0000 > > Cibu Johny (????) wrote: > >> For example see the poster with word ??????? broken as [u, > >> sa-virama, ta-aa, da-virama] - as it is written in the reformed > >> style. As per the proposed algorithm, it would be [u, > >> sa-virama-ta-aa, da-virama]. These breaks would be used by the > >> traditional style of writing. 
> I'm not at all familiar with Malayalam, but from my experience with > typing Japanese (where the average kana character requires two > keystrokes for input, but only one for deleting) would lead to > different advice. When typing, it is very helpful to know how many > times one has to hit backspace when making an error. This kind of > knowledge is usually assimilated into what one calls muscle memory, > i.e. it is done without thinking about it. I would guess that would > be very difficult to maintain two different kinds of muscle memory > for typing Malayalam. (My assumption is that the populations typing > traditional and reformed writing styles are not disjoint.) When deleting by backspace, the usual practice is to delete one Unicode character for each key press. The proposed change to the definition of grapheme clusters will not affect this. What will change, for some systems, is stepping through Indic text in most scripts. (The visual order scripts will be unaffected.) In Linux applications, one can often step to the start of each grapheme cluster, i.e. to the breaks in |u|sa-virama|ta-aa|da-virama|. If the proposal to expand extended grapheme clusters to whole aksharas goes through, a likely effect for traditional Malayalam is that one will only be able to step to the positions marked as breaks in |u|sa-virama-ta-aa|da-virama|. Every major system will then be in the same position as Windows, where already only the reduced set of cursor positions is allowed. Thus if the 'sa' were mistyped, one would have to retype the entire 4-character akshara. I find this an unpleasant prospect, and some Indians already find it extremely annoying not to be able to edit the join between consonants, e.g. to replace by . Richard. 
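[Editor's note: the contrast discussed above, between backspace deleting one code point and cursor movement stepping over whole grapheme clusters, can be sketched in Python. This is an illustrative sketch, not code from the thread: the segmenter implements only a small fragment of UAX #29, attaching combining marks (general categories Mn, Mc, Me) and ZWJ sequences to the preceding base; production code should use a full implementation such as ICU, or the third-party regex module's \X.]

```python
import unicodedata

ZWJ = "\u200d"

def simple_clusters(s):
    """Toy grapheme-cluster segmentation: attach combining marks and
    ZWJ-joined characters to the preceding base character. Covers only
    a small fragment of the UAX #29 rules."""
    clusters = []
    for ch in s:
        extend = (unicodedata.category(ch) in ("Mn", "Mc", "Me")
                  or ch == ZWJ
                  or (clusters and clusters[-1].endswith(ZWJ)))
        if clusters and extend:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Devanagari "namaste": na, ma, sa, virama, ta, vowel sign e
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

# Backspace deleting one code point removes only the final vowel sign:
by_code_point = word[:-1]

# Deleting a whole grapheme cluster removes the final consonant
# together with its vowel sign:
by_cluster = "".join(simple_clusters(word)[:-1])
```

On this word the toy segmenter yields four clusters, |na|ma|sa-virama|ta-e|, matching the cluster-by-cluster stepping described above before any proposed expansion to whole aksharas.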
From unicode at unicode.org Thu Dec 21 18:18:34 2017
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Thu, 21 Dec 2017 17:18:34 -0700
Subject: Inconsistency between UTS 39 and 24
Message-ID: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>

In http://unicode.org/reports/tr39/#Mixed_Script_Detection it says, "For more information on the Script_Extensions property and Jpan, Kore, and Hanb, see UAX #24"

In http://www.unicode.org/reports/tr24/, there certainly is more information on scx; however, none of the terms Jpan, Kore, or Hanb is mentioned.

From unicode at unicode.org Thu Dec 21 20:11:37 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 22 Dec 2017 03:11:37 +0100
Subject: Inconsistency between UTS 39 and 24
In-Reply-To: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>
References: <7e516a02-14eb-1020-3561-864251346a34@khwilliamson.com>
Message-ID:

These are ISO 15924 script codes for script variants or groups of related scripts, not used in the Unicode classification of characters due to their unification (even if there are registered variants for them).

2017-12-22 1:18 GMT+01:00 Karl Williamson via Unicode:
> In http://unicode.org/reports/tr39/#Mixed_Script_Detection
> it says, "For more information on the Script_Extensions property and Jpan,
> Kore, and Hanb, see UAX #24"
>
> In http://www.unicode.org/reports/tr24/, there certainly is more
> information on scx; however, none of the terms Jpan, Kore, or Hanb is
> mentioned.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From unicode at unicode.org Fri Dec 22 00:04:37 2017 From: unicode at unicode.org (Manish Goregaokar via Unicode) Date: Thu, 21 Dec 2017 22:04:37 -0800 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171221214449.7567d0ac@JRWUBU2> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: > When deleting by backspace, the usual practice is to delete one Unicode character for each key press. This seems to depend on the operating system and program involved. For example, on OSX any native text input field (Spotlight, TextEdit, etc) will delete by extended grapheme cluster. Chrome also deletes by extended grapheme cluster. However, Firefox deletes by code point. Or, more accurately, something codepoint-like. Backspace will delete flag emoji wholesale, but will delete the jamos in `????????` (a single EGC) one at a time. It also deletes the variation selector and the heart in `????????` in a single keystroke. There's probably a simple metric being used here, but I haven't looked into it yet. ----------- Overall it seems like there's a different preference for forming clusters in different scripts. Perhaps we should have a specific "cluster forming virama" category for viramas from scripts that almost always prefer clusters? (e.g. devanagari). IIRC some indic scripts prefer explicit virama rendering. -Manish On Thu, Dec 21, 2017 at 1:44 PM, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 21 Dec 2017 17:55:33 +0900 > "Martin J. D?rst via Unicode" wrote: > > > On 2017/12/15 07:40, Richard Wordingham via Unicode wrote: > > > On Mon, 11 Dec 2017 21:45:23 +0000 > > > Cibu Johny (????) wrote: > > > >> For example see the poster with word ??????? 
broken as [u, > > >> sa-virama, ta-aa, da-virama] - as it is written in the reformed > > >> style. As per the proposed algorithm, it would be [u, > > >> sa-virama-ta-aa, da-virama]. These breaks would be used by the > > >> traditional style of writing. > > > I'm not at all familiar with Malayalam, but from my experience with > > typing Japanese (where the average kana character requires two > > keystrokes for input, but only one for deleting) would lead to > > different advice. When typing, it is very helpful to know how many > > times one has to hit backspace when making an error. This kind of > > knowledge is usually assimilated into what one calls muscle memory, > > i.e. it is done without thinking about it. I would guess that would > > be very difficult to maintain two different kinds of muscle memory > > for typing Malayalam. (My assumption is that the populations typing > > traditional and reformed writing styles are not disjoint.) > > When deleting by backspace, the usual practice is to delete one Unicode > character for each key press. The proposed change to the definition of > grapheme clusters will not affect this. > > What will change, for some systems, is stepping through Indic text in > most scripts. (The visual order scripts will be unaffected.) In Linux > applications, one can often step to the start of each grapheme cluster, > i.e. to the breaks in |u|sa-virama|ta-aa|da-virama|. If the proposal to > expand extended grapheme clusters to whole aksharas goes through, a > likely effect for traditional Malayalam is that one will only be able to > step to the positions marked as breaks in > |u|sa-virama-ta-aa|da-virama|. Every major system will then be in the > same position as Windows, where already only the reduced set of cursor > positions is allowed. Thus if the 'sa' were mistyped, one would have > to retype the entire 4-character akshara. 
I find this an unpleasant > prospect, and some Indians already find it extremely annoying not to be > able to edit the join between consonants, e.g. to replace by > . > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Dec 22 01:27:15 2017 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 22 Dec 2017 09:27:15 +0200 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: (message from Manish Goregaokar via Unicode on Thu, 21 Dec 2017 22:04:37 -0800) References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: <83mv2bmdd8.fsf@gnu.org> > Date: Thu, 21 Dec 2017 22:04:37 -0800 > Cc: Unicode Public > From: Manish Goregaokar via Unicode > > However, Firefox deletes by code point. As does Emacs, btw. From unicode at unicode.org Fri Dec 22 09:36:35 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Dec 2017 15:36:35 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83mv2bmdd8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> Message-ID: <20171222153635.67628752@JRWUBU2> On Fri, 22 Dec 2017 09:27:15 +0200 Eli Zaretskii via Unicode wrote: > > Date: Thu, 21 Dec 2017 22:04:37 -0800 > > Cc: Unicode Public > > From: Manish Goregaokar via Unicode > > > > However, Firefox deletes by code point. > > As does Emacs, btw. And deleting in that fashion from the right is mentioned by UAX#29 with the implication that it is a sensible way of doing things. 
Emacs is civilised in that it allows one to delete character by character from either end. That may, however, require some intelligence on the part of the user so that they don't get confused or frightened when the text rearranges itself. However, it seems that one has to modify the source code of Emacs to be able to edit in the middle of a cluster (other than by substitution commands). Or am I overlooking some per-window 'reveal codes' mode that the cognoscenti can use? Richard. From unicode at unicode.org Fri Dec 22 09:44:39 2017 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 22 Dec 2017 17:44:39 +0200 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <20171222153635.67628752@JRWUBU2> (message from Richard Wordingham via Unicode on Fri, 22 Dec 2017 15:36:35 +0000) References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> Message-ID: <83tvwilqc8.fsf@gnu.org> > Date: Fri, 22 Dec 2017 15:36:35 +0000 > From: Richard Wordingham via Unicode > > Emacs is civilised in that it allows one to delete character by > character from either end. That may, however, require some > intelligence on the part of the user so that they don't get confused > or frightened when the text rearranges itself. However, it seems that > one has to modify the source code of Emacs to be able to edit in the > middle of a cluster You can always delete a codepoint at a given position in Emacs, specifying the position by its number, but there are no user-level commands to conveniently allow doing that in the middle of a grapheme cluster. It was never requested nor deemed necessary to provide such a capability. 
Normally, replacing some portions of a grapheme cluster produces a radically different display, so it makes more sense to delete everything and start anew. Deleting individual codepoints by Backspace is useful for accents and diacritics, which generally are input after the base characters, so that is provided. From unicode at unicode.org Fri Dec 22 16:56:53 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 22 Dec 2017 22:56:53 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> Message-ID: <20171222225653.01cc4b8a@JRWUBU2> On Thu, 21 Dec 2017 22:04:37 -0800 Manish Goregaokar via Unicode wrote: > > When deleting by backspace, the usual practice is to delete one > > Unicode > character for each key press. > > This seems to depend on the operating system and program involved. For > example, on OSX any native text input field (Spotlight, TextEdit, > etc) will delete by extended grapheme cluster. Chrome also deletes by > extended grapheme cluster. That seems nasty, even for Thai with its consonant + vowel + tone legacy grapheme clusters. Or does Thai get special treatment? iPhone messages shows the normal (mandated?) Thai behaviour of deleting character by character. Do you not find this mass deletion annoying for Hindi aksharas with anusvara? > However, Firefox deletes by code point. Or, more accurately, something > codepoint-like. Backspace will delete flag emoji wholesale, but will > delete the jamos in `????????` (a single EGC) one at a time. It > also deletes the variation selector and the heart in `????????` in a > single keystroke. There's probably a simple metric being used here, > but I haven't looked into it yet. There are some odd behaviours around. 
Claws-mail, which I think uses straight GTK2, has been changing its treatment of Latin diacritics. Long ago, if I remember correctly, it treated 'e acute' differently depending on whether it was one or two codepoints, then it started converting text to NFC on input, and now it treats the NFC and NFD sequence as though it were a single codepoint. This might be using the property 'diacritic', but it isn't treating Thai tone marks that way, so I'm guessing. Presumably it's been implemented on the principle that the user should not receive any pleasant surprises. > ----------- > Overall it seems like there's a different preference for forming > clusters in different scripts. Perhaps we should have a specific > "cluster forming virama" category for viramas from scripts that > almost always prefer clusters? (e.g. devanagari). IIRC some indic > scripts prefer explicit virama rendering. The denial of "one size fits all" is appropriate within writing systems as well as across all systems. For example, using grapheme clusters as the unit of matching may generally work well, but is a total disaster in Indic if one needs to replace one vowel by another, as in Hariraama's plea for help on the Indic list on 6 December. Moving the editing position within a cluster is another issue. Sometimes one needs to adjust the type of joining of consonants in an akshara, e.g. to give a Devanagari text a Hindi look even if the text gets displayed with a Sanskrit font. This is where the font-dependent interpretation of a virama is a disaster. For Devanagari it might have been better if there had been three different characters for the two types of joining (half-forms on one hand and conjuncts or repha on the other) and one type of non-joining, the visible virama. Instead, it seems that people rely on the appropriate gaps in the font capability. 
The nettle was grasped for the Myanmar script, which now has an invisible stacker, a pure vowel killer, and a composite code for the repha-type combination.

I really do find it hard to believe that it is considered to be bad to correct a single consonant in the middle of an akshara. I am not persuaded that the users of languages with many multi-consonant aksharas think that each distinct akshara is a different character. The akshara lies in a hierarchy, between the grapheme cluster and the pada patha word. What is needed is an extra level of cursor motion, between the levels of word and grapheme cluster.

Thai also shows different levels of division. For horizontal spacing, the unit is indeed the grapheme cluster. However, looking at a dictionary published in 1971, I noticed that a few marks above are conditionally placed between grapheme clusters. The primary examples are MAI HAN-AKAT and MAI THO, neither of which the Thais considered a vowel back in 1892. (Michell's dictionary of 1892 is apologetic about treating the former as a vowel.) I haven't noticed this behaviour in 20th-century Thai with characters separated by several character widths, though both these marks tend to be placed in the rightmost part of the space allocated to the base consonant. Correct positioning of MAI THO seems to require a grammatical analysis, and the documentation of Uniscribe certainly used to suggest that this was not possible at the font level.

Vertical writing in Thai is extremely rare. There are Thai crosswords, and they do use grapheme clusters. Many, but not all, of the examples use an irregular pentagon to accommodate marks above and a different irregular pentagon to accommodate marks below. Thais seem better acquainted with Scrabble played in English. Commercial vertical signs follow a different grammar. I have two examples, segmented ??-??-?? 'Yamaha', and ??-??-?? 'video', the latter typically accompanied by V-D-O in Roman letters.
It is not clear whether these words are split into super-extended grapheme clusters or syllables.

When it comes to line-breaking, the Thai preference for emergency line-breaks (which are supposed to be beyond the scope of Unicode) seems to be for division into syllables. This seems to be the standard for Lao line-breaking, though it might be connected with the facts that syllable boundaries are easier to detect with modern Lao spelling and that there are far fewer users of Lao than of Thai.

When it comes to detecting aksharas in Tamil, the situation seems to be rather simple. In two environments, U+0BCD TAMIL SIGN VIRAMA behaves like an invisible stacker. Otherwise, it behaves like a pure killer.

For Malayalam, the two writing styles have different behaviours for the virama. Disunification of U+0D4D MALAYALAM SIGN VIRAMA is probably not an option. In theory, one could try demanding that an ambiguous, intentionally visible virama be spelt with ZWNJ, but I doubt that such a command would be heeded.

For Sinhalese, it may be that there is no ambiguity in the effect of the virama, provided one is aware that the current Unicode prescription is contrary to the rules laid down by the government of Sri Lanka.

Note that the recent W3C investigation of Indic layout requirements was restricted to *Indian* Indic scripts. Does anyone here know what a Burmese 'dropped capital' looks like? The investigation did not cover Insular Southeast Asia, where there are characters of Indic syllabic category Virama. The coding of mainland Southeast Asian Indic scripts has evolved beyond the virama, using an invisible stacker and a pure killer instead. The ISCII stage is, so far as I am aware, restricted to India and Sri Lanka.

Richard.
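[Editor's note: the akshara-sized clusters described above (|u|sa-virama-ta-aa|da-virama| rather than |u|sa-virama|ta-aa|da-virama|) can be sketched for Devanagari, where the thread suggests the virama almost always joins. This is an illustrative sketch, not code from the thread; it assumes U+094D uniformly glues the following consonant into the cluster, which is precisely the assumption that fails for Malayalam's two writing styles.]

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA

def aksharas(s):
    """Toy akshara segmentation for Devanagari: a cluster absorbs any
    combining marks, and a virama additionally absorbs the following
    consonant, so a consonant conjunct stays in one unit."""
    out = []
    i = 0
    while i < len(s):
        cluster = s[i]
        i += 1
        while i < len(s) and unicodedata.category(s[i]) in ("Mn", "Mc", "Me"):
            mark = s[i]
            cluster += mark
            i += 1
            # the virama glues the next consonant into the same akshara
            if mark == VIRAMA and i < len(s) and unicodedata.category(s[i]) == "Lo":
                cluster += s[i]
                i += 1
        out.append(cluster)
    return out

# Devanagari "namaste": the conjunct sa+virama+ta+e becomes ONE akshara,
# whereas current extended grapheme clusters split it after the virama.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
```

Here aksharas(word) yields three units, |na|ma|sa-virama-ta-e|, where current extended grapheme clusters yield four; with whole-akshara cursor movement, mistyping the 'sa' would force retyping all four code points of the final unit, which is exactly the objection raised above.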
From unicode at unicode.org Fri Dec 22 19:39:40 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sat, 23 Dec 2017 01:39:40 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83tvwilqc8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> <83tvwilqc8.fsf@gnu.org> Message-ID: <20171223013940.7530f89b@JRWUBU2> On Fri, 22 Dec 2017 17:44:39 +0200 Eli Zaretskii via Unicode wrote: > You can always delete a codepoint at a given position in Emacs, > specifying the position by its number, but there are no user-level > commands to conveniently allow doing that in the middle of a grapheme > cluster. > > It was never requested nor deemed necessary to provide such a > capability. Kenichi Handa provided such a capability for Emacs, but it has not been accepted for the main line. I am using a version of his code which he kindly provided for my own editing. The discussion is available at https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20140 . The figure attached to it shows stepping through a 7-character (6 graphic and one 'invisible' stacker) akshara. > Normally, replacing some portions of a grapheme cluster > produces a radically different display, so it makes more sense to > delete everything and start anew. Non sequitur. > Deleting individual codepoints by > Backspace is useful for accents and diacritics, which generally are > input after the base characters, so that is provided. Don't forget that a cluster can be a large constellation of characters. Richard. 
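[Editor's note: returning to the regional-indicator thread earlier in this digest, the corrected rule (GB12/GB13 in UAX #29, pairing RI symbols instead of coalescing arbitrary runs) can be shown with a minimal sketch. This illustrates the pairing rule alone, not full UAX #29 segmentation.]

```python
def group_regional_indicators(s):
    """Group Regional Indicator symbols (U+1F1E6..U+1F1FF) into pairs,
    per UAX #29 rules GB12/GB13: an RI joins the preceding cluster only
    if that cluster is a single unpaired RI, so four RIs form two
    clusters, not one."""
    out = []
    pending = ""  # an unpaired RI waiting for its partner
    for ch in s:
        if 0x1F1E6 <= ord(ch) <= 0x1F1FF:
            if pending:
                out.append(pending + ch)
                pending = ""
            else:
                pending = ch
        else:
            if pending:
                out.append(pending)
                pending = ""
            out.append(ch)
    if pending:
        out.append(pending)
    return out

# David Kendal's example: U+1F1EC U+1F1E7 U+1F1EA U+1F1FA
flags = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
# -> two clusters (two flags), not one
```

Under the uncorrected grammar (Regional_Indicator+), the same four code points would coalesce into a single cluster.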
From unicode at unicode.org Wed Dec 27 15:31:19 2017
From: unicode at unicode.org (Karl Williamson via Unicode)
Date: Wed, 27 Dec 2017 14:31:19 -0700
Subject: Traditional and Simplified Han in UTS 39
Message-ID: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>

In UTS 39, it says that, optionally,

"Mark Chinese strings as 'mixed script' if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].

"The criterion can only be applied if the language of the string is known to be Chinese."

What does it mean for the language to "be known to be Chinese"? Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?

The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.

And what does Chinese really mean here?

From unicode at unicode.org Wed Dec 27 16:20:10 2017
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Thu, 28 Dec 2017 06:20:10 +0800
Subject: Traditional and Simplified Han in UTS 39
In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
Message-ID:

On 28 Dec 2017 at 05:34, "Karl Williamson via Unicode" wrote:
>
> In UTS 39, it says that, optionally,
>
> "Mark Chinese strings as 'mixed script' if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD].
>
> "The criterion can only be applied if the language of the string is known to be Chinese."
>
> What does it mean for the language to "be known to be Chinese"?

As in, the string is written in the Chinese language: not in Japanese, not in old Korean or Vietnamese text that uses Chinese characters, nor in any other language that uses Chinese characters.
To my knowledge, some Chinese dialects/variants also use both simplified and traditional characters together, with different etymologies; that probably shouldn't be considered mixed script either, although such usage isn't really common and isn't mentioned in the UTS.

> Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD?
>
> The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese. So in this example we can algorithmically rule out that it's Chinese.

Usually when there are Japanese kana in the mix, the text is Japanese rather than Chinese. However, the reverse is not necessarily true, especially for a single word or short phrase, older-style text, and the like, where a string consisting only of Chinese characters can still be Japanese text.

> And what does Chinese really mean here?

The written form of the (Mandarin) Chinese language?

From unicode at unicode.org Wed Dec 27 16:39:21 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 27 Dec 2017 23:39:21 +0100
Subject: Traditional and Simplified Han in UTS 39
In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com>
Message-ID:

I bet it means the difference in terms of scripts, not in terms of languages. So it says to use "Hani" instead of "Hans" or "Hant" if the character forms cannot be determined, and this will apply equally if the language is Chinese/Mandarin, Cantonese/Yue, Taiwanese, Wu, or even Japanese. For the Japanese language there's an additional mixed-script code "Jpan" for when it uses a mix of sinograms, Katakana and Hiragana.
For the Chinese languages there should be a script code for sinograms+Bopomofo (Bopomofo is rarely used alone, but most often with Traditional sinograms; it occurs sometimes with Simplified sinograms as well). 2017-12-27 22:31 GMT+01:00 Karl Williamson via Unicode : > In UTS 39, it says that, optionally, > > "Mark Chinese strings as 'mixed script' if they contain both simplified > (S) and traditional (T) Chinese characters, using the Unihan data in the > Unicode Character Database [UCD]. > > "The criterion can only be applied if the language of the string is known > to be Chinese." > > What does it mean for the language to "be known to be Chinese"? Is this > something algorithmically determinable, or does it come from information > about the input text that comes from outside the UCD? > > The example given shows some Hiragana in the text. That clearly indicates > the language isn't Chinese. So in this example we can algorithmically rule > out that it's Chinese. > > And what does Chinese really mean here? > > From unicode at unicode.org Wed Dec 27 23:24:52 2017 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Wed, 27 Dec 2017 21:24:52 -0800 Subject: Traditional and Simplified Han in UTS 39 In-Reply-To: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com> References: <5c85cffc-0648-3d51-aa2c-5ade05da4810@khwilliamson.com> Message-ID: <1c9f4bf5-7589-a5a6-ddd2-dd4de4e5d0a0@ix.netcom.com> The full excerpt from the UTS reads: > Mark Chinese strings as 'mixed script' if they contain both simplified > (S) and traditional (T) Chinese characters, using the Unihan data in > the Unicode Character Database [UCD]. > > 1. The criterion can only be applied if the language of the string is > known to be Chinese. So, for example, the string ????????? ? > is Japanese, and should not be marked as mixed script because of a > mixture of S and T characters. > 2.
Testing for whether a character is S or T needs to be based not on > whether the character /has/ an S or T variant, but whether the > character /is/ an S or T variant. > There are several issues with this. First and foremost, the definition of S and T variants is not something that is universally agreed upon. The .cn, .hk or .tw registries are using a definition of S and T variants that does not agree with the Unihan data in many particulars. Therefore, using the Unihan data would result in false positives. (And false negatives.) Second, there are many characters that are variants acceptable under both "S" and "T" labels. You only have to look at the published Label Generation Rulesets (or IDN tables) for these domains to see many examples. And, as mentioned above, you cannot reverse-engineer these tables from Unihan data. Third, the same domains mentioned have a policy of delegating up to three labels to the same applicant: a "traditional" label, a "simplified" label, and a mixed label matching the spelling of the label in the original application (for situations where a mixed label is appropriate). In other words, certain mixed labels are seen as appropriate. Fourth, the Chinese ccTLDs all have a robust policy of preventing any other mixed label that is a variant of the three from being allocated to an unrelated party. If you "know" that the language has to be Chinese, because the domain is a ccTLD, then at the same time the check is superfluous. Other registries are not known to have similar policies, so for them additional spoof detection may be useful --- however, it is precisely those cases where it's impossible to know whether a label is intended to be in the Chinese language. Fifth, generally the only thing that can be ascertained is that a label is *not* in Chinese: by virtue of having Kana or Hangul characters mixed in. However, the reverse is not true. You will find labels registered under .jp that do not contain Hiragana or Katakana.
Sixth, for zones that are shared by different CJK languages, the state of the art is to have a coordinated policy that prevents "random" variant labels from coexisting in the registry. An example of this kind of effort is being developed for the root zone. By definition, for the root zone, there is no implied information about the language context, unlike the case for the second level, where the presence of a ccTLD in the full domain name may give a clue. Seventh, attempting to determine whether a label is potentially valid based on variant data (of any kind) is doomed, because actual usage is not limited to "pure" labels. The variant mechanism works differently (in those registries that apply it): instead of looking at a single label, the registry can implement "mutual exclusion". Once one variant label from a given set has been delegated, all others are excluded (or in practice, all but three, which are limited to the same applicant). Without access to the registry data, you cannot predict which variants in a set are the "good ones", and with access to the data, spoof labels are rejected and cannot be registered. In conclusion, my recommendation would be to retract this particular passage. A./ On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote: > In UTS 39, it says that, optionally, > > "Mark Chinese strings as 'mixed script' if they contain both > simplified (S) and traditional (T) Chinese characters, using the > Unihan data in the Unicode Character Database [UCD]. > > "The criterion can only be applied if the language of the string is > known to be Chinese." > > What does it mean for the language to "be known to be Chinese"? Is > this something algorithmically determinable, or does it come from > information about the input text that comes from outside the UCD? > > The example given shows some Hiragana in the text. That clearly > indicates the language isn't Chinese.
So in this example we can > algorithmically rule out that it's Chinese. > > And what does Chinese really mean here? > > From unicode at unicode.org Fri Dec 29 19:08:00 2017 From: unicode at unicode.org (David Starner via Unicode) Date: Sat, 30 Dec 2017 01:08:00 +0000 Subject: Linearized tilde? Message-ID: https://en.wikipedia.org/wiki/African_reference_alphabet says "The 1982 revision of the alphabet was made by Michael Mann and David Dalby, who had attended the Niamey conference. It has 60 letters; some are quite different from the 1978 version." and offers the linearized tilde, a tilde squeezed into the space and location of the normal lowercase 'x' or 'o'. (See https://commons.wikimedia.org/wiki/File:Latin_letter_Linearized_tilde_(Mann-Dalby_form).svg ) The German WP article specifically says "Der Buchstabe ist in keine aktuelle Orthografie übernommen und ist auch nicht in Unicode enthalten (Stand 2013, Unicode Version 6.3)." ("The letter has not been adopted into any current orthography and is not included in Unicode (as of 2013, Unicode version 6.3).") Should it be? From unicode at unicode.org Fri Dec 29 20:54:19 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Sat, 30 Dec 2017 03:54:19 +0100 Subject: Linearized tilde? In-Reply-To: References: Message-ID: Isn't it a rounded variant of Latin letter n? Then it could exist also in uppercase form (like "n" and "N"). It could also be used as a spacing version of the combining tilde diacritic, to be written after the letter instead of being combined above it (so "el Niño" would be written with it as "el Nino" with a normal letter, without using the encoded tilde symbol as in the ugly "el Nin~o"; or capitalized as "EL NINO" instead of the ugly "EL NIN~O"). I don't think that "LINEARIZED TILDE" is the correct name.
I think it's better named LATIN TILDE LETTER, to be sorted between LATIN LETTER N and LATIN LETTER O (unlike the ASCII tilde symbol, which sorts with other symbols after spaces but before all digits and letters). 2017-12-30 2:08 GMT+01:00 David Starner via Unicode : > https://en.wikipedia.org/wiki/African_reference_alphabet says "The 1982 > revision of the alphabet was made by Michael Mann and David Dalby, who had > attended the Niamey conference. It has 60 letters; some are quite different > from the 1978 version." and offers the linearized tilde, a tilde squeezed > into the space and location of the normal lowercase 'x' or 'o'. (See > https://commons.wikimedia.org/wiki/File:Latin_letter_Linearized_tilde_(Mann-Dalby_form).svg ) > The German WP article > specifically says "Der Buchstabe ist in keine aktuelle Orthografie > übernommen und ist auch nicht > in Unicode enthalten (Stand 2013, > Unicode Version 6.3)." "The letter is not included in any current > spelling and is not included in Unicode." Should it be? > From unicode at unicode.org Sat Dec 30 12:54:44 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 30 Dec 2017 11:54:44 -0700 Subject: Linearized tilde? Message-ID: <343575ACDCA641938CD4C9EC6BA9C909@DougEwell> David Starner wrote: > "The letter is not included in any current spelling and is not > included in Unicode." Should it be? Did anyone ever use the 1982 alphabet, other than Mann and Dalby? If not, I wonder if this letter is a bit like the "proposed new punctuation marks" that show up in proposals from time to time, but have never been used except by their inventors and to talk about them. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Dec 30 12:59:36 2017 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Sat, 30 Dec 2017 11:59:36 -0700 Subject: Linearized tilde?
Message-ID: Philippe Verdy wrote: > Isn't it a rounded variant of Latin letter n? Then it could exist > also in uppercase form (like "n" and "N") A defining characteristic of the 1982 African Reference Alphabet was that it was lowercase-only. An uppercase form would be an invention with no basis in history or usage. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Sat Dec 30 13:02:41 2017 From: unicode at unicode.org (Michael Everson via Unicode) Date: Sat, 30 Dec 2017 19:02:41 +0000 Subject: Linearized tilde? In-Reply-To: References: Message-ID: <0DFBECBC-F43D-4F92-869D-678CFCCBDE19@evertype.com> On 30 Dec 2017, at 18:59, Doug Ewell via Unicode wrote: > A defining characteristic of the 1982 African Reference Alphabet was that it was lowercase-only. An uppercase form would be an invention with no basis in history or usage. Which is why it failed. Everybody who used anything like it or derived from it ended up devising capital letters. Doke's click letters are better candidates for encoding.
Michael Everson From unicode at unicode.org Sun Dec 31 12:09:59 2017 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Sun, 31 Dec 2017 18:09:59 +0000 Subject: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues In-Reply-To: <83tvwilqc8.fsf@gnu.org> References: <20171208220619.3eb2fcbe@JRWUBU2> <20171209203017.77dbcbf9@JRWUBU2> <20171214224023.23e5723f@JRWUBU2> <859cf994-4b57-b90d-7c87-a814576b6c4a@it.aoyama.ac.jp> <20171221214449.7567d0ac@JRWUBU2> <83mv2bmdd8.fsf@gnu.org> <20171222153635.67628752@JRWUBU2> <83tvwilqc8.fsf@gnu.org> Message-ID: <20171231180959.649681bb@JRWUBU2> On Fri, 22 Dec 2017 17:44:39 +0200 Eli Zaretskii via Unicode wrote: > > Date: Fri, 22 Dec 2017 15:36:35 +0000 > > From: Richard Wordingham via Unicode > > However, it seems > > that one has to modify the source code of Emacs to be able to edit > > in the middle of a cluster > You can always delete a codepoint at a given position in Emacs, > specifying the position by its number, but there are no user-level > commands to conveniently allow doing that in the middle of a grapheme > cluster. > It was never requested nor deemed necessary to provide such a > capability. Whilst not the nicest of mechanisms, it turns out that Emacs does have a 'standard' command auto-composition-mode which will toggle automatic clustering. If one disables automatic clustering, one can then step through the clusters character by character. This is the sort of thing Hariraama has been asking for on the Indic list, though he would like the capability for Microsoft Word. Richard. 
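The thread above is about stepping through text cluster by cluster rather than character by character. Python's standard library has no UAX #29 segmenter, but the basic idea of a "legacy" grapheme cluster (a base character plus any following combining marks) can be sketched with unicodedata. This is a deliberate simplification of what Emacs and ICU do: it ignores Hangul jamo, ZWJ emoji sequences, and, pointedly for this thread, it splits Indic aksharas at the virama.

```python
# Rough "legacy grapheme cluster" segmentation: group each base
# character with the combining marks that follow it. Not a UAX #29
# implementation; an illustration of why character-level and
# cluster-level editing differ.
import unicodedata

def legacy_clusters(text: str):
    cluster = ""
    for ch in text:
        # A character with combining class 0 starts a new cluster,
        # unless it is the very first character seen.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster
```

For example, "e" + U+0301 COMBINING ACUTE comes out as one cluster, while Devanagari क + virama + ष splits into two, which is exactly the kind of behavior the proposed expansion to whole aksharas is meant to change.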
From unicode at unicode.org Sun Dec 31 20:14:36 2017 From: unicode at unicode.org (Shriramana Sharma via Unicode) Date: Mon, 1 Jan 2018 07:44:36 +0530 Subject: Popular wordprocessors treating U+00A0 as fixed-width Message-ID: While http://unicode.org/reports/tr14/ clearly states that: When expanding or compressing interword space according to common typographical practice, only the spaces marked by U+0020 SPACE and U+00A0 NO-BREAK SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. ... it is really sad to see the misunderstanding around U+00A0: https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 https://bugs.documentfoundation.org/show_bug.cgi?id=41652 -- Shriramana Sharma ???????????? ???????????? ???????????????????????? From unicode at unicode.org Sun Dec 31 21:43:26 2017 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Mon, 1 Jan 2018 04:43:26 +0100 Subject: Popular wordprocessors treating U+00A0 as fixed-width In-Reply-To: References: Message-ID: Well, it's unfortunate that Microsoft's own response (by its MVP) is completely wrong, suggesting to use NARROW NO-BREAK SPACE to get justification, which is exactly the reverse: NNBSP should NOT be justified and should keep its width. Microsoft's developers have absolutely misunderstood the standard, where both SPACE and NBSP should really behave the same for justification (differing only in the existence of the break opportunity). This Microsoft response is completely stupid, and it even breaks the classic typography for French use of NNBSP ("fine" in French) around some punctuation (before :;!?» or after «)
and as group separators in numbers (note that NNBSP was introduced into Unicode very late; before that, NBSP was used only because it was the only non-breaking space available, but it was much too large!). Still, many documents use NBSP instead of NNBSP around punctuation or as group separators (but in Word these contextual occurrences of NBSP, which are easy to detect, could have been auto-replaced when typesetting, or proposed as a correction by the integrated spell checker, at least for French). But the old behavior of old versions of Office (before NNBSP existed in Unicode) should have been cleaned up long ago. It's clear that MS Office developers don't know the standards and do what they want (they also don't know the correct standards for maths in Excel and use a lot of very stupid assumptions, as if they were smarter than their users, who have long suffered from these bugs!) and don't want to fix their past errors. 2018-01-01 3:14 GMT+01:00 Shriramana Sharma via Unicode : > While http://unicode.org/reports/tr14/ clearly states that: > > > When expanding or compressing interword space according to common > typographical practice, only the spaces marked by U+0020 SPACE and > U+00A0 NO-BREAK SPACE are subject to compression, and only spaces > marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces > marked by U+2009 THIN SPACE are subject to expansion. All other space > characters normally have fixed width. > > > ... it is really sad to see the misunderstanding around U+00A0: > > https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_windows8-mso_2016/nonbreakable-space-justification-in-word-2016/4fa1ad30-004c-454f-9775-a3beaa91c88b?auth=1 > > https://bugs.documentfoundation.org/show_bug.cgi?id=41652 > > -- > Shriramana Sharma ???????????? ???????????? ????????????????????????
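The two points in this exchange can be sketched in code: the TR14 rule quoted by Shriramana about which spaces participate in justification, and the contextual NBSP-to-NNBSP replacement Verdy says would be easy to detect in French text. Both functions below are illustrations of those statements, not a description of what any word processor actually does; the French contexts covered (before :;!? and », after «, between digits) are the ones mentioned in the message.

```python
import re

NBSP, NNBSP = "\u00A0", "\u202F"

def justification_behavior(ch: str) -> str:
    """Classify a space per the TR14 passage quoted above."""
    if ch in ("\u0020", NBSP):        # SPACE and NO-BREAK SPACE justify
        return "compress/expand"
    if ch == "\u2009":                # THIN SPACE occasionally expands
        return "occasionally expand"
    return "fixed width"              # everything else, incl. U+202F NNBSP

def french_nbsp_to_nnbsp(text: str) -> str:
    """Replace contextual NBSPs in French text with NNBSP (a sketch)."""
    # NBSP immediately before : ; ! ? or a closing guillemet
    text = re.sub(NBSP + r"(?=[:;!?\u00BB])", NNBSP, text)
    # NBSP immediately after an opening guillemet
    text = re.sub(r"(?<=\u00AB)" + NBSP, NNBSP, text)
    # NBSP used as a digit-group separator, e.g. 1 000 000
    text = re.sub(r"(?<=\d)" + NBSP + r"(?=\d)", NNBSP, text)
    return text
```

The classifier makes the point of the whole thread explicit: NBSP is supposed to stretch under justification exactly like an ordinary space, while NNBSP is one of the fixed-width spaces.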