From unicode at unicode.org  Thu Aug  3 19:34:15 2017
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 3 Aug 2017 17:34:15 -0700
Subject: Feedback on the proposal to change U+FFFD generation when
 decoding ill-formed UTF-8
In-Reply-To: <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com>
References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com>
 <CAJ2xs_G=gT9owHYjZgzZvu2rg38Ch7yK8F7PMAvuEjp1cS0uYA@mail.gmail.com>
 <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp>
 <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com>
 <CAN49p6qhdeNX+FNE5Q8FWzQ7_SvUE3VdgafqyRTgHPQxy4MRaQ@mail.gmail.com>
 <ca71f97f-3158-f55e-423c-8f5c7a2ac7f9@ix.netcom.com>
 <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com>
 <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp>
 <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com>
Message-ID: <CAJ2xs_EwGfbHamXMhsmq+WGhKdKQxCQMOdEo2XUDi=3iq3Y4Dw@mail.gmail.com>

FYI, the UTC retracted the following.

*[151-C19 <http://www.unicode.org/cgi-bin/GetL2Ref.pl?151-C19>]
Consensus:* Modify
the section on "Best Practices for Using FFFD" in section "3.9 Encoding
Forms" of TUS per the recommendation in L2/17-168
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/17-168>, for Unicode
version 11.0.

Mark

(https://twitter.com/mark_e_davis)

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson via Unicode <
unicode at unicode.org> wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>>
>>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>>
>>>> Adding a "recommendation" this late in the game is just bad standards
>>>> policy.
>>>
>>> Unless I misunderstand, you are missing the point.  There is already a
>>> recommendation listed in TUS,
>>
>> That's indeed correct.
>>
>>> and that recommendation appears to have
>>> been added without much thought.
>>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among others, in major programming languages and browsers) without
>> problems for quite some time.
>>
>>
>
> Could you supply a reference to the PRI and its feedback?
>
> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>
> And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>
> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>
> But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>
> It appears to me that little thought was given to the fact that these
> changes would cause overlongs to now be at least two units instead of one,
> making long-existing code no longer be best practice.  You are effectively
> saying I'm wrong about this.  I thought I had been paying attention to
> PRIs since the 5.x series, and I don't remember anything about this.  If
> you have evidence to the contrary, please give it. However, I would have
> thought Markus would have dug any up and given it in his proposal.
>
>>> There is no proposal to add a
>>> recommendation "this late in the game".
>>
>> True. The proposal isn't for an addition, it's for a change. The "late in
>> the game" however, still applies.
>>
>> Regards,   Martin.
>>
>>
>
>
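
For illustration, the "one U+FFFD per maximal subpart" practice debated in
the quoted text can be sketched in Ruby. This is only a sketch under stated
assumptions, not code from any implementation mentioned in this thread: the
helper names trail_ranges and decode_with_replacement are made up for the
example, and the byte ranges follow the well-formed UTF-8 table in section
3.9 of TUS. Under this reading an overlong such as C0 80 produces two U+FFFD,
which is the outcome Karl contrasts with his "single unit" view.

```ruby
# Allowed ranges for the trailing bytes of a well-formed UTF-8 sequence,
# keyed by lead byte. Returns nil for bytes that can never start a sequence
# (80..C1 and F5..FF).
def trail_ranges(lead)
  cont = 0x80..0xBF
  case lead
  when 0xC2..0xDF             then [cont]
  when 0xE0                   then [0xA0..0xBF, cont]   # excludes overlongs
  when 0xE1..0xEC, 0xEE..0xEF then [cont, cont]
  when 0xED                   then [0x80..0x9F, cont]   # excludes surrogates
  when 0xF0                   then [0x90..0xBF, cont, cont]
  when 0xF1..0xF3             then [cont, cont, cont]
  when 0xF4                   then [0x80..0x8F, cont, cont]
  end
end

# Decode an array of byte values into an array of code points, emitting
# one U+FFFD per maximal subpart of each ill-formed subsequence.
def decode_with_replacement(bytes)
  out = []
  i = 0
  while i < bytes.length
    lead = bytes[i]
    if lead < 0x80                      # ASCII
      out << lead
      i += 1
      next
    end
    ranges = trail_ranges(lead)
    if ranges.nil?                      # byte cannot start a sequence
      out << 0xFFFD
      i += 1
      next
    end
    # Count how many following bytes are acceptable continuations.
    n = 0
    n += 1 while n < ranges.length &&
                 i + 1 + n < bytes.length &&
                 ranges[n].cover?(bytes[i + 1 + n])
    if n == ranges.length               # complete, well-formed sequence
      out << bytes[i, n + 1].pack("C*").force_encoding("UTF-8").ord
    else                                # truncated: one U+FFFD for the prefix
      out << 0xFFFD
    end
    i += n + 1
  end
  out
end

decode_with_replacement([0xC0, 0x80])             # => [0xFFFD, 0xFFFD]
decode_with_replacement([0xE2, 0x82])             # => [0xFFFD]
decode_with_replacement([0x41, 0xE2, 0x82, 0xAC]) # => [0x41, 0x20AC]
```

A decoder that treated the overlong C0 80 as one unit would instead emit a
single U+FFFD for the first input; that difference is what the thread is about.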

From unicode at unicode.org  Fri Aug  4 08:26:27 2017
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Fri, 4 Aug 2017 16:26:27 +0300
Subject: Feedback on the proposal to change U+FFFD generation when
 decoding ill-formed UTF-8
In-Reply-To: <CAJ2xs_EwGfbHamXMhsmq+WGhKdKQxCQMOdEo2XUDi=3iq3Y4Dw@mail.gmail.com>
References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com>
 <CAJ2xs_G=gT9owHYjZgzZvu2rg38Ch7yK8F7PMAvuEjp1cS0uYA@mail.gmail.com>
 <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp>
 <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com>
 <CAN49p6qhdeNX+FNE5Q8FWzQ7_SvUE3VdgafqyRTgHPQxy4MRaQ@mail.gmail.com>
 <ca71f97f-3158-f55e-423c-8f5c7a2ac7f9@ix.netcom.com>
 <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com>
 <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp>
 <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com>
 <CAJ2xs_EwGfbHamXMhsmq+WGhKdKQxCQMOdEo2XUDi=3iq3Y4Dw@mail.gmail.com>
Message-ID: <CAJQvAuc1VTHUh_z4y2YEY0ZhUBQi=TNNLg0Vsm0c9qqdZfSBBg@mail.gmail.com>

On Fri, Aug 4, 2017 at 3:34 AM, Mark Davis ☕️ via Unicode
<unicode at unicode.org> wrote:
> FYI, the UTC retracted the following.
>
> [151-C19] Consensus: Modify the section on "Best Practices for Using FFFD"
> in section "3.9 Encoding Forms" of TUS per the recommendation in L2/17-168,
> for Unicode version 11.0.

Thank you!

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From unicode at unicode.org  Thu Aug  3 23:58:06 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Fri, 4 Aug 2017 13:58:06 +0900
Subject: Feedback on the proposal to change U+FFFD generation when
 decoding ill-formed UTF-8
In-Reply-To: <CAJ2xs_EwGfbHamXMhsmq+WGhKdKQxCQMOdEo2XUDi=3iq3Y4Dw@mail.gmail.com>
References: <20170517134156.665a7a7059d7ee80bb4d670165c8327d.0fd4eb2a77.wbe@email03.godaddy.com>
 <CAJ2xs_G=gT9owHYjZgzZvu2rg38Ch7yK8F7PMAvuEjp1cS0uYA@mail.gmail.com>
 <1c37163e-08a2-a46f-c46c-223b5ea67bd1@it.aoyama.ac.jp>
 <79d4a0dc-db16-3105-98fe-8c0f83938fc2@ix.netcom.com>
 <CAN49p6qhdeNX+FNE5Q8FWzQ7_SvUE3VdgafqyRTgHPQxy4MRaQ@mail.gmail.com>
 <ca71f97f-3158-f55e-423c-8f5c7a2ac7f9@ix.netcom.com>
 <2db88940-6eae-eb74-b70f-5d2772f3738d@khwilliamson.com>
 <0fc11646-7af3-5945-2faf-3cf6d47b5189@it.aoyama.ac.jp>
 <076d2114-dd6c-b45b-4159-07469451967f@khwilliamson.com>
 <CAJ2xs_EwGfbHamXMhsmq+WGhKdKQxCQMOdEo2XUDi=3iq3Y4Dw@mail.gmail.com>
Message-ID: <a3d63520-a607-0261-5e50-da649f2fa9d2@it.aoyama.ac.jp>

Hello Mark,

On 2017/08/04 09:34, Mark Davis ☕️ wrote:
> FYI, the UTC retracted the following.

Thanks for letting us know!

Regards,   Martin.

> *[151-C19 <http://www.unicode.org/cgi-bin/GetL2Ref.pl?151-C19>]
> Consensus:* Modify
> the section on "Best Practices for Using FFFD" in section "3.9 Encoding
> Forms" of TUS per the recommendation in L2/17-168
> <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/17-168>, for Unicode
> version 11.0.
> 
> Mark

From unicode at unicode.org  Mon Aug  7 01:53:50 2017
From: unicode at unicode.org (=?UTF-8?Q?Martin_J._D=c3=bcrst?= via Unicode)
Date: Mon, 7 Aug 2017 15:53:50 +0900
Subject: Inadvertent copies of test data in L2/17-197 ?
Message-ID: <b0e2ef01-d97e-a27d-e185-ac89d14ae33c@it.aoyama.ac.jp>

Hello Henri,

I just had a look at 
http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf to use the test 
data in there for Ruby.

I was under the impression from previous looks at it that it contained a 
lot of test data. However, when I looked at the test data more carefully 
(I had read the text before the test data carefully at least two times 
before, but not looked at the test data in that much detail), I 
discovered that there might be up to 7 copies of the same data. The 
first one starts on page 9, and then there's a new one about every 4 or 
5 pages.

Can you check/confirm? Any idea what might have caused this?

Regards,   Martin.

From unicode at unicode.org  Mon Aug  7 02:36:00 2017
From: unicode at unicode.org (Henri Sivonen via Unicode)
Date: Mon, 7 Aug 2017 10:36:00 +0300
Subject: Inadvertent copies of test data in L2/17-197 ?
In-Reply-To: <b0e2ef01-d97e-a27d-e185-ac89d14ae33c@it.aoyama.ac.jp>
References: <b0e2ef01-d97e-a27d-e185-ac89d14ae33c@it.aoyama.ac.jp>
Message-ID: <CAJQvAucvECe1tOGkovmC5Ko4es+6V4YEbJxOL3FFBS1-fsTH6A@mail.gmail.com>

On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.

It contains the test outputs with identical results (output exhibiting
the spec-following behavior and output exhibiting the one REPLACEMENT
CHARACTER per bogus byte behavior) shown only once. Since the input
doesn't make sense as a PDF, it only mentions where to find the input
(https://hsivonen.fi/broken-utf-8/test.html).

> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
>
> Can you check/confirm? Any idea what might have caused this?

The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each:
https://hsivonen.fi/broken-utf-8/spec.html
https://hsivonen.fi/broken-utf-8/one-per-byte.html
https://hsivonen.fi/broken-utf-8/win32.html
https://hsivonen.fi/broken-utf-8/java.html
https://hsivonen.fi/broken-utf-8/python2.html with non-conforming
output replaced with italic text saying what the bytes were
https://hsivonen.fi/broken-utf-8/perl5.html
https://hsivonen.fi/broken-utf-8/icu.html

I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no
duplicates).

For reference, I tested the Ruby standard library with the following program:

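# Read the test input, labeling its bytes as UTF-8 without validating them.
data = IO.read("test.html", encoding: "UTF-8")
# Round-trip through UTF-16LE so that invalid UTF-8 byte sequences are
# replaced (with U+FFFD, since the target is a Unicode encoding).
encoded = data.encode("UTF-16LE", :invalid=>:replace).encode("UTF-8")
IO.write("ruby.html", encoded)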

...where test.html was the file available at
https://hsivonen.fi/broken-utf-8/test.html

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


From unicode at unicode.org  Thu Aug 10 14:09:21 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 10 Aug 2017 12:09:21 -0700
Subject: Root Zone Label Generation Rules extended to 6 scripts
In-Reply-To: <2a4e66b3c35a4b03bf63a5090857eaf3@PMBX112-W1-CA-1.PEXCH112.ICANN.ORG>
References: <2a4e66b3c35a4b03bf63a5090857eaf3@PMBX112-W1-CA-1.PEXCH112.ICANN.ORG>
Message-ID: <5d474c74-d876-9e9d-9085-58c260524297@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170810/8ffaada1/attachment.html>

From unicode at unicode.org  Wed Aug 16 08:26:17 2017
From: unicode at unicode.org (Mike FABIAN via Unicode)
Date: Wed, 16 Aug 2017 15:26:17 +0200
Subject: Should U+3248 ... U+324F be wide characters?
Message-ID: <s9do9rfskxy.fsf@redhat.com>

EastAsianWidth.txt contains:

3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE

i.e. it classifies the width of the characters at codepoints
between 3248 and 324F as ambiguous.

Is this really correct? Shouldn’t they be “W”, i.e. wide?

In most fonts these characters seem to be square shaped wide characters.
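
For readers who want to check such entries programmatically, a data line of
EastAsianWidth.txt like the one quoted above can be split into its code point
range and property value. The following Ruby is a minimal sketch; the
function name parse_eaw_line is made up for this example and is not part of
any library.

```ruby
# Parse one line of EastAsianWidth.txt into [code_point_range, property].
# Returns nil for blank lines and comment-only lines.
def parse_eaw_line(line)
  data = line.split("#", 2).first.to_s.strip    # drop the trailing comment
  return nil if data.empty?
  range, prop = data.split(";").map(&:strip)
  first, last = range.split("..").map { |hex| hex.to_i(16) }
  [(first..(last || first)), prop]              # single code points have no ".."
end

line = "3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE"
range, prop = parse_eaw_line(line)
puts format("U+%04X..U+%04X => %s", range.first, range.last, prop)
# prints "U+3248..U+324F => A"
```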

-- 
Mike FABIAN <mfabian at redhat.com>


From unicode at unicode.org  Wed Aug 16 09:04:11 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 16 Aug 2017 07:04:11 -0700
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <s9do9rfskxy.fsf@redhat.com>
References: <s9do9rfskxy.fsf@redhat.com>
Message-ID: <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170816/3721d239/attachment.html>

From unicode at unicode.org  Wed Aug 16 09:23:02 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Wed, 16 Aug 2017 16:23:02 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
Message-ID: <CAGa7JC0R=pkCdQ5t3MpSk_cDwww+Wf4ujyiWMj=ZKJKrqOJ1kQ@mail.gmail.com>

I do agree: only CJK fonts used in CJK contexts will render them as "W"
(i.e. in the fixed-width standard ideographic composition square). If they
are used in Latin text, they will adopt the metrics of the Latin font that
includes them. They will be square, but not necessarily aligned with the
ideographic square; they could instead be aligned so that their internal
digits sit on the same baseline as normal digits. The square will then span
the Latin descender and ascender height, and the width will be adapted to
match it (digits may need to be compacted horizontally to fit the square with
those metrics, but will preserve their baseline alignment). If the normal
Latin digits don't have descenders, the squared variants may not include the
full height used by Latin letters with descenders.

These squared (or circled) characters, however, do not have registered
variants for digits with descenders (used in traditional typographic fonts
for Latin), such as 4, 7 or 9, or for variable-width digits (not using the
more modern digits with "figure-space" fixed width). I think the latter would
not require such a variant, given that it is the width of the enclosing
square (or circle) that matters, and the digits will be adjusted in width
and interdigit gaps as needed to fit.

2017-08-16 16:04 GMT+02:00 Asmus Freytag via Unicode <unicode at unicode.org>:

> On 8/16/2017 6:26 AM, Mike FABIAN via Unicode wrote:
>
> EastAsianWidth.txt contains:
>
> 3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE
>
> i.e. it classifies the width of the characters at codepoints
> between 3248 and 324F as ambiguous.
>
> Is this really correct? Shouldn’t they be “W”, i.e. wide?
>
> In most fonts these characters seem to be square shaped wide characters.
>
>
> "W" not only implies display width, but also a different treatment in the
> context of line breaking and vertical layout of text.
>
> "W" characters behave more like Ideographs, for the most part, while "N"
> are treated as forming words (for the most part).
>
> "A" means, you get to decide whether to treat these as "W" or "N" based on
> context. If used in a non ideographic context, they behave like all other
> symbols (but happen to fill an EM square).
>
> A./
>

From unicode at unicode.org  Wed Aug 16 14:04:51 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 16 Aug 2017 12:04:51 -0700
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <CAGa7JC0R=pkCdQ5t3MpSk_cDwww+Wf4ujyiWMj=ZKJKrqOJ1kQ@mail.gmail.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <CAGa7JC0R=pkCdQ5t3MpSk_cDwww+Wf4ujyiWMj=ZKJKrqOJ1kQ@mail.gmail.com>
Message-ID: <59267397-05ca-fd1b-3e28-9159d571ca95@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170816/1a586eae/attachment.html>

From unicode at unicode.org  Thu Aug 17 05:52:22 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 17 Aug 2017 16:22:22 +0530
Subject: Version linking?
Message-ID: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>

A propos http://blog.unicode.org/2017/08/unicode-emoji-60-initial-drafts-draft.html
I would like to know whether it is intended that Emoji version N will
be always targeted at Unicode version N + 5 and published in year N +
2012.

I did not find the question or answer at
http://unicode.org/faq/emoji_dingbats.html, hence asking here. I hope
I didn't miss something.

-- 
Shriramana Sharma


From unicode at unicode.org  Thu Aug 17 06:28:40 2017
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 17 Aug 2017 13:28:40 +0200
Subject: Version linking?
In-Reply-To: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
Message-ID: <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>

Emoji versions are (currently) on a somewhat faster schedule than Unicode:
U10.0 -> E5.0, E6.0 (TBD)
U09.0 -> E3.0, E4.0
Intermediate versions can't add any new characters, but can add sequences
and properties, including "emojification" of existing characters.

{phone}

On Aug 17, 2017 03:57, "Shriramana Sharma via Unicode" <unicode at unicode.org>
wrote:

A propos http://blog.unicode.org/2017/08/unicode-emoji-60-initial-drafts-draft.html
I would like to know whether it is intended that Emoji version N will
be always targeted at Unicode version N + 5 and published in year N +
2012.

I did not find the question or answer at
http://unicode.org/faq/emoji_dingbats.html, hence asking here. I hope
I didn't miss something.

--
Shriramana Sharma

From unicode at unicode.org  Thu Aug 17 08:04:56 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 17 Aug 2017 18:34:56 +0530
Subject: Version linking?
In-Reply-To: <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
Message-ID: <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>

Thanks for your reply, but how can characters be used portably if they
are not part of the published standard yet? Or is it that hereafter
both Unicode Standard + Unicode Emoji Standard will be parallelly
portable or something like that?

-- 
Shriramana Sharma


From unicode at unicode.org  Thu Aug 17 08:45:38 2017
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 17 Aug 2017 15:45:38 +0200
Subject: Version linking?
In-Reply-To: <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
Message-ID: <CAJ2xs_EBGSbWJgGeMkbFoaz+N+AqJpoVCC-z_ca2UAxsbx9+6g@mail.gmail.com>

> Intermediate versions can't add any new characters, but can add sequences
> and properties, including "emojification" of existing characters.

E.g. E4.0 didn't reference any characters from U10.0. It did recognize
*sequences* of existing U9.0 characters.
E5.0 did have the emoji properties of some 10.0 characters a bit ahead of
time, but only after they were completely locked down.
{phone}

Mark

(https://twitter.com/mark_e_davis)

On Thu, Aug 17, 2017 at 3:04 PM, Shriramana Sharma <samjnaa at gmail.com>
wrote:

> Thanks for your reply, but how can characters be used portably if they
> are not part of the published standard yet? Or is it that hereafter
> both Unicode Standard + Unicode Emoji Standard will be parallelly
> portable or something like that?
>
> --
> Shriramana Sharma
>

From unicode at unicode.org  Thu Aug 17 09:24:08 2017
From: unicode at unicode.org (Mike FABIAN via Unicode)
Date: Thu, 17 Aug 2017 16:24:08 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com> (Asmus
 Freytag via Unicode's message of "Wed, 16 Aug 2017 07:04:11 -0700")
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
Message-ID: <s9dh8x6b7cn.fsf@redhat.com>

Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 8/16/2017 6:26 AM, Mike FABIAN via Unicode wrote:
>
>     EastAsianWidth.txt contains:
>     
>     3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE
>     
>     i.e. it classifies the width of the characters at codepoints
>     between 3248 and 324F as ambiguous.
>     
>     Is this really correct? Shouldn’t they be “W”, i.e. wide?
>     
>     In most fonts these characters seem to be square shaped wide characters.
>
> "W" not only implies display width, but also a different treatment in the context of line
> breaking and vertical layout of text.
>
> "W" characters behave more like Ideographs, for the most part, while "N" are treated as
> forming words (for the most part).

Most emoji now have "W", for example:

1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS

That seems correct because emoji behave more like Ideographs.

Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
This seems to me also more like an Ideograph.

> "A" means, you get to decide whether to treat these as "W" or "N" based on context. If
> used in a non ideographic context, they behave like all other symbols (but happen to fill
> an EM square).
>
> A./
>

-- 
Mike FABIAN <mfabian at redhat.com>

From unicode at unicode.org  Thu Aug 17 09:47:37 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 17 Aug 2017 16:47:37 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <s9dh8x6b7cn.fsf@redhat.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
Message-ID: <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>

2017-08-17 16:24 GMT+02:00 Mike FABIAN via Unicode <unicode at unicode.org>:

> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
> Most emoji now have "W", for example:
>
> 1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>
> That seems correct because emoji behave more like Ideographs.
>
> Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
> This seems to me also more like an Ideograph.


Not really. They have existed for a very long time without being bound to
ideographs or to sinographic requirements on metrics. Notably, their baseline
and vertical extent do not follow the sinographic em-square layout
convention (except when they are rendered with CJK fonts, or were encoded
in documents with legacy CJK encodings, which are also rendered with
suitable CJK fonts in preference to Latin fonts that do not use the large
sinographic metrics).

If they were like emoji, they would actually be larger. I think there is a
case for defining an emoji variant for them (where they could also be
colored or have some 3D-like look).

From unicode at unicode.org  Thu Aug 17 11:46:49 2017
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Thu, 17 Aug 2017 09:46:49 -0700
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
Message-ID: <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>

On 8/17/2017 7:47 AM, Philippe Verdy wrote:
>
>
> 2017-08-17 16:24 GMT+02:00 Mike FABIAN via Unicode <unicode at unicode.org>:
>
>     Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>     Most emoji now have "W", for example:
>
>     1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>
>     That seems correct because emoji behave more like Ideographs.
>
>     Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
>     This seems to me also more like an Ideograph.
>
> Not really. They have existed for a very long time without being bound
> to ideographs or to sinographic requirements on metrics. Notably, their
> baseline and vertical extent do not follow the sinographic
> em-square layout convention (except when they are rendered with CJK
> fonts, or were encoded in documents with legacy CJK encodings, which
> are also rendered with suitable CJK fonts in preference to Latin fonts
> that do not use the large sinographic metrics).
>
> If they were like emoji, they would actually be larger. I think there
> is a case for defining an emoji variant for them (where they could
> also be colored or have some 3D-like look).

There's an emoji variant for the standard digits.

A./



From unicode at unicode.org  Thu Aug 17 11:50:39 2017
From: unicode at unicode.org (Asmus Freytag (c) via Unicode)
Date: Thu, 17 Aug 2017 09:50:39 -0700
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <s9dh8x6b7cn.fsf@redhat.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
Message-ID: <87c1d1b0-1c6d-1d4c-5683-2f457104fb38@ix.netcom.com>

On 8/17/2017 7:24 AM, Mike FABIAN wrote:
> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
>> On 8/16/2017 6:26 AM, Mike FABIAN via Unicode wrote:
>>
>>      EastAsianWidth.txt contains:
>>      
>>      3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE
>>      
>>      i.e. it classifies the width of the characters at codepoints
>>      between 3248 and 324F as ambiguous.
>>      
>>      Is this really correct? Shouldn’t they be “W”, i.e. wide?
>>      
>>      In most fonts these characters seem to be square shaped wide characters.
>>
>> "W" not only implies display width, but also a different treatment in the context of line
>> breaking and vertical layout of text.
>>
>> "W" characters behave more like Ideographs, for the most part, while "N" are treated as
>> forming words (for the most part).
> Most emoji now have "W", for example:
>
> 1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>
> That seems correct because emoji behave more like Ideographs.
>
> Isn’t this the same for “CIRCLED NUMBER TEN ON BLACK SQUARE”?
> This seems to me also more like an Ideograph.
>
>> "A" means, you get to decide whether to treat these as "W" or "N" based on context. If
>> used in a non ideographic context, they behave like all other symbols (but happen to fill
>> an EM square).

"A" means, you get to decide whether to treat these as "W" or "N" based on context.

There's really no strong need to change an "A" to "W", because "A" doesn't get in your way if you decide that "W" works better for you.

Remember that all the EAW properties are supposed to be "resolved" down to W or N. For some, like Na, that resolution is deterministic; for A it is context/application dependent. But when you finally process your data, only W(ide) or N(arrow) remain after resolution.
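
As a rough sketch of that resolution step in Ruby: the function name
resolve_eaw and the east_asian_context flag are made up for this example,
and the treatment of F, H and N follows my reading of UAX #11 rather than
anything stated in this message.

```ruby
# Resolve an East_Asian_Width property value to :wide or :narrow.
# `east_asian_context` is a hypothetical, application-supplied flag; only
# the resolution of "A" (ambiguous) depends on it.
def resolve_eaw(prop, east_asian_context: false)
  case prop
  when "W", "F"       then :wide     # wide and fullwidth
  when "Na", "H", "N" then :narrow   # narrow, halfwidth, neutral
  when "A"            then east_asian_context ? :wide : :narrow
  else raise ArgumentError, "unknown East_Asian_Width value: #{prop}"
  end
end

resolve_eaw("A")                            # => :narrow
resolve_eaw("A", east_asian_context: true)  # => :wide
resolve_eaw("Na", east_asian_context: true) # => :narrow
```

The point of the sketch is that an application preferring wide treatment for
U+3248..U+324F can get it by choosing the context, without any change to the
data file.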

A./



From unicode at unicode.org  Thu Aug 17 15:37:27 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 17 Aug 2017 21:37:27 +0100
Subject: Version linking?
In-Reply-To: <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
Message-ID: <20170817213727.0263c2d8@JRWUBU2>

On Thu, 17 Aug 2017 18:34:56 +0530
Shriramana Sharma via Unicode <unicode at unicode.org> wrote:

> Thanks for your reply, but how can characters be used portably if they
> are not part of the published standard yet? Or is it that hereafter
> both Unicode Standard + Unicode Emoji Standard will be parallelly
> portable or something like that?

A hypothetical application could legitimately claim to render correctly
every sequence (up to some reasonable length limit) of assigned Unicode
characters from a recent version (e.g. 10.0) while completely ignoring
the Unicode Emoji Standard.

That doesn't mean a great deal though, as Unicode appears not to be a
standard for the encoding of text strings, but merely for the
encoding of characters.

Thus, at the level of indisputable text, in Indic scripts there appears
to be no provision for the ordering of multiple left matras that are
to be stored in logical order (i.e. backing order) after the onset
consonants. (Thus, this is not a problem for the Thai script.)
Fortunately, there is no good evidence that the occurrence of multiple
distinct left matras is anything but a typing error, though I can easily
see how it might be used as a lexicographical convention on the fuzzy
edge of plain text.

In a similar vein, in Malayalam, we get repeats of the 2-part vowel
U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html),
but I'm not sure what the legitimate encodings of the example word
കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
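
To make the "2-part vowel" concrete, the following Ruby sketch builds the
sequence as typed above and prints its canonical decomposition (NFD), in
which each U+0D4B splits into U+0D47 and U+0D3E. It uses only Ruby's
standard String#unicode_normalize, and it does not settle which encodings of
the word are legitimate; it merely shows one canonically equivalent form.

```ruby
# The example word as typed in the message: KA followed by three VOWEL SIGN OO.
word = [0x0D15, 0x0D4B, 0x0D4B, 0x0D4B].pack("U*")

# Canonical decomposition splits each U+0D4B into its two parts.
nfd = word.unicode_normalize(:nfd)
puts nfd.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
# prints "U+0D15 U+0D47 U+0D3E U+0D47 U+0D3E U+0D47 U+0D3E"
```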

Richard.


From unicode at unicode.org  Thu Aug 17 16:34:39 2017
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Thu, 17 Aug 2017 23:34:39 +0200 (CEST)
Subject: Version linking?
In-Reply-To: <CAJ2xs_EBGSbWJgGeMkbFoaz+N+AqJpoVCC-z_ca2UAxsbx9+6g@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <CAJ2xs_EBGSbWJgGeMkbFoaz+N+AqJpoVCC-z_ca2UAxsbx9+6g@mail.gmail.com>
Message-ID: <2042899746.82259.1503005679162@ox.hosteurope.de>

Mark Davis ☕️:
> 
> E5.0 did have the emoji properties of some 10.0 characters a bit ahead of
> time, but only after they were completely locked down.

This should be absolutely avoided in the future, because it was effectively the other way around: No changes to the beta characters were possible because they were already in UTS#51.


From unicode at unicode.org  Thu Aug 17 18:50:22 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 18 Aug 2017 01:50:22 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
 <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
Message-ID: <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>

2017-08-17 18:46 GMT+02:00 Asmus Freytag (c) via Unicode <
unicode at unicode.org>:

> On 8/17/2017 7:47 AM, Philippe Verdy wrote:
>
> 2017-08-17 16:24 GMT+02:00 Mike FABIAN via Unicode <unicode at unicode.org>:
>
>> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>> Most emoji now have "W", for example:
>>
>> 1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>>
>> That seems correct because emoji behave more like Ideographs.
>>
>> Isn't this the same for "CIRCLED NUMBER TEN ON BLACK SQUARE"?
>> This seems to me also more like an Ideograph.
>
>
> Not really. They have existed for a very long time without being bound to
> ideographs or sinographic requirements on metrics. Notably, their baseline
> and vertical extension do not follow the sinographic em-square layout
> convention (except when they are rendered with CJK fonts, or were encoded
> in documents with legacy CJK encodings, which are also rendered with
> suitable CJK fonts, these being preferred to Latin fonts that won't use the
> large sinographic metrics).
>
> If they were like emojis, they would actually be larger: I think there is a
> case for defining an emoji variant for them (where they could also be
> colored or have some 3D-like look).
>
>
> There's an emoji variant for the standard digits.
>

Do you mean circled numbers? I don't think so.

I (and Mike as well, to whom I was replying) was speaking about a good case
for defining an emoji variant of these circled (or squared) numbers (Mike
spoke about circled number 10, which is not encoded as an emoji, nor even as
an ideograph, and to which he proposed giving a wide width property like
ideographs).

From unicode at unicode.org  Fri Aug 18 06:22:48 2017
From: unicode at unicode.org (Mike FABIAN via Unicode)
Date: Fri, 18 Aug 2017 13:22:48 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <87c1d1b0-1c6d-1d4c-5683-2f457104fb38@ix.netcom.com> (Asmus
 Freytag via Unicode's message of "Thu, 17 Aug 2017 09:50:39 -0700")
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <87c1d1b0-1c6d-1d4c-5683-2f457104fb38@ix.netcom.com>
Message-ID: <s9dvallktmf.fsf@redhat.com>

"Asmus Freytag (c) via Unicode" <unicode at unicode.org> wrote:

> On 8/17/2017 7:24 AM, Mike FABIAN wrote:
>> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>>
>>> On 8/16/2017 6:26 AM, Mike FABIAN via Unicode wrote:
>>>
>>>      EastAsianWidth.txt contains:
>>>
>>>      3248..324F;A     # No     [8] CIRCLED NUMBER TEN ON BLACK SQUARE..CIRCLED NUMBER EIGHTY ON BLACK SQUARE
>>>
>>>      i.e. it classifies the width of the characters at codepoints
>>>      between 3248 and 324F as ambiguous.
>>>
>>>      Is this really correct? Shouldn't they be "W", i.e. wide?
>>>
>>>      In most fonts these characters seem to be square shaped
>>>      wide characters.
>>>
>>> "W" not only implies display width, but also a different treatment in the context of line
>>> breaking and vertical layout of text.
>>>
>>> "W" characters behave more like Ideographs, for the most part, while "N" are treated as
>>> forming words (for the most part).
>> Most emoji now have "W", for example:
>>
>> 1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>>
>> That seems correct because emoji behave more like Ideographs.
>>
>> Isn't this the same for "CIRCLED NUMBER TEN ON BLACK SQUARE"?
>> This seems to me also more like an Ideograph.
>>
>>> "A" means, you get to decide whether to treat these as "W" or "N" based on context. If
>>> used in a non ideographic context, they behave like all other symbols (but happen to fill
>>> an EM square).
>
> "A" means, you get to decide whether to treat these as "W" or "N" based on context.
>
> There's really no strong need to change an "A" towards "W", because
> "A" doesn't get in your way if you decide that "W" works better for
> you.
>
> Remember that all the EAW properties are supposed to be "resolved"
> down to W or N. For some, like Na, that resolution is deterministic;
> for A it is context/application dependent, but when you finally
> process your data, only W(ide) or N(arrow) remain after resolution.

OK, that means that it is OK to decide that, in the context of glibc,
resolving these to W(ide) is best, right?
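
The resolution step described in the quoted text can be sketched roughly as
follows. This is only an illustration under stated assumptions: the name
resolve_width and the ambiguous_as_wide flag are hypothetical (they are not
glibc or Unicode API), and folding F into wide and H/N into narrow is the
usual reading of UAX #11 rather than something stated in this thread.

```python
# Hypothetical helper: collapse an East_Asian_Width value to "W" or "N".
# The groupings below are an assumption based on the usual UAX #11 reading.
WIDE = {"W", "F"}          # Wide, Fullwidth
NARROW = {"Na", "H", "N"}  # Narrow, Halfwidth, Neutral


def resolve_width(eaw: str, ambiguous_as_wide: bool) -> str:
    """Return "W" or "N" for an East_Asian_Width property value.

    `ambiguous_as_wide` stands for the context/application decision:
    True in an East Asian context, False otherwise.
    """
    if eaw in WIDE:
        return "W"
    if eaw in NARROW:
        return "N"
    if eaw == "A":
        # Ambiguous: only the caller's context can settle this.
        return "W" if ambiguous_as_wide else "N"
    raise ValueError(f"unknown East_Asian_Width value: {eaw!r}")
```

With such a helper, the glibc choice discussed here would amount to calling
resolve_width("A", ambiguous_as_wide=True) for 3248..324F, while the
property value in EastAsianWidth.txt itself stays "A".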

-- 
Mike FABIAN <mfabian at redhat.com>


From unicode at unicode.org  Fri Aug 18 07:21:07 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Fri, 18 Aug 2017 12:21:07 +0000
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
 <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
 <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>
Message-ID: <ABFAFA9C-D4B5-48F9-8D28-2A49F7B96BA7@lboro.ac.uk>


On 18 Aug 2017, at 00:50, Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2017-08-17 18:46 GMT+02:00 Asmus Freytag (c) via Unicode <unicode at unicode.org>:
>
>> On 8/17/2017 7:47 AM, Philippe Verdy wrote:
>>
>>> 2017-08-17 16:24 GMT+02:00 Mike FABIAN via Unicode <unicode at unicode.org>:
>>>
>>>> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>>>> Most emoji now have "W", for example:
>>>>
>>>> 1F600..1F64F;W   # So    [80] GRINNING FACE..PERSON WITH FOLDED HANDS
>>>>
>>>> That seems correct because emoji behave more like Ideographs.
>>>>
>>>> Isn't this the same for "CIRCLED NUMBER TEN ON BLACK SQUARE"?
>>>> This seems to me also more like an Ideograph.
>>>
>>> Not really. They have existed for a very long time without being bound to ideographs or sinographic requirements on metrics. Notably, their baseline and vertical extension do not follow the sinographic em-square layout convention (except when they are rendered with CJK fonts, or were encoded in documents with legacy CJK encodings, which are also rendered with suitable CJK fonts, these being preferred to Latin fonts that won't use the large sinographic metrics).
>>>
>>> If they were like emojis, they would actually be larger: I think there is a case for defining an emoji variant for them (where they could also be colored or have some 3D-like look).
>>
>> There's an emoji variant for the standard digits.
>
> Do you mean circled numbers? I don't think so.
>
> I (and Mike as well, to whom I was replying) was speaking about a good case for defining an emoji variant of these circled (or squared) numbers (Mike spoke about circled number 10, which is not encoded as an emoji, nor even as an ideograph, and to which he proposed giving a wide width property like ideographs).


Are not CJK ideographs both (W)ide and (S)quare? Does (W)ide imply or define that the ideograph should also be (S)quare?

It seems to me that there are many characters that are both (W)ide and (S)quare, e.g. emoji.

André Schappo


From unicode at unicode.org  Fri Aug 18 15:48:26 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 18 Aug 2017 22:48:26 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <ABFAFA9C-D4B5-48F9-8D28-2A49F7B96BA7@lboro.ac.uk>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
 <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
 <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>
 <ABFAFA9C-D4B5-48F9-8D28-2A49F7B96BA7@lboro.ac.uk>
Message-ID: <CAGa7JC3DuDjAWXvcnAs+s+3hRipEet=qyet6H1k=uE3vxz=MHw@mail.gmail.com>

I don't think that emojis are necessarily "square"; they could be wider
(e.g. a train, a snake, a horizontal railway, a group of several
people, or a cloud) or narrower (e.g. a candle).

Rendering them as square makes sense only in contexts that call for it,
**if possible**: monospaced fonts. But there are cases where a
single character cell would not be enough and multiple cells would be
needed (notably in text terminals, but also in sinographic contexts
using multiple em-squares in a row).

The classification of widths in CJK is there to help determine how many
cells will be needed in two cases: narrow rectangular cells used in text
terminals, or square cells in classic sinographic typesetting (which is
still not mandatory because variable-width rendering is also possible, even
if it is less common, using more specific fonts for such artistic use or to
correctly render handwritten calligraphy). This classification of widths
makes no sense in Latin, where variable-width is still preferred and more
common.
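
For the text-terminal case, the cell counting could be sketched as below.
This is a rough illustration, not how any particular terminal or libc does
it: display_cells and its ambiguous_as_wide parameter are hypothetical
names, combining marks are simply assumed to take no cell of their own, and
control characters and emoji sequences are ignored.

```python
import unicodedata


def display_cells(text: str, ambiguous_as_wide: bool = False) -> int:
    """Estimate how many fixed-width terminal cells `text` occupies."""
    cells = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Assumption: a combining mark shares the cell of its base.
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_as_wide):
            cells += 2  # wide: two narrow cells (one square)
        else:
            cells += 1  # narrow, halfwidth, neutral, or ambiguous-as-narrow
    return cells
```

Under these assumptions a run of Latin letters costs one cell per letter and
a run of CJK ideographs two cells per character, which is the "two letters
side-by-side per square" relationship.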

So there will be both variants for variable-width and "monospaced"
(cell-based) rendering of emojis, just as both exist for CJK and Latin:
Latin letters have a "narrow" width in sinographic square contexts only to
allow two letters side-by-side per square instead of centering them with
wide gaps or rendering them in widened variants. Most Asian emojis from
CJK charsets will render in a single square cell, but others may still need
two square cells for better rendering (without having to use variable width
that would break the grid layout).

When rendering Latin words in CJK contexts, the alignment to the grid may
also be made only on spans of Latin letters (one or more words), by
recentering each span in a row of as many cells as needed to fit it: this
would be even more useful for Arabic sequences. This technique, however,
would not fit very well in classic "text terminals", where half-width Latin,
Hebrew and Arabic will still be preferable (or full-width for some Arabic
letters with some extenders, or some long Arabic ligatures).



From unicode at unicode.org  Fri Aug 18 16:29:17 2017
From: unicode at unicode.org (Peter Edberg via Unicode)
Date: Fri, 18 Aug 2017 14:29:17 -0700
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <CAGa7JC3DuDjAWXvcnAs+s+3hRipEet=qyet6H1k=uE3vxz=MHw@mail.gmail.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
 <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
 <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>
 <ABFAFA9C-D4B5-48F9-8D28-2A49F7B96BA7@lboro.ac.uk>
 <CAGa7JC3DuDjAWXvcnAs+s+3hRipEet=qyet6H1k=uE3vxz=MHw@mail.gmail.com>
Message-ID: <9F6FE3E0-34AE-4FF6-A3FA-63BCD9B327B5@apple.com>

Per UTS #51 (see http://www.unicode.org/reports/tr51/#Design_Guidelines):

"Current practice is for emoji to have a square aspect ratio, deriving from their origin in Japanese. For interoperability, it is recommended that this practice be continued with current and future emoji. They will typically have about the same vertical placement and advance width as CJK ideographs."

- Peter E


From unicode at unicode.org  Sat Aug 19 08:34:32 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 19 Aug 2017 15:34:32 +0200
Subject: Should U+3248 ... U+324F be wide characters?
In-Reply-To: <9F6FE3E0-34AE-4FF6-A3FA-63BCD9B327B5@apple.com>
References: <s9do9rfskxy.fsf@redhat.com>
 <2fcd217d-5f10-5bc3-304d-009efd7e67a9@ix.netcom.com>
 <s9dh8x6b7cn.fsf@redhat.com>
 <CAGa7JC3Zi+d=3LmcT3mzDJcK+jPNbZsWn2wUyok2LnUPjbLFUg@mail.gmail.com>
 <cc34c9ba-03d3-3548-d86a-a0bcdaf6bd9d@ix.netcom.com>
 <CAGa7JC18qnHQEFN==BgSu6xEz3LzDrYE7qKyFHrAKLWBpUoPvA@mail.gmail.com>
 <ABFAFA9C-D4B5-48F9-8D28-2A49F7B96BA7@lboro.ac.uk>
 <CAGa7JC3DuDjAWXvcnAs+s+3hRipEet=qyet6H1k=uE3vxz=MHw@mail.gmail.com>
 <9F6FE3E0-34AE-4FF6-A3FA-63BCD9B327B5@apple.com>
Message-ID: <CAGa7JC3_VTo+vxx-QL-WjsY5e9rr_3XA4+VWWngk3BW8ikG9WQ@mail.gmail.com>

But today's new emojis do not all come from Japanese. Many new emojis are
not even tied to Asian cultures. I don't see why a candle, a train, or an F1
racing car would necessarily be square; they would just look ugly (with
excessive horizontal margins, or smaller than needed with excessive
vertical margins).
The same goes for new emojis showing groups of people (e.g. a family with
two adults and two children).
For emojis showing a single person's face (emoticons) there is, however, no
problem fitting them in a square; the same goes for the various symbols in
squares or circles used in Japanese.


From unicode at unicode.org  Mon Aug 21 11:29:31 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Mon, 21 Aug 2017 16:29:31 +0000
Subject: Unicode 10 Cover Art
Message-ID: <34CC0C1F-B9A8-4D34-8348-75C432BCF8F9@lboro.ac.uk>

Unicode 10.0 Cover Design Art

Were there entries and, if yes, which won?

André Schappo


From unicode at unicode.org  Tue Aug 22 12:08:25 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Tue, 22 Aug 2017 17:08:25 +0000
Subject: Unicode 10 Cover Art
In-Reply-To: <34CC0C1F-B9A8-4D34-8348-75C432BCF8F9@lboro.ac.uk>
References: <34CC0C1F-B9A8-4D34-8348-75C432BCF8F9@lboro.ac.uk>
Message-ID: <DM2PR21MB002822ADF5A786F7EEA363EED5840@DM2PR21MB0028.namprd21.prod.outlook.com>

http://blog.unicode.org/2017/08/unicode-consortium-announces-cover.html




From unicode at unicode.org  Wed Aug 23 07:48:39 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Wed, 23 Aug 2017 18:18:39 +0530
Subject: Ah the power of emoji! To encompass even science and mythology!
Message-ID: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>

?????? <-- lunar eclipse

?????? <-- solar eclipse

?????? <-- apocalypse

https://twitter.com/AstroKatie/status/518697246305439745

?

Not new (2014) but I hadn't seen this till today and felt it a propos re
the recent pair of eclipses.

Shriramana Sharma.


From unicode at unicode.org  Wed Aug 23 17:35:18 2017
From: unicode at unicode.org (David Faulks via Unicode)
Date: Wed, 23 Aug 2017 22:35:18 +0000 (UTC)
Subject: Assamese and Unicode.
References: <559555838.976649.1503527718806.ref@mail.yahoo.com>
Message-ID: <559555838.976649.1503527718806@mail.yahoo.com>

It appears that the Indian government will submit an 'Assamese' proposal.

http://silchar.com/unicode-standard-for-assamese-in-the-offing/

Since everything I know about Assamese Script indicates that it is basically the same as Bengali and the Unicode Assamese controversy is derived entirely from a sub-nationalistic fit over character and script names, I expect that this proposal will not be accepted.

However, 'popular nationalism' will probably be used to attack Unicode then.

David Faulks


From unicode at unicode.org  Wed Aug 23 17:53:13 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 24 Aug 2017 00:53:13 +0200
Subject: Ah the power of emoji! To encompass even science and mythology!
In-Reply-To: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>
References: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>
Message-ID: <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>

Why this distinction between the left or right side on which you place the
"half moon" (and why a "half moon", when eclipses actually occur only at
full moons or new moons?) and the Sun?

Note that solar eclipses normally occur during the day at places where they
are observable, but not necessarily at the zenith (midday), and the
representation of the Earth relative to the Sun, viewed from the side, is in
fact wrong: we should have a "half Earth" and a black spot on the light half
of the Moon or the light half of the Earth.

Also, looking at an eclipse from the side (from space), you will not
easily see the eclipse occurring along the border of the light half.

Given that the half moons used below imply specific periods at which
eclipses can absolutely never occur, I would not use these compositions at
all.

Showing cones of shadow would be more explicit, but I would only represent
two discs partly covering each other:

- For the solar eclipse, a dark "New Moon" (light gray) passing in front of
the bright solar disc and hiding it partly. The Earth would not be
represented directly. A coronal eclipse is also representable with some
solar rays emerging from one border.

    The eclipses most commonly perceived are the solar ones, as they occur
only during the day. People will not necessarily notice one if it occurs in
the early morning for them, while the Sun is still low on the horizon: due
to the diffusion of the solar light by the atmosphere, the eclipse will be
difficult to see (the shadow will be much less black, the effect will be
similar to the presence of clouds on the horizon; they will feel the Sun is
just a bit late to wake up or is going down a bit sooner, and they won't
feel the shift of temperatures).

- For the lunar eclipse, a light blue Earth passing in front of the full
Moon disc (dark grey) on which the Earth starts projecting a disc of
shadow. The Sun would not be represented directly.

    Note that many people are unaware of lunar eclipses when they occur
(they occur by night anyway, when most people are sleeping; many people
don't know whether we are at new moon or full moon, and the night may be
blacker than usual only because of clouds).

Given this composition, I hardly see how you would compose it with existing
emojis for the (light blue) Earth and the (spotted) full moon, or
the (dark) new moon and the "spotted" Sun, except by encoding in the middle
a specific emoji representing both the act of eclipsing and the shadow
cone, or using joiners with a combining shadow.

- <EARTH + COMBINING SHADOW + ZWJ + FULL MOON> = lunar eclipse
- <NEW MOON + COMBINING SHADOW + ZWJ + SUN> = solar eclipse


2017-08-23 14:48 GMT+02:00 Shriramana Sharma via Unicode <
unicode at unicode.org>:

> ?????? <-- lunar eclipse
>
> ?????? <-- solar eclipse
>
> ?????? <-- apocalypse
>

From unicode at unicode.org  Wed Aug 23 18:08:24 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Wed, 23 Aug 2017 16:08:24 -0700
Subject: Ah the power of emoji! To encompass even science and mythology!
In-Reply-To: <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>
References: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>
 <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>
Message-ID: <6b23b439-3181-cf13-3cdf-1e3093cab90c@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20170823/5e06e2fe/attachment.html>

From unicode at unicode.org  Wed Aug 23 18:09:11 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 24 Aug 2017 01:09:11 +0200
Subject: Ah the power of emoji! To encompass even science and mythology!
In-Reply-To: <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>
References: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>
 <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>
Message-ID: <CAGa7JC1Z4djrjki5mt-FgU7Z+=aPg5HqeWnVKdmVE-gZWczR=A@mail.gmail.com>

other interesting combinations:

- <SUN + ZWJ + UMBRELLA + COMBINING SHADOW> = parasol
- <RAINING CLOUD + ZWJ + UMBRELLA> = parapluie
- <SUN + ZWJ + WEARING GLASSES + COMBINING SHADOW> = sun glasses
- <THUNDERBOLT+ZWJ+ANTENNA+COMBINING SHADOW> = parafoudre

Note that a "combining" shadow is not absolutely necessary, but I don't see
how a shadow can exist without the object creating it and giving its form to
the shadow.

The direction of the shadow does not matter (I suggest it is oriented by
default to the bottom right for LTR contexts including emojis, and to the
bottom left for RTL contexts; Bidi controls could override this
direction). And it could easily apply to any character cluster
(transforming it into a symbol, or encoding the shadow as ignorable with
combining class 0, i.e. requiring its encoding at end of the cluster from
which it derives the shadow form to render). It would not necessarily
require new fonts (shadowing can be synthesized by text renderers from
existing glyphs).

- LATIN LETTER A + COMBINING SHADOW = shadowed letter A (the shadow to the
bottom right is a mirrored and skewed version of the letter, projected from
its natural baseline)


2017-08-24 0:53 GMT+02:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> Why this distinction with the left or right side on which you'll place the
> "half moon" (which "half moon" when eclipses actually occur either on full
> moons or new moons?) and the Sun ???
>
> Note that Solar eclipses occur normally during the day at places where
> they are observable, but not necessarily at zenith (midday) and the
> representation of the Earth relative to the Sun, viewed from the side is in
> fact wrong: we should have a "half Earth" and a black spot on the light half
> of the moon or the light half of the Earth.
>
> As well, looking at an eclipse from the side (from space) you will not
> easily see the eclipse occurring along the border of the light half.
>
> Given the meaning associated to the half moons used below which imply
> specific periods at which eclipses can absolutely never occur, I would not
> use these compositions at all.
>
> Showing cones of shadows would be more explicit, but I would only
> represent two discs partly covering themselves:
>
> - For the solar eclipse, a dark "New Moon" (light gray) passing in front
> of the bright solar disc and hiding it partly. The Earth would not be
> represented directly. A coronal eclipse is also representable with some
> solar rays emerging from one border.
>
>     The common perception of eclipses is of the solar ones, as they occur
> only during the day. People will not necessarily notice it if the solar
> eclipse occurs in the early morning for them and the Sun is still low on
> the horizon: due to the diffusion of the solar light by the atmosphere, the
> eclipse will be difficult to see (the shadow will be much less black, the
> effect will be similar to the presence of clouds on the horizon, they will
> feel the sun is just a bit late to wake up or is falling down a bit sooner
> and they won't feel the shift of temperatures).
>
> - For the lunar eclipse, a light blue Earth passing in front of the full
> Moon disc (dark grey) on which the Earth starts projecting a disc of
> shadow. The Sun would not be represented directly
>
>     Note that many people ignore the fact of lunar eclipses when they
> occur (this occurs by night anyway, most people are sleeping. Many people
> don't know if we are in times of new moons or full moons, the night may be
> blacker than usual only because of clouds).
>
> Given the composition I hardly see how you would compose it with existing
> emojis for the (light blue) Earth emoji and the (spotted) full moon, or
> the (dark) new moon and the "spotted" Sun, except by encoding in the middle
> a specific emoji representing both the act of eclipsing and the shadow
> cone, or using joiners with a combining shadow.
>
> - <EARTH + COMBINING SHADOW + ZWJ + FULL MOON> = lunar eclipse
> - <NEW MOON + COMBINING SHADOW + ZWJ + SUN> = solar eclipse
>
>
>
> 2017-08-23 14:48 GMT+02:00 Shriramana Sharma via Unicode <
> unicode at unicode.org>:
>
>> ?????? <-- lunar eclipse
>>
>> ?????? <-- solar eclipse
>>
>> ?????? <-- apocalypse
>>
>

From unicode at unicode.org  Wed Aug 23 18:30:41 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 24 Aug 2017 01:30:41 +0200
Subject: Assamese and Unicode.
In-Reply-To: <559555838.976649.1503527718806@mail.yahoo.com>
References: <559555838.976649.1503527718806.ref@mail.yahoo.com>
 <559555838.976649.1503527718806@mail.yahoo.com>
Message-ID: <CAGa7JC2Nq7=fjvpRH-=UZjXzNog37fWi9a3vWccTK69dVLUD0A@mail.gmail.com>

It could appear as a supplementary chart for the ISCII standard, but when
converting to Unicode, it should have no impact except possibly encoding
some of their letters in the new chart as pairs of Unicode characters even
if one of them would not be necessary in all contexts (it could be a
variant code or a new combining character, or using a Unicode named
sequence). This would define a transcription rule to convert the
orthographic spelling to codes.

The new ISCII chart could display the Assamese letter names, without impact
on the unified Unicode letters even if Unicode letters have Bengali names.
Those names could be annotated or updated into the Unicode charts as these
annotations are not normative.

The new ISCII chart could even adopt some slight modification in its own
binary sort order. This will not impact the Unicode binary order, and
Assamese will continue to use a collation tailoring by language (in CLDR).

But most probably the real pressure will be to adapt the keyboard where
Bengali keyboard layouts don't match the most frequent use. I don't think
ISCII will need a revival in a new version when India is already using
Unicode.

We have a single unified Latin script used for many languages that are not
even Latin-based such as Turkish and still no "Turkish script" even if
there's a known "Turkish alphabet" (sometimes known as "Altaic" alphabet with
its typical distinction of hard-dotted and undotted letters I and J).


2017-08-24 0:35 GMT+02:00 David Faulks via Unicode <unicode at unicode.org>:

> It appears that the Indian government will submit an 'Assamese' proposal.
>
> http://silchar.com/unicode-standard-for-assamese-in-the-offing/
>
> Since everything I know about Assamese Script indicates that it is
> basically the same as Bengali and the Unicode Assamese controversy is
> derived entirely from a sub-nationalistic fit over character and script
> names, I expect that this proposal will not be accepted.
>
> However, 'popular nationalism' will probably be used to attack Unicode
> then.
>
> David Faulks
>
>

From unicode at unicode.org  Wed Aug 23 18:40:21 2017
From: unicode at unicode.org (James Kass via Unicode)
Date: Wed, 23 Aug 2017 15:40:21 -0800
Subject: Assamese and Unicode.
In-Reply-To: <559555838.976649.1503527718806@mail.yahoo.com>
References: <559555838.976649.1503527718806.ref@mail.yahoo.com>
 <559555838.976649.1503527718806@mail.yahoo.com>
Message-ID: <CABPY6Z0DRRPQNj0rANHTaDjRTEq1umKNHaN99VR9q4XrTAyT9g@mail.gmail.com>

The Eastern Nagari script is used to write Bengali and Assamese, as
well as a few other languages.  To the best of my knowledge, the
existing Unicode encoding includes coverage for the minor typographic
differences between Bengali and Assamese text.

Any proposal for separate Assamese code points should be judged on its
merits, and it's a "non-starter".

'Popular nationalism' might be best served by returning to separate
code pages.  For me, that's also a "non-starter".

Best regards,

James Kass

From unicode at unicode.org  Wed Aug 23 20:18:17 2017
From: unicode at unicode.org (Michael Everson via Unicode)
Date: Thu, 24 Aug 2017 02:18:17 +0100
Subject: Assamese and Unicode.
In-Reply-To: <559555838.976649.1503527718806@mail.yahoo.com>
References: <559555838.976649.1503527718806.ref@mail.yahoo.com>
 <559555838.976649.1503527718806@mail.yahoo.com>
Message-ID: <14DE0BFF-9EAA-4D96-8620-8D46CE6D9130@evertype.com>

"The uniqueness of the Assamese script was perhaps unknown to the mainly American experts of Unicode, sources said."

They have never been able to show the difference to anyone in SC2 (which has more than Americans in it), because there is no difference to show. 

Michael Everson

> On 23 Aug 2017, at 23:35, David Faulks via Unicode <unicode at unicode.org> wrote:
> 
> It appears that the Indian government will submit an 'Assamese' proposal.
> 
> http://silchar.com/unicode-standard-for-assamese-in-the-offing/
> 
> Since everything I know about Assamese Script indicates that it is basically the same as Bengali and the Unicode Assamese controversy is derived entirely from a sub-nationalistic fit over character and script names, I expect that this proposal will not be accepted.
> 
> However, 'popular nationalism' will probably be used to attack Unicode then.
> 
> David Faulks
> 


From unicode at unicode.org  Thu Aug 24 02:19:50 2017
From: unicode at unicode.org (Dreiheller, Albrecht via Unicode)
Date: Thu, 24 Aug 2017 07:19:50 +0000
Subject: Rendering variants of U+3127 Bopomofo Letter I
Message-ID: <3E10480FE4510343914E4312AB46E74212CD95EE@DEFTHW99EH5MSX.ww902.siemens.net>

Hello Chinese experts,

The Letter I in the Bopomofo alphabet (U+3127) has two rendering variants, a vertical bar and a horizontal bar.

Can anyone please tell me the context criteria: when should which variant be used?
Is it PR China using the vertical form (like in the font SimSun) and Taiwan using the horizontal form (like in the font PMingLiU)?

Thanks
Albrecht

From unicode at unicode.org  Thu Aug 24 03:45:04 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Thu, 24 Aug 2017 09:45:04 +0100 (BST)
Subject: Ah the power of emoji! To encompass even science and mythology!
In-Reply-To: <6b23b439-3181-cf13-3cdf-1e3093cab90c@ix.netcom.com>
References: <CAH-HCWXiiyvwQUPT-dy3vNn92bhm9X-iKGgSqnNFmVhDcLhBMQ@mail.gmail.com>
 <CAGa7JC2L1XdsGjKoBdHhP-GGW-+Gw6C7GfxNyZVrk-JRzMwhfw@mail.gmail.com>
 <6b23b439-3181-cf13-3cdf-1e3093cab90c@ix.netcom.com>
Message-ID: <3200640.8427.1503564304808.JavaMail.defaultUser@defaultHost>

Asmus Freytag wrote:  
    
> Philippe,
      
> thank you for your earnest efforts at explaining away a joke.
      
> I'm sure I'm speaking for the assembled congregation in applauding you for your tireless energy in setting the record straight.
      
Well, as Asmus is purporting to speak for other people, I write to state my own opinion.

I enjoyed Philippe's two posts thus far on this topic.

They caused me to think.

Philippe's idea for a COMBINING SHADOW character is a good one. I like it. How about U+20F1 as the code point?

There is an interesting new document in the Unicode Technical Committee Document Register.

http://www.unicode.org/L2/L2017/17304-moon-var.pdf

That is about the moon symbols.

There is an interesting section entitled "Leaning (lit part of the) Moon".

I am thinking that if one had after the moon symbol and before any variation selector a two character sequence of a U+200D ZERO WIDTH JOINER character followed by a character from the set

{U+24EA CIRCLED DIGIT ZERO, U+2460 CIRCLED DIGIT ONE .. U+2470 CIRCLED NUMBER SEVENTEEN}

then a specific lean, in 10 degree steps clockwise, could be specified.

For example, ZWJ followed by U+2462 CIRCLED DIGIT THREE would indicate a lean of 30 degrees clockwise.
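
The mapping suggested above can be sketched in a few lines of Python. This is only an illustration of the idea under discussion, not anything defined by Unicode; the function name `lean_degrees` is hypothetical, and the 10-degree step and the character set are taken from the suggestion itself.

```python
# Hypothetical decoder for the suggested "ZWJ + circled number" lean scheme.
# Nothing here is part of the Unicode Standard; it only mirrors the suggestion.

def lean_degrees(ch: str) -> int:
    """Return the clockwise lean, in degrees, encoded by one circled number."""
    cp = ord(ch)
    if cp == 0x24EA:            # U+24EA CIRCLED DIGIT ZERO -> no lean
        return 0
    if 0x2460 <= cp <= 0x2470:  # CIRCLED DIGIT ONE .. CIRCLED NUMBER SEVENTEEN
        return (cp - 0x2460 + 1) * 10
    raise ValueError(f"U+{cp:04X} is not in the suggested set")

print(lean_degrees("\u2462"))   # U+2462 CIRCLED DIGIT THREE
```

With this set the representable leans run from 0 to 170 degrees in 10-degree steps.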

Yet I am unsure as to whether encoding it like that would both encompass every possibility needed yet also not encode situations that could not exist astronomically.

So, I put forward the idea of the ZWJ followed by a circled digit or a circled number so as to initiate discussion of what would be the best way to encode all of the various situations that could occur astronomically.

Best regards,

William Overington

Thursday 24 August 2017

From unicode at unicode.org  Thu Aug 24 11:06:08 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Thu, 24 Aug 2017 16:06:08 +0000
Subject: Unicode education in Schools
Message-ID: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>

I came across this School Unicode exercise: https://community.computingatschool.org.uk/resources/4546 The exercise concerns Emoji but to me the important point is that the schoolchildren are having to think about SMP characters. I do not know if schools give an explanation of the BMP and SMP planes.

Slowly, Unicode is making its way into school curricula.

André Schappo


From unicode at unicode.org  Thu Aug 24 11:40:38 2017
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Thu, 24 Aug 2017 09:40:38 -0700
Subject: Rendering variants of U+3127 Bopomofo Letter I
In-Reply-To: <3E10480FE4510343914E4312AB46E74212CD95EE@DEFTHW99EH5MSX.ww902.siemens.net>
References: <3E10480FE4510343914E4312AB46E74212CD95EE@DEFTHW99EH5MSX.ww902.siemens.net>
Message-ID: <ffbee15d-ac38-0d0e-56e8-f7f6a227fb2b@att.net>

Albrecht,

See TUS, Section 18.3, Bopomofo, p. 707:

http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf#G22553

--Ken


On 8/24/2017 12:19 AM, Dreiheller, Albrecht via Unicode wrote:
>
> Hello Chinese experts,
>
> The Letter I in the Bopomofo alphabet (U+3127) has two rendering
> variants, a vertical bar and a horizontal bar.
>
> Can anyone please tell me the context criteria, when should which
> variant be used?
>
> Is it PR China using the vertical form (like in font SimSun) and
> Taiwan using the horizontal form (like in font PMingLiU)?
>
> Thanks
>
> Albrecht
>


From unicode at unicode.org  Thu Aug 24 11:45:32 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 24 Aug 2017 22:15:32 +0530
Subject: Unicode education in Schools
In-Reply-To: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
Message-ID: <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>

So how do you think it matters if the characters are in the BMP or SMP?

From unicode at unicode.org  Thu Aug 24 12:17:10 2017
From: unicode at unicode.org (Andre Schappo via Unicode)
Date: Thu, 24 Aug 2017 17:17:10 +0000
Subject: Unicode education in Schools
In-Reply-To: <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
Message-ID: <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>


Because there are many systems that can now handle BMP characters but cannot handle SMP characters.

One example is systems that use MySQL utf8 (3-byte encoding) and have not yet updated to utf8mb4 (4-byte encoding).

So, I consider it important to familiarise students with SMP characters as well as BMP characters. Then when they develop software they will, at the start, be thinking beyond ASCII and Unicode BMP characters.
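
To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch that lists the characters of a string whose UTF-8 form needs four bytes, i.e. the ones a 3-byte-only store cannot hold. The function name `four_byte_characters` is made up for this illustration.

```python
def four_byte_characters(text: str) -> list:
    """Return the characters of text whose UTF-8 encoding is 4 bytes long.

    These are exactly the characters outside the BMP (above U+FFFF).
    """
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]

# ASCII, Latin-1, CJK (all BMP) and one emoji from the SMP.
sample = "a\u00e9\u4e2d\U0001F600"
for ch in sample:
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} byte(s) in UTF-8")

print(four_byte_characters(sample))
```

A check like this could be run on input before it reaches a store that is known to be limited to 3-byte sequences.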

André Schappo

> On 24 Aug 2017, at 17:45, Shriramana Sharma <samjnaa at gmail.com> wrote:
> 
> So how do you think it matters if the characters are in the BMP or SMP?


From unicode at unicode.org  Thu Aug 24 13:19:10 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 24 Aug 2017 20:19:10 +0200
Subject: Unicode education in Schools
In-Reply-To: <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
Message-ID: <CAGa7JC3-kM6T9Aqf0unTtvLfEdvwO28VpZws42T86PK1UwQRaA@mail.gmail.com>

2017-08-24 19:17 GMT+02:00 Andre Schappo via Unicode <unicode at unicode.org>:

>
> Because there are many systems that can now handle BMP characters but
> cannot handle SMP characters.
>
> One example being systems that use mysql utf8 (3 byte encoding) and have
> not yet updated to utf8mb4 (4 byte encoding)
>

MySQL's utf8 is known to cause severe problems, notably on wikis installed
by default with it: the presence of any non-BMP character (SMP characters
and emojis are now very frequent and available on almost all modern
smartphones) in the edited text will cause its **silent** truncation when
uploading it to the server (when it saves the text to the database), even
if any unsaved preview was correct. You will see the truncation when the
page is loaded again.

MySQL's "utf8" should have been dropped long ago and replaced by utf8mb4,
or set up so that data sent to a "utf8"-encoded database would cause an SQL
error that cannot be silently ignored with truncation (or at least it
should only cause the non-BMP characters to be filtered out, without
silently deleting everything that follows).

This is an old severe bug of MySQL (on the server itself), or in the
connection protocol, or in internal filters used by the MySQL client
library, that has caused many severe security issues (such as discarding
logs or todo
lists, or loss of pending commercial transactions such as lists of payments
to process to a bank or truncated billings sent to customers, or loss of
contact address or name, or broken complete addresses for product delivery
to a customer, or missing items in a delivered box and lost products in the
middle of their routing).

This is a demonstration that not signaling encoding errors to an
application, or not clearly specifying that an API may cause encoding
exceptions that must be caught and must not be ignored in applications, can
hurt. Even if you use "utf8mb4", encoding errors are still possible and must
not be ignored, as the final result will be unpredictable.
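
The difference between signaling an error and silently truncating can be shown with a toy model in Python. This is not MySQL's actual code path; `store_three_byte_only` is a hypothetical function that merely imitates a store limited to BMP characters, with a strict mode that raises and a lenient mode that cuts the text at the first offending character.

```python
def store_three_byte_only(text: str, strict: bool = True) -> str:
    """Toy model of a store that accepts only BMP characters.

    strict=True  -> raise, so the caller cannot miss the problem.
    strict=False -> silently drop the first non-BMP character and everything
                    after it (the behaviour complained about above).
    """
    for index, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            if strict:
                raise ValueError(
                    f"non-BMP character U+{ord(ch):04X} at index {index}"
                )
            return text[:index]  # silent truncation: the data loss is invisible
    return text

message = "Pay invoice 42 \U0001F600 then ship to 10 Main St"
print(repr(store_three_byte_only(message, strict=False)))  # address is gone
try:
    store_three_byte_only(message)
except ValueError as err:
    print("rejected:", err)
```

In the lenient call the delivery address disappears without any signal; in the strict call the application is forced to deal with the problem.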

From unicode at unicode.org  Thu Aug 24 16:01:52 2017
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 24 Aug 2017 14:01:52 -0700
Subject: Unicode education in Schools
In-Reply-To: <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
Message-ID: <c4566f60-3c69-fb00-7ae3-e243234547d4@ix.netcom.com>

On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:
> Because there are many systems that can now handle BMP characters but cannot handle SMP characters.
>
> One example being systems that use mysql utf8 (3 byte encoding) and have not yet updated to utf8mb4 (4 byte encoding)
>
> So, I consider it important to familiarise students with SMP characters as well as BMP characters. Then when they develop software they will, at the start, be thinking beyond ASCII and Unicode BMP characters.

The thinking "beyond BMP" part only comes in when you work in encoding 
forms where the BMP uses a different number of code units than the SMP 
(or any other non-BMP "page"). This is true for both utf8 and utf16 but 
not if you work in utf32 or in scalar values (as in the posted exercise).
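
A small Python sketch of that point, counting code units per encoding form for one BMP and one SMP character. The helper name `code_units` is made up for this illustration; the little-endian codec variants are used only so that no byte order mark is counted.

```python
def code_units(ch: str) -> dict:
    """Count the code units needed for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

print(code_units("\u4e2d"))      # a BMP character
print(code_units("\U0001F600"))  # an SMP character
```

Only the UTF-32 count stays the same for both characters, which is why the BMP boundary is invisible there.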


The trick with using emoji in this lesson is that the descriptions and 
images are meaningful to any English speaker, so it gets the student to 
learn about character names.

The same exercise would be more of a challenge for students whose native 
tongue is not English.

A./
>
> André Schappo
>
>> On 24 Aug 2017, at 17:45, Shriramana Sharma <samjnaa at gmail.com> wrote:
>>
>> So how do you think it matters if the characters are in the BMP or SMP?
>
>


From unicode at unicode.org  Thu Aug 24 18:23:40 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Fri, 25 Aug 2017 00:23:40 +0100
Subject: Unicode education in Schools
In-Reply-To: <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
Message-ID: <20170825002340.6c8c7798@JRWUBU2>

On Thu, 24 Aug 2017 17:17:10 +0000
Andre Schappo via Unicode <unicode at unicode.org> wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.

From unicode at unicode.org  Thu Aug 24 18:24:36 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 25 Aug 2017 01:24:36 +0200
Subject: Version linking?
In-Reply-To: <20170817213727.0263c2d8@JRWUBU2>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
Message-ID: <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>

2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

> Thus, at the level of undisputable text, in Indic scripts there appears
> to be no provision for the ordering of multiple left matras that are
> to be stored in logical order (i.e. backing order) after the onset
> consonants. (Thus, this is not a problem for the Thai script.)
> Fortunately, there is no good evidence that the occurrence of multiple
> distinct left matras is anything but a typing error, though I can easily
> see how it might be used as a lexicographical convention on the fuzzy
> edge of plain text.
>
> In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> ),
> but I'm not sure what the legitimate encodings of the example word
> കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
>

Even if there were typing errors, the input method should either signal it
visually to the user (using canonical reordering), or the user could still
cancel this reordering (e.g. CTRL+Z for undoing it) and the input method
could still fix it and maintain the order by then inserting combining
joiners automatically even if the user did not enter them directly.

The joiners should better be removed transparently by the text editor
without requiring the user to perform complex selections or pressing
BACKSPACE multiple times, as I don't see any use of these joiners at end of
graphemes, or multiple joiners in a sequence.

Then the user can even click in the middle of the uncommon sequences of
matras, to correct a missing consonant if needed: here also the joiner
that is encoded but hidden there would be dropped automatically.

If there are specific sequences requiring other uses of joiners for useful
distinction in some pairs of letters or diacritics, the input editor could
offer a way to enter the sequence directly or to change the encoding of
that pair with or without the joiner in the middle. Having to retype
completely the matra (using BACKSPACE deleting transparently the joiners,
or using normal text selection over full clusters) should be the exception.

If such special sequences requiring joiners are frequent, there should be a
way to enter that sequence directly for the target language, the input
editor could propose it with a point and click/touch palette or some
function/control keys or contextual menu when selecting a candidate
occurrence where alternate encodings are possible and known (possibly
registered by the user himself within his own input preferences or in his
personal lexical file of alternate words where they would have been when
they deviate from the most common orthographic rules). Which UI widget or
function key will be used by the input editor is left to the system or
application UI.

But the system should not decide alone that a sequence is invalid for some
orthographic system, when Unicode provides valid ways to deviate from any
orthographic system and to bypass the common canonical equivalences by
adding some transparent controls.

Even for Latin, one can freely enter SHY controls at any place within
words, even if they are not at correct syllabic separations: this will
impact the rendering if there are linebreaks, but this is done on purpose,
and still easy to correct if this was made by error (a spell checker could
also help locate these uncommon errors in existing texts but would not
automatically correct them without instruction given by the user and a user
can also choose to ignore/discard these signals and store the text as is).

Whether the text with uncommon sequences will be easy to render correctly is
not the problem: the editor will just attempt to give a best-effort
representation, and if this approximate representation is too frequent,
fonts and renderers will be updated later to support and render correctly
the "uncommon" sequence (without even needing to change the Unicode
standard itself). But inputting such text will not be blocked.

The case of confusable two-part vowels in Indic scripts however causes a
problem of interpretation and it's not reasonable to think that users will
use one sequence instead of the other, when both would render the same with
the existing typographic rules implemented in renderers, but they collate
differently (this may be a problem for plain-text searches if we look for
distinctions, or sorting, but this can be fixed by defining collation
strengths or search flags to apply or not some collation equivalences, by
enabling or disabling some tailorings), and then this can help setup a
spell checker to signal or ignore some suggested corrections.

From unicode at unicode.org  Thu Aug 24 19:17:32 2017
From: unicode at unicode.org (David Starner via Unicode)
Date: Fri, 25 Aug 2017 00:17:32 +0000
Subject: Fwd: Unicode education in Schools
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
Message-ID: <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>

---------- Forwarded message ---------
From: David Starner <prosfilaes at gmail.com>
Date: Thu, Aug 24, 2017, 6:16 PM
Subject: Re: Unicode education in Schools
To: Richard Wordingham <richard.wordingham at ntlworld.com>


On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
>
> Richard.
>

Steer them away from reinventing the wheel. If they use Java, use Java
strings. If they're using GTK, use strings compatible with GTK. If they're
writing JavaScript, use JavaScript strings. There's basically no system
without Unicode strings, or one where they would be better off reinventing
the wheel.

>

From unicode at unicode.org  Thu Aug 24 21:29:06 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Fri, 25 Aug 2017 04:29:06 +0200
Subject: Unicode education in Schools
In-Reply-To: <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
 <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
Message-ID: <CAGa7JC2kqoAj+AN78n-jakRwx6yieyvRVzYaCKySkOxLmGJjsg@mail.gmail.com>

Strings in Java and JavaScript are basically the same, as they are arbitrary
sequences of 16-bit code units, and not restricted to text with valid
UTF-16 encoding. The differences are in the set of access methods, but they
are both normally immutable, and both allow (but do not enforce) substrings
to share their backing store between distinct instances. The same applies to
C/C++ "wide strings" when their code units are larger than 1 byte, but
C/C++ do not make them immutable, except using dedicated classes, which
will transiently allow setting their content through constructors, and
C/C++ wide strings exist with several signed and unsigned code units (whereas
Java only has unsigned 16-bit code units in its "char", and JavaScript
has no "char" type but only "Number" types with valid range restrictions
applied when constructing String instances from code units or from
code point values).

JavaScript should soon have a new numeric type (it is provisionally named
"BigInt", an arbitrary-precision integer whose constants will be suffixed by
"n", and there will be no implicit promotion from/to Number but only explicit
conversions by checked constructors) and new code unit types for mutable
buffers (but only for the range checks of their write accessors, using
"Number" 64-bit floating-point values or the newer "BigInt" integers).

There are similar designs in Perl, PHP, and most languages: Unicode support
and conformance for using these types for valid text is implemented only by
libraries in their standard text API or in their I/O APIs taking immutable
strings or mutable buffers in parameters, or returning sharable but
immutable string instances or a mutable buffer referenced on input or
allocated internally, but these APIs are not restricted to just valid
Unicode text handling and allow using their strings with any other encoding.

With immutable strings implemented as classes, the backing store is
normally not directly accessible even by reference, you can just reference
the class referencing internally the backing store... implemented using
mutable buffers and using an internal encoding which may be different from
the one exposed by the string class (possibly using compression techniques
for their backing store, on demand, and implicit atomization of most
frequently used string values, notably the empty string and string values
representing a single character with an 8-bit only code point value, or
strings containing any repetition of the same code point value:  these
values do not need any internally allocated buffer for their backing store,
so these instances are allocated very fast, and do not stress the garbage
collector when they are no longer used).

When Unicode text handling methods are supported by their exposed methods,
the Unicode validation rules are not necessarily checked everywhere, so it
is still possible to have strings or buffers containing a single unpaired
surrogate value. The backing store may also allow storing code units
outside the ranges used by valid UTF-16 or valid UTF-32 (the backing stores
are virtualized and could be on disk and swapped on demand with reusable
buffers from a pool).
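
As a minimal sketch of that point (an illustration only, using Python's str
as a stand-in for such a string class; the helper name is_well_formed is
made up for this example), a string object can hold a lone surrogate that a
strict UTF-8 encoder refuses:

```python
# Illustration only: Python's str stands in for the "string class" here.
lone = "\ud800"  # a single unpaired high surrogate; the str type accepts it


def is_well_formed(text):
    """Return True if text encodes as strict UTF-8 (no lone surrogates)."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(len(lone))                          # one code point is stored
print(is_well_formed(lone))               # strict validation rejects it
print(is_well_formed("abc\U0001F600"))    # a real supplementary character is fine

# The "surrogatepass" error handler writes the surrogate out anyway,
# producing bytes that are not valid UTF-8.
raw = lone.encode("utf-8", "surrogatepass")
print(raw.hex())
```

Whether a given API rejects, replaces or passes through such a value depends
on the library, which is the situation described above.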

2017-08-25 2:17 GMT+02:00 David Starner via Unicode <unicode at unicode.org>:

> ---------- Forwarded message ---------
> From: David Starner <prosfilaes at gmail.com>
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham <richard.wordingham at ntlworld.com>
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
>> Just steer them away from UTF-16!  (And vigorously prohibit the very
>> concept of UCS-2).
>>
>> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java
> strings. If they're using GTK, use strings compatible with GTK. If they're
> writing JavaScript, use JavaScript strings. There's basically no system
> without Unicode strings or that they would be better off rewriting the
> wheel.

From unicode at unicode.org  Fri Aug 25 00:14:42 2017
From: unicode at unicode.org (Peter Constable via Unicode)
Date: Fri, 25 Aug 2017 05:14:42 +0000
Subject: Unicode education in Schools
In-Reply-To: <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
 <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
Message-ID: <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>

I thought Javascript had a UCS-2 understanding of Unicode strings. Has it managed to progress beyond that?


Peter


From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Starner via Unicode
Sent: Thursday, August 24, 2017 5:18 PM
To: Unicode Mailing List <unicode at unicode.org>
Subject: Fwd: Unicode education in Schools


---------- Forwarded message ---------
From: David Starner <prosfilaes at gmail.com>
Date: Thu, Aug 24, 2017, 6:16 PM
Subject: Re: Unicode education in Schools
To: Richard Wordingham <richard.wordingham at ntlworld.com>


On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.

Steer them away from reinventing the wheel. If they use Java, use Java strings. If they're using GTK, use strings compatible with GTK. If they're writing JavaScript, use JavaScript strings. There's basically no system without Unicode strings or that they would be better off rewriting the wheel.

From unicode at unicode.org  Fri Aug 25 00:40:33 2017
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Fri, 25 Aug 2017 11:10:33 +0530
Subject: Unicode education in Schools
In-Reply-To: <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
 <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
 <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>
Message-ID: <CAH-HCWWN2LZbCO+N56fEsFmGGnZWEgFRvyru=8vkg9ac9vknRQ@mail.gmail.com>

IIUC the limitation seems to be only that functions such as "charAt" do not
recognize that surrogates aren't valid characters:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
via https://stackoverflow.com/a/8716157/1503120.

This is a problem of many 32-bit char based toolkits too and doesn't
(can't?) have an efficient solution for SMP without counting the surrogates
(and checking them). Right?

From unicode at unicode.org  Fri Aug 25 01:29:39 2017
From: unicode at unicode.org (via Unicode)
Date: Thu, 24 Aug 2017 23:29:39 -0700
Subject: Unicode education in Schools
In-Reply-To: <CAH-HCWWN2LZbCO+N56fEsFmGGnZWEgFRvyru=8vkg9ac9vknRQ@mail.gmail.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
 <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
 <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>
 <CAH-HCWWN2LZbCO+N56fEsFmGGnZWEgFRvyru=8vkg9ac9vknRQ@mail.gmail.com>
Message-ID: <4ed9aec4-2348-46a2-b1e5-2a0db6026791@Spark>

Use String.codePointAt() etc.
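
A rough sketch of the difference, in Python rather than JavaScript (the
helper names utf16_units and code_point_at are made up for illustration, and
this assumes strings are treated as sequences of UTF-16 code units, as
JavaScript does): indexing by code unit returns half of a surrogate pair,
while a codePointAt-style lookup combines the pair.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, i):
    """Mimic a codePointAt-style lookup on a list of UTF-16 code units."""
    first = units[i]
    if 0xD800 <= first <= 0xDBFF and i + 1 < len(units):
        second = units[i + 1]
        if 0xDC00 <= second <= 0xDFFF:
            # Combine a leading and a trailing surrogate into one scalar value.
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
    return first  # BMP code unit, or an unpaired surrogate returned as is


units = utf16_units("a\U0001F600")        # 'a' followed by U+1F600
print([hex(u) for u in units])            # three code units for two characters
print(hex(units[1]))                      # charAt-style access: a lone surrogate
print(hex(code_point_at(units, 1)))       # codePointAt-style access: U+1F600
```

Note that the index is still a code unit index, so a caller iterating this
way has to skip the trailing surrogate itself.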


On Aug 24, 2017, 10:42 PM -0700, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
> IIUC the limitation seems to be only that functions such as "charAt" do not recognize that surrogates aren't valid characters:
>
> https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt via https://stackoverflow.com/a/8716157/1503120.
>
> This is a problem of many 32-bit char based toolkits too and doesn't (can't?) have an efficient solution for SMP without counting the surrogates (and checking them). Right?

From unicode at unicode.org  Fri Aug 25 01:36:00 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Fri, 25 Aug 2017 09:36:00 +0300
Subject: Unicode education in Schools
In-Reply-To: <20170825002340.6c8c7798@JRWUBU2> (message from Richard
 Wordingham via Unicode on Fri, 25 Aug 2017 00:23:40 +0100)
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
Message-ID: <83tw0w6tnz.fsf@gnu.org>

> Date: Fri, 25 Aug 2017 00:23:40 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> On Thu, 24 Aug 2017 17:17:10 +0000
> Andre Schappo via Unicode <unicode at unicode.org> wrote:
> 
> > So, I consider it important to familiarise students with SMP
> > characters as well as BMP characters. Then when they develop software
> > they will, at the start, be thinking beyond ASCII and Unicode BMP
> > characters.
> 
> Just steer them away from UTF-16!

Which will leave them entirely unprepared for the MS-Windows Unicode
programming, something they of course will never need in their
careers.

From unicode at unicode.org  Fri Aug 25 03:28:20 2017
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Fri, 25 Aug 2017 10:28:20 +0200
Subject: Unicode education in Schools
In-Reply-To: <c4566f60-3c69-fb00-7ae3-e243234547d4@ix.netcom.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <c4566f60-3c69-fb00-7ae3-e243234547d4@ix.netcom.com>
Message-ID: <CAJ2xs_EMSqpkALYGPx47uwUoY4rsW9_tOJ9NfbZOg86bJ+1D=A@mail.gmail.com>

Mark

(https://twitter.com/mark_e_davis)

On Thu, Aug 24, 2017 at 11:01 PM, Asmus Freytag via Unicode <
unicode at unicode.org> wrote:

> On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:
>
>> Because there are many systems that can now handle BMP characters but
>> cannot handle SMP characters.
>>
>> One example being systems that use mysql utf8 (3 byte encoding) and have
>> not yet updated to utf8mb4 (4 byte encoding)
>>
>> So, I consider it important to familiarise students with SMP characters
>> as well as BMP characters. Then when they develop software they will, at
>> the start, be thinking beyond ASCII and Unicode BMP characters.
>>
>
> The thinking "beyond BMP" part only comes in when you work in encoding
> forms where the BMP uses a different number of code units than the SMP (or
> any other non-BMP "page"). This is true for both utf8 and utf16 but not if
> you work in utf32 or in scalar values (as in the posted exercise).
>
>
> The trick with using emoji in this lesson is that the descriptions and
> images are meaningful to any English speaker, so it gets the student to
> learn about character names.
>
> The same exercise would be more of a challenge for students whose native
> tongue is not English.


> The trick with using emoji...

True. For emoji names it would be better to use the CLDR names with
non-anglophone audiences, since those names are available in a number of
languages.

eg http://www.unicode.org/cldr/charts/31/annotations/romance.html#?? (that
was last release's version; next release will have improvements...)

>
>
> A./
>
>
>> André Schappo
>>
>> On 24 Aug 2017, at 17:45, Shriramana Sharma <samjnaa at gmail.com> wrote:
>>>
>>> So how do you think it matters if the characters are in the BMP or SMP?
>>>
>>
>>
>>
>

From unicode at unicode.org  Fri Aug 25 03:52:56 2017
From: unicode at unicode.org (Dreiheller, Albrecht via Unicode)
Date: Fri, 25 Aug 2017 08:52:56 +0000
Subject: AW: Rendering variants of U+3127 Bopomofo Letter I
In-Reply-To: <ffbee15d-ac38-0d0e-56e8-f7f6a227fb2b@att.net>
References: <3E10480FE4510343914E4312AB46E74212CD95EE@DEFTHW99EH5MSX.ww902.siemens.net>
 <ffbee15d-ac38-0d0e-56e8-f7f6a227fb2b@att.net>
Message-ID: <3E10480FE4510343914E4312AB46E74212CD9AC1@DEFTHW99EH5MSX.ww902.siemens.net>

Thanks a lot!
Albrecht

From: Ken Whistler [mailto:kenwhistler at att.net]
Sent: Thursday, August 24, 2017 18:41
To: Dreiheller, Albrecht (DF MC TTI PLM 6)
Cc: unicode at unicode.org
Subject: Re: Rendering variants of U+3127 Bopomofo Letter I


Albrecht,

See TUS, Section 18.3, Bopomofo, p. 707:

http://www.unicode.org/versions/Unicode10.0.0/ch18.pdf#G22553

--Ken

On 8/24/2017 12:19 AM, Dreiheller, Albrecht via Unicode wrote:
Hello Chinese experts,

The Letter I in the Bopomofo alphabet (U+3127) has two rendering variants, a vertical bar and a horizontal bar.

Can anyone please tell me the context criteria for when each variant should be used?
Is it PR China that uses the vertical form (as in the font SimSun) and Taiwan that uses the horizontal form (as in the font PMingLiU)?

Thanks
Albrecht


From unicode at unicode.org  Fri Aug 25 06:57:37 2017
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Fri, 25 Aug 2017 12:57:37 +0100 (BST)
Subject: Unicode education in Schools
In-Reply-To: <20170825002340.6c8c7798@JRWUBU2>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
Message-ID: <27487708.22771.1503662257061.JavaMail.defaultUser@defaultHost>

Richard Wordingham wrote:

> Just steer them away from UTF-16!  (And vigorously prohibit the very concept of UCS-2).

UTF-16 is very useful. I use it in my research project.

If the byte content of a UTF-16 file is displayed in a hexadecimal display then for all plane 0 characters the byte content of the character codes is thereby displayed directly.

Also, all characters that can be encoded in Unicode can be stored in a UTF-16 file.
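
A small sketch of both statements, assuming big-endian byte order with no
byte order mark (the helper name utf16be_hex is hypothetical): plane 0
characters appear as their code point in hexadecimal, while a supplementary
character is stored as a surrogate pair.

```python
def utf16be_hex(text):
    """Return the UTF-16BE encoding of text as space-separated hex code units."""
    data = text.encode("utf-16-be")
    return " ".join(data[i:i + 2].hex().upper() for i in range(0, len(data), 2))


# Plane 0: the hex dump reads exactly like the code points U+0041 and U+0E01.
print(utf16be_hex("A\u0E01"))

# Plane 1: U+10000 is stored, but as two surrogate code units.
print(utf16be_hex("\U00010000"))
```

With little-endian files the two bytes of each code unit are swapped in the
dump, so the direct reading only holds for the byte order assumed here.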

William Overington

Friday 25 August 2017


----Original message----
From : unicode at unicode.org
Date : 2017/08/25 - 00:23 (GMTST)
To : unicode at unicode.org
Subject : Re: Unicode education in Schools

On Thu, 24 Aug 2017 17:17:10 +0000
Andre Schappo via Unicode <unicode at unicode.org> wrote:

> So, I consider it important to familiarise students with SMP
> characters as well as BMP characters. Then when they develop software
> they will, at the start, be thinking beyond ASCII and Unicode BMP
> characters.

Just steer them away from UTF-16!  (And vigorously prohibit the very
concept of UCS-2).

Richard.


From unicode at unicode.org  Sat Aug 26 04:58:38 2017
From: unicode at unicode.org (Norbert Lindenberg via Unicode)
Date: Sat, 26 Aug 2017 02:58:38 -0700
Subject: Unicode education in Schools
In-Reply-To: <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <CAMZ=zj54_xEPgUE4xr16BXbA6nh8F1HHomX0qfL4iNPbj_ndbA@mail.gmail.com>
 <CAMZ=zj49k7S_6x=qXzFfXKjdd3n3ez54P=SC+t1qNLX_fk=9Gw@mail.gmail.com>
 <DM2PR21MB0028CB581E302F53F7750A62D59B0@DM2PR21MB0028.namprd21.prod.outlook.com>
Message-ID: <D206EE91-76C1-4C23-8133-2EFE76E54392@lindenbergsoftware.com>

ECMAScript 6 fixed that, largely along the lines of my proposal:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html

Norbert


> On Aug 24, 2017, at 22:14, Peter Constable via Unicode <unicode at unicode.org> wrote:
>
> I thought Javascript had a UCS-2 understanding of Unicode strings. Has it managed to progress beyond that?
>
> Peter
>
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Starner via Unicode
> Sent: Thursday, August 24, 2017 5:18 PM
> To: Unicode Mailing List <unicode at unicode.org>
> Subject: Fwd: Unicode education in Schools
>
> ---------- Forwarded message ---------
> From: David Starner <prosfilaes at gmail.com>
> Date: Thu, Aug 24, 2017, 6:16 PM
> Subject: Re: Unicode education in Schools
> To: Richard Wordingham <richard.wordingham at ntlworld.com>
>
> On Thu, Aug 24, 2017, 5:26 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>
> Just steer them away from UTF-16!  (And vigorously prohibit the very
> concept of UCS-2).
>
> Richard.
>
> Steer them away from reinventing the wheel. If they use Java, use Java strings. If they're using GTK, use strings compatible with GTK. If they're writing JavaScript, use JavaScript strings. There's basically no system without Unicode strings or that they would be better off rewriting the wheel.
>


From unicode at unicode.org  Sat Aug 26 08:50:49 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 26 Aug 2017 14:50:49 +0100
Subject: Unicode education in Schools
In-Reply-To: <27487708.22771.1503662257061.JavaMail.defaultUser@defaultHost>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2>
 <27487708.22771.1503662257061.JavaMail.defaultUser@defaultHost>
Message-ID: <20170826145049.751b05d7@JRWUBU2>

On Fri, 25 Aug 2017 12:57:37 +0100 (BST)
William_J_G Overington via Unicode <unicode at unicode.org> wrote:

> UTF-16 is very useful. I use it in my research project.

> If the byte content of a UTF-16 file is displayed in a hexadecimal
> display then for all plane 0 characters the byte content of the
> character codes are thereby displayed directly.

But only plane 0.

How tedious (and expensive) would it be to obtain a licence to convert,
and freely share, the UCD to UTF-8 or UTF-16?  The code charts might
have to be a separate issue because of the fonts.

> Also, all characters that can be encoded in Unicode can be stored in
> a UTF-16 file.

Or UTF-8.  UTF-32 support is a bit limited.

Richard.

From unicode at unicode.org  Sat Aug 26 10:09:33 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 26 Aug 2017 16:09:33 +0100
Subject: Unicode education in Schools
In-Reply-To: <83tw0w6tnz.fsf@gnu.org>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
Message-ID: <20170826160933.1e79c5a6@JRWUBU2>

On Fri, 25 Aug 2017 09:36:00 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Fri, 25 Aug 2017 00:23:40 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>
> > 
> > On Thu, 24 Aug 2017 17:17:10 +0000
> > Andre Schappo via Unicode <unicode at unicode.org> wrote:
> >   
> > > So, I consider it important to familiarise students with SMP
> > > characters as well as BMP characters. Then when they develop
> > > software they will, at the start, be thinking beyond ASCII and
> > > Unicode BMP characters.  
> > 
> > Just steer them away from UTF-16!  
> 
> Which will leave them entirely unprepared for the MS-Windows Unicode
> programming, something they of course will never need in their
> careers.

It shouldn't.  UTF-16 works just like UTF-8, except that the code units
are bigger.  The problem is that accidentally ignoring the difference
between UTF-16 and UCS-2 takes longer to be detected, and therefore
correcting the error may be very difficult.  Ignoring the difference
between ASCII (or an 8-bit coding) and UTF-8 shows up very quickly, and
therefore is less difficult to fix, for less is broken by the obvious
correction.
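
A hedged sketch of why the two mistakes surface at different times (the
function survives_truncation is a made-up example of code that wrongly
treats one code unit as one character): cutting text at a code unit boundary
breaks UTF-8 as soon as any non-ASCII letter appears, but breaks UTF-16 only
once a supplementary-plane character turns up.

```python
def survives_truncation(text, encoding, unit_size, max_units):
    """Cut the encoded text after max_units code units and report whether
    what is left still decodes as well-formed text in that encoding."""
    data = text.encode(encoding)[: unit_size * max_units]
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# UTF-8: the first accented letter already exposes the bug.
print(survives_truncation("na\u00EFve", "utf-8", 1, 3))

# UTF-16: the same BMP-only text hides it...
print(survives_truncation("na\u00EFve", "utf-16-le", 2, 3))

# ...until a supplementary-plane character is cut between its surrogates.
print(survives_truncation("na\U0001F600ve", "utf-16-le", 2, 3))
```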

Richard.


From unicode at unicode.org  Sat Aug 26 10:55:25 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 26 Aug 2017 18:55:25 +0300
Subject: Unicode education in Schools
In-Reply-To: <20170826160933.1e79c5a6@JRWUBU2> (message from Richard
 Wordingham via Unicode on Sat, 26 Aug 2017 16:09:33 +0100)
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <20170826160933.1e79c5a6@JRWUBU2>
Message-ID: <83inha5no2.fsf@gnu.org>

> Date: Sat, 26 Aug 2017 16:09:33 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > > Just steer them away from UTF-16!  
> > 
> > Which will leave them entirely unprepared for the MS-Windows Unicode
> > programming, something they of course will never need in their
> > careers.
> 
> It shouldn't.  UTF-16 works just like UTF-8, except that the code units
> are bigger.

Not really, since UTF-8 doesn't have surrogates.

From unicode at unicode.org  Sat Aug 26 12:52:03 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 26 Aug 2017 18:52:03 +0100
Subject: Unicode education in Schools
In-Reply-To: <83inha5no2.fsf@gnu.org>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <20170826160933.1e79c5a6@JRWUBU2> <83inha5no2.fsf@gnu.org>
Message-ID: <20170826185203.05733893@JRWUBU2>

On Sat, 26 Aug 2017 18:55:25 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Sat, 26 Aug 2017 16:09:33 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > units are bigger.  

> Not really, since UTF-8 doesn't have surrogates.

It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
of the few systems that comes close to allowing them the dignity of
integer values of their own - 0x3FFF80 to 0x3FFFFF for the code units
0x80 to 0xFF.
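
The arithmetic can be checked with a short sketch (the variable names are
arbitrary, and emacs_raw_byte_char is a hypothetical helper that only
reproduces the range quoted above, not an Emacs API):

```python
# Classify the 256 possible byte values by their role in well-formed UTF-8.
ascii_bytes = [b for b in range(0x100) if b <= 0x7F]
trailing = [b for b in range(0x100) if 0x80 <= b <= 0xBF]
leading = [b for b in range(0x100) if 0xC2 <= b <= 0xF4]
never_valid = [b for b in range(0x100) if b in (0xC0, 0xC1) or b >= 0xF5]

print(len(trailing), len(leading), len(never_valid))
print(len(trailing) + len(leading))  # the "surrogates" of UTF-8

# Every byte value falls into exactly one class.
total = len(ascii_bytes) + len(trailing) + len(leading) + len(never_valid)
print(total)


def emacs_raw_byte_char(byte):
    """Map a byte 0x80..0xFF onto the range 0x3FFF80..0x3FFFFF quoted above."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF have a raw-byte value")
    return 0x3FFF00 + byte
```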

I well remember when Unicode regular expressions were required to
allow one to search for lone surrogates, but there was no such concept
of looking for isolated ill-associated bytes in Unicode 8-bit strings.

The point is that if one understands how UTF-8 works, UTF-16 is a
system that works using a subset of the same principles, and one should
therefore understand how UTF-16 works, until one comes to the weird and
dubious concept of surrogate points having properties.  I believe the
latter concept is of value only in code that lacks the concept of
gibberish.  In UTF-8, the distinction between code unit value and
Unicode scalar value is very clear; in UTF-16, it is muddied by the
concept of 'codepoint'.

Richard.


From unicode at unicode.org  Sat Aug 26 13:20:45 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sat, 26 Aug 2017 21:20:45 +0300
Subject: Unicode education in Schools
In-Reply-To: <20170826185203.05733893@JRWUBU2> (message from Richard
 Wordingham via Unicode on Sat, 26 Aug 2017 18:52:03 +0100)
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <20170826160933.1e79c5a6@JRWUBU2> <83inha5no2.fsf@gnu.org>
 <20170826185203.05733893@JRWUBU2>
Message-ID: <83fuce5gxu.fsf@gnu.org>

> Date: Sat, 26 Aug 2017 18:52:03 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > > It shouldn't.  UTF-16 works just like UTF-8, except that the code
> > > units are bigger.  
> 
> > Not really, since UTF-8 doesn't have surrogates.
> 
> It has 115 surrogates, thoroughly oppressed by the UTC - there are 64
> trailing surrogates 0x80 to 0xBF, 51 leading surrogates 0xC2 to 0xF4,
> and 0xC0, 0xC1 and 0xF5 to 0xFF suffer the indignity of being the 13
> uncodepoints - not even allowed in Unicode 8-bit strings. Emacs is one
> of the few systems that comes close to allowing them the dignity of
> integer values of their own - 0x3FFF80 to 0x3FFFFF for the code units
> 0x80 to 0xFF.
> 
> I well remember when Unicode regular expressions were required to
> allow one to search for lone surrogates, but there was no such concept
> of looking for isolated ill-associated bytes in Unicode 8-bit strings.
> 
> The point is that if one understands how UTF-8 works, UTF-16 is a
> system that works using a subset of the same principles, and one should
> therefore understand how UTF-16 works, until one comes to the weird and
> dubious concept of surrogate points having properties.  I believe the
> latter concept is of value only in code that lacks the concept of
> gibberish.  In UTF-8, the distinction between code unit value and
> Unicode scalar value is very clear; in UTF-16, it is muddied by the
> concept of 'codepoint'.

We are miscommunicating.  My point was that programming for MS-Windows
needs a good understanding of what the UTF-16 surrogates are, and in
what MS-Windows APIs/library functions they can and cannot be used.
Without this understanding, one cannot figure out why the likes of
iswspace and iswupper only support the BMP, and what APIs to use to
lift this limitation.  Likewise with display-related APIs, used to
display Unicode text.

If you don't teach UTF-16 including these details, the programmers
will feel lost when they meet with these complications.

From unicode at unicode.org  Sat Aug 26 14:28:36 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 26 Aug 2017 20:28:36 +0100
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
Message-ID: <20170826202836.7163d6c8@JRWUBU2>

On Fri, 25 Aug 2017 01:24:36 +0200
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2017-08-17 22:37 GMT+02:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  
> 
> > Fortunately, there is no good evidence that the occurrence
> > of multiple distinct left matras is anything but a typing error,
> > though I can easily see how it might be used as a lexicographical
> > convention on the fuzzy edge of plain text.
> >
> > In a similar vein, in Malayalam, we get repeats of the 2-part vowel
> > U+0D4B MALAYALAM VOWEL SIGN OO (see Cibu Johny's report at
> > https://lists.freedesktop.org/archives/harfbuzz/2013-February/002945.html
> > ),
> > but I'm not sure what the legitimate encodings of the example word
> > കോോോ (typed here as <U+0D15, U+0D4B, U+0D4B, U+0D4B>) are.
 
> Even if there were typing errors, the input method should either
> signal it visually to the user (using canonical reordering), or the
> user could still cancel this reordering (e.g. CTRL+Z for undoing it)
> and the input method could still fix it and maintain the order by
> then inserting combining joiners automatically even if the user did
> not enter them directly.

I don't see how any of ZWJ, ZWNJ and CGJ would help multiple
distinct left matras or repeated 2-part vowels. You might argue for
insertion of U+25CC as a base consonant, along with the ability to
delete just it.

> The joiners should better be removed transparently by the text editor
> without requiring the user to perform complex selections or pressing
> BACKSPACE multiple times, as I don't see any use of these joiners at
> end of graphemes, or multiple joiners in a sequence.

I believe <ZWNJ, ZWJ> has a rôle in some Arabic script writing systems,
and possibly in other cursive Semitic scripts, such as Mongolian.
<Virama, ZWNJ> is required at some syllable boundaries, and it is nice
to have ZWNJ honoured in the sequence <U+1A36 TAI THAM LETTER NA,
U+200C ZWNJ, U+1A63 TAI THAM VOWEL SIGN AA>, which is composed of two
extended grapheme clusters, <U+1A36, U+200C> and <U+1A63>.  This latter,
of course, is no more than one would require of good Latin typography
that works well with an English spell-checker - I would expect 'caecum'
to have a ligature but not 'sundae'.

> Even for Latin, one can freely enter SHY controls at any place within
> words, even if they are not at correct syllabic separations: this will
> impact the rendering if there are linebreaks, but this is done on
> purpose, and still easy to correct if this was made by error (a spell
> checker could also help locate these uncommon errors in existing
> texts but would not automatically correct them without instruction
> given by the user and a user can also choose to ignore/discard these
> signals and store the text as is).

Now that brings to mind some interesting cases - <consonant, SHY, right
matra> and <consonant, SHY, left matra>.  I'm not sure where the
handling should go, but Firefox handles the former reasonably.  My one
gripe is that I don't know how to tell the system that a rendered soft
hyphen is invisible.  Some typographers claim that the glyph for the
soft hyphen (i.e. the glyph for U+00AD) should be used when it becomes
manifest.  I haven't found any cases where a line break should go
between a left matra and a base consonant, but I wouldn't be surprised
to encounter an example in a manuscript in a phonetically ordered
script.  (They are far from unknown in Thai, but that's probably due
to software deficiencies.)  TUS treats the rendering of soft hyphens as
beyond its scope except for line-breaking - the rules are
language-dependent and beyond the scope of Unicode.  I don't know if
CLDR handles rendering around line-breaking soft hyphens.

I'm wondering if there are any cases where a SHY _should_ go between
a Latin letter and diacritic.  I can't think of any.

Richard.


From unicode at unicode.org  Sat Aug 26 14:52:19 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sat, 26 Aug 2017 21:52:19 +0200
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <20170826202836.7163d6c8@JRWUBU2>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
 <20170826202836.7163d6c8@JRWUBU2>
Message-ID: <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>

2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

>
> I'm wondering if there are any cases where a SHY _should_ go between a
> Latin letter and diacritic.  I can't think of any.
>

In standard Latin orthography you would not expect it, normally, but there
will be cases where this will still occur at random places between long
spans of letters.

However I did NOT suggest (like you are doing here) using SHY between a
Latin letter and any diacritic.

But maybe you've been confused by the fact that I took the example of free
insertion of SHY controls in alphabetic scripts in comparison to the free
insertion of joiner controls (not the same thing) between Indic letters
(including vowel matras or subjoined consonants that are encoded as
combining characters but are not really "diacritics").

Of course SHY in this use is not suitable, but who knows whether one will
need this to split in two parts what would otherwise be a single cluster
(possibly reordered by canonical reordering if one needs to split between
two Indic matras). This would suggest there's a need for a new "empty base
consonant" for that Indic script. SHY (U+00AD) would probably not have the
correct effect, because it also inserts an undesired line break opportunity,
independently of which glyph would be rendered and of the position (first or
second line) where it would be rendered if the line break is honored.

If one wants an empty base letter to combine with the diacritic after it,
I think it should be NBSP (U+00A0), to avoid the interpretation as a
"defective" cluster using an implied glyph such as the dotted circle. NBSP
also has its own problems, notably for collation, where it would collate
like a space instead of being ignorable at primary level, but this can be
fixed quite easily in collation tailorings, using collation elements made
with "NBSP+combining matra".

From unicode at unicode.org  Sat Aug 26 16:07:57 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 26 Aug 2017 22:07:57 +0100
Subject: Unicode education in Schools
In-Reply-To: <83fuce5gxu.fsf@gnu.org>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <20170826160933.1e79c5a6@JRWUBU2> <83inha5no2.fsf@gnu.org>
 <20170826185203.05733893@JRWUBU2> <83fuce5gxu.fsf@gnu.org>
Message-ID: <20170826220757.19a50ff5@JRWUBU2>

On Sat, 26 Aug 2017 21:20:45 +0300
Eli Zaretskii via Unicode <unicode at unicode.org> wrote:

> > Date: Sat, 26 Aug 2017 18:52:03 +0100
> > From: Richard Wordingham via Unicode <unicode at unicode.org>

> We are miscommunicating.  My point was that programming for MS-Windows
> needs a good understanding of what the UTF-16 surrogates are, and in
> what MS-Windows APIs/library functions they can and cannot be used.
> Without this understanding, one cannot figure out why the likes of
> iswspace and iswupper only support the BMP, and what APIs to use to
> lift this limitation.  Likewise with display-related APIs, used to
> display Unicode text.

> If you don't teach UTF-16 including these details, the programmers
> will feel lost when they meet with these complications.

So what's new compared to UTF-8?  The problem would be a misconception
that MSVC's wchar_t supported Unicode - or has that been fixed
recently?  The neutral message is to avoid wchar_t where possible.

C++11 and C11's char32_t ought to have fixed the problem.

The functions iswspace() and iswlower() are not stable; one really has to
replace them with the project's own UCD routines.  For example, when the
locale is a Unicode locale with the obvious wchar_t representations, the
value of iswlower(0x13A0) recently changed from non-zero to zero, as
U+13A0 changed from gc=Lo to gc=Lu.  I don't think iswupper() is any
more stable.
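
A minimal sketch of the "project's own UCD routines" idea, written in
Python rather than C purely for illustration.  It assumes that the
General_Category values Lu and Ll are an acceptable stand-in for the
upper/lower tests, and it uses the two UCD snapshots that Python's
unicodedata module ships (the current one and the frozen 3.2.0 one) to
show the gc change for U+13A0.  The helper names is_upper and is_lower
are hypothetical.

```python
import unicodedata

def is_upper(ch, ucd=unicodedata):
    """True if the chosen UCD snapshot gives ch General_Category Lu."""
    return ucd.category(ch) == "Lu"

def is_lower(ch, ucd=unicodedata):
    """True if the chosen UCD snapshot gives ch General_Category Ll."""
    return ucd.category(ch) == "Ll"

CHEROKEE_A = "\u13A0"              # CHEROKEE LETTER A
old_ucd = unicodedata.ucd_3_2_0    # frozen Unicode 3.2.0 tables

# Print the UCD version, the gc value and the result of the upper test
# for each snapshot; the answer depends on which snapshot is pinned.
for ucd in (old_ucd, unicodedata):
    print(ucd.unidata_version, ucd.category(CHEROKEE_A),
          is_upper(CHEROKEE_A, ucd))
```

The point of passing the snapshot explicitly is that the result is tied
to a UCD version chosen by the project, not to whatever the C library's
locale data happens to contain.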

Richard.

From unicode at unicode.org  Sat Aug 26 21:37:54 2017
From: unicode at unicode.org (Eli Zaretskii via Unicode)
Date: Sun, 27 Aug 2017 05:37:54 +0300
Subject: Unicode education in Schools
In-Reply-To: <20170826220757.19a50ff5@JRWUBU2> (message from Richard
 Wordingham via Unicode on Sat, 26 Aug 2017 22:07:57 +0100)
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <20170826160933.1e79c5a6@JRWUBU2> <83inha5no2.fsf@gnu.org>
 <20170826185203.05733893@JRWUBU2> <83fuce5gxu.fsf@gnu.org>
 <20170826220757.19a50ff5@JRWUBU2>
Message-ID: <83d17h68hp.fsf@gnu.org>

> Date: Sat, 26 Aug 2017 22:07:57 +0100
> From: Richard Wordingham via Unicode <unicode at unicode.org>
> 
> > We are miscommunicating.  My point was that programming for MS-Windows
> > needs a good understanding of what the UTF-16 surrogates are, and in
> > what MS-Windows APIs/library functions they can and cannot be used.
> > Without this understanding, one cannot figure out why the likes of
> > iswspace and iswupper only support the BMP, and what APIs to use to
> > lift this limitation.  Likewise with display-related APIs, used to
> > display Unicode text.
> 
> > If you don't teach UTF-16 including these details, the programmers
> > will feel lost when they meet with these complications.
> 
> So what's new compared to UTF-8?

Who said this is new?  I said this needs to be _taught_, or else
people will be ignorant about these subtleties.

From unicode at unicode.org  Sat Aug 26 23:06:04 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Aug 2017 05:06:04 +0100
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
 <20170826202836.7163d6c8@JRWUBU2>
 <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>
Message-ID: <20170827050604.3d60dc30@JRWUBU2>

On Sat, 26 Aug 2017 21:52:19 +0200
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  

> Of course SHY in this use is not suitable, but who knows if one will
> not need this to split in two parts what would be otherwise a single
> cluster (possibly reordered by canonical reordering if one needs to
> split between two Indic matras: this would suggest there's a need for
> a new "empty base consonant" for that Indic script, but SHY (U+00AD)
> should probably not have the correct effect if it also inserts an
> undesired line break opportunity, independently of how the glyph
> which would be rendered and the position (first or second line) where
> it would be rendered if the linebreak is honored).

I am confused as to what conceivable case you have in mind.  An example
would help.  I wonder if I'm misunderstanding what you mean by
'canonical reordering'.  Do you mean the order of codepoints, or the
arrangement of glyphs?  CGJ is available to preserve a specific
ordering of codepoints, though it is completely redundant in most Indic
scripts.

It is a fact that aksharas do get split between lines in manuscripts,
undesirable though it may be.  In a transcription intended to preserve
a division into lines, one would probably use NBSP at such a point,
and worry less about attempting to preserve the structure of the
line-broken akshara.  It seems that Unicode only supports word
boundaries and their absence where they provide or prohibit line
breaks.

Richard.

From unicode at unicode.org  Sat Aug 26 23:26:45 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 27 Aug 2017 05:26:45 +0100
Subject: Unicode education in Schools
In-Reply-To: <28085931-177C-49AA-AFDF-FE37003CC6E7@gmail.com>
References: <FE0577AA-A6CB-4CE5-BC61-B4D8E6A2C361@lboro.ac.uk>
 <CAH-HCWUwip5+nusX6VbYpEw4P8xm1ZgUmuJp7EaTdVA2GnP89Q@mail.gmail.com>
 <35ED0DD2-829F-4507-9422-C36AE2AFBE49@lboro.ac.uk>
 <20170825002340.6c8c7798@JRWUBU2> <83tw0w6tnz.fsf@gnu.org>
 <28085931-177C-49AA-AFDF-FE37003CC6E7@gmail.com>
Message-ID: <20170827052645.3daa49fe@JRWUBU2>

On Fri, 25 Aug 2017 09:36:44 -0400
John W Kennedy <john.w.kennedy at gmail.com> wrote:

> Just a reminder that in Apple's Swift a "Character" is anything that
> looks like a character, including a letter with any theoretically
> unlimited stack of diacritics, a flag, or a skin-toned emoji, and all
> Swift functions working with characters, strings, and substrings
> count characters in this way. There is an underlying store that is,
> for historic reasons, UTF-16, and that can be accessed, but so can
> UTF-8 and UTF-32.

Can the individual Unicode characters be accessed one by one, e.g. for
searching for vowels or other such 'diacritics'?  Or would one only
have access to the code units?

Could one easily search for a subjoined consonant, e.g. COENG RO
<U+17D2 KHMER SIGN COENG, U+179A KHMER LETTER RO> in Khmer, where the
two constituent characters would be in adjacent extended grapheme
clusters?
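
For comparison only, here is a minimal sketch of the kind of search
meant, in Python rather than Swift.  Python strings are indexed by code
point, so the sequence can be found regardless of how the text would be
divided into grapheme clusters.  It says nothing about what Swift's own
API offers; find_all is a hypothetical helper and the sample word is
just an illustration.

```python
COENG_RO = "\u17D2\u179A"   # KHMER SIGN COENG + KHMER LETTER RO

def find_all(text, needle):
    """Return the code point offsets of every occurrence of needle."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + 1)
    return positions

# KA, COENG, RO, VOWEL SIGN U, NGO
sample = "\u1780\u17D2\u179A\u17BB\u1784"
print(find_all(sample, COENG_RO))
```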

Richard.


From unicode at unicode.org  Sun Aug 27 12:55:31 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Sun, 27 Aug 2017 19:55:31 +0200
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <20170827050604.3d60dc30@JRWUBU2>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
 <20170826202836.7163d6c8@JRWUBU2>
 <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>
 <20170827050604.3d60dc30@JRWUBU2>
Message-ID: <CAGa7JC3XeNK_RQAHBbhZfi7MnQ6QJjyCYDGyn44pSf8d_7CohQ@mail.gmail.com>

2017-08-27 6:06 GMT+02:00 Richard Wordingham via Unicode <
unicode at unicode.org>:

> On Sat, 26 Aug 2017 21:52:19 +0200
> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>
> > 2017-08-26 21:28 GMT+02:00 Richard Wordingham via Unicode <
> > unicode at unicode.org>:
>
> > Of course SHY in this use is not suitable, but who knows if one will
> > not need this to split in two parts what would be otherwise a single
> > cluster (possibly reordered by canonical reordering if one needs to
> > split between two Indic matras: this would suggest there's a need for
> > a new "empty base consonant" for that Indic script, but SHY (U+00AD)
> > should probably not have the correct effect if it also inserts an
> > undesired line break opportunity, independently of how the glyph
> > which would be rendered and the position (first or second line) where
> > it would be rendered if the linebreak is honored).
>
> I am confused as to what conceivable case you have in mind.  An example
> would help.  I wonder if I'm misunderstanding what you mean by
> 'canonical reordering'.


Canonical reordering unambiguously refers to the canonical
equivalences in TUS. These are automated and can occur at any time, and the
only way to avoid them is to insert joiners. But joiners should never be
needed for normal texts, except to split clusters or introduce semantic
differences where they are relevant. In that case the renderers will
also try to distinguish the sequences; otherwise they can freely reorder
every sequence of diacritics with distinct non-zero combining classes and
will represent all canonically equivalent sequences exactly the same way,
without distinguishing them.
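
A minimal sketch of canonical reordering in Python, using the standard
unicodedata module. The choice of characters (a Latin "a" with an acute
and a dot below) is only an illustration, and nfd_codepoints is a
hypothetical helper. It shows two marks with distinct non-zero combining
classes being reordered by normalization, and CGJ (U+034F, ccc=0)
preventing that.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, ccc=230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, ccc=220
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, ccc=0

def nfd_codepoints(s):
    """Normalize to NFD and list the resulting code points."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]

# Without CGJ the two marks are sorted by combining class.
print(nfd_codepoints("a" + ACUTE + DOT_BELOW))

# With CGJ between them the original order survives normalization.
print(nfd_codepoints("a" + ACUTE + CGJ + DOT_BELOW))
```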

From unicode at unicode.org  Sun Aug 27 21:40:54 2017
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 28 Aug 2017 03:40:54 +0100
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <CAGa7JC3XeNK_RQAHBbhZfi7MnQ6QJjyCYDGyn44pSf8d_7CohQ@mail.gmail.com>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
 <20170826202836.7163d6c8@JRWUBU2>
 <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>
 <20170827050604.3d60dc30@JRWUBU2>
 <CAGa7JC3XeNK_RQAHBbhZfi7MnQ6QJjyCYDGyn44pSf8d_7CohQ@mail.gmail.com>
Message-ID: <20170828034054.6ea3885a@JRWUBU2>

On Sun, 27 Aug 2017 19:55:31 +0200
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2017-08-27 6:06 GMT+02:00 Richard Wordingham via Unicode <
> unicode at unicode.org>:  
 
> Canonical reordering is unambiguously referring to the canonical
> equivalences in TUS. These are automated and can occur at any time,
> and the only way to avoid them is to insert joiners. But they should
> never be needed for normal texts, except to split clusters or
> introduce semantic differences where they are relevant (and in that
> case the renderers will also try to distinguish them, otherwise they
> can freely reorder every sequence of diacritics with distinct
> non-zero combining classes and will represent all canonically
> equivalent sequences exactly the same way without distinguishing them).

This wasn't the sort of problem I was talking about.  The Indic
example with undefined rendering has two left matras with ccc=0.  The
question was whether they should be displayed from left to right (as in
MS Edge) or right to left (as in Firefox).

The problem of diacritics below having different combining classes has
been raised for minority languages in Thai.  There seems a definite
prospect that the rendering order has to depend on the writing system -
and the other order would simply be wrong.  Standardisation occurs
outside the purview of the UTC.  The order may be forced by CGJ,
which is a joiner in name only when it occurs before combining marks.

Richard.

From unicode at unicode.org  Mon Aug 28 00:20:17 2017
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Mon, 28 Aug 2017 07:20:17 +0200
Subject: Character Sequences of Uncertain Rendering (was: Version linking?)
In-Reply-To: <20170828034054.6ea3885a@JRWUBU2>
References: <CAH-HCWUmV5JfRx1K8DmGs3t=mWY3iziS0Cx3j9=7q5J6LCbdFg@mail.gmail.com>
 <CAJ2xs_HQL0U8D3X9jTyVyYwuFT70E8JiS6ao7qy6SYHFCAFWHg@mail.gmail.com>
 <CAH-HCWVxFecjtj3NLot1GuoOW4VDg=HjLnTshU_ZScBnciCMpA@mail.gmail.com>
 <20170817213727.0263c2d8@JRWUBU2>
 <CAGa7JC1zp8RM22B9Pio8hs0SrO_6MqNToR94ZWKUfH0LJgORQQ@mail.gmail.com>
 <20170826202836.7163d6c8@JRWUBU2>
 <CAGa7JC1=v4_oykQ4NyVYUc76gEx_k3DU7+crOKcMnq8oy5Ga4w@mail.gmail.com>
 <20170827050604.3d60dc30@JRWUBU2>
 <CAGa7JC3XeNK_RQAHBbhZfi7MnQ6QJjyCYDGyn44pSf8d_7CohQ@mail.gmail.com>
 <20170828034054.6ea3885a@JRWUBU2>
Message-ID: <CAGa7JC1mF358kKkQA_m1cBqo6G5u13=et7BBRWhAtJzQw+k3=Q@mail.gmail.com>

Actually the matras in question in the first message were neither
left-to-right nor right-to-left; they were two-part vowels, repeatedly
encoded after a base letter.
Malayalam itself is left-to-right, but this only makes sense for the order
of base letters. Matras encoded after a base letter are placed around it
according to the script rules, but two-part vowels cause problems if
multiple ones are used. We know how to order the right parts that are
postposed, but there's no clear order for the left parts that are preposed
(including when there are also preposed one-part vowels).

This is similar to the problem of defining the stacking order when
there are multiple diacritics above (or below) that all compete for
the same position. Generally the option is to render them ordered from
the innermost to the outermost position, so successive diacritics normally
positioned above should stack vertically upward. But there are known
exceptions where they will stack not vertically but
horizontally, either left-to-right or right-to-left, and some cases where
their order will also be reversed.

There are only common positions and stacking options, which should be used
by default in the absence of any kind of joiner between them. For all other
cases, we need additional joiner controls between them. But here, what is
the default for the uncommon case where there are
multiple occurrences of the same two-part matra? In my opinion, they
should still order their respective left parts and right parts from
innermost to outermost, so the left parts will be rendered right-to-left
and the right parts will be rendered left-to-right.

Here the problem is that Firefox does this only for a limited
number (2) of preposed one-part vowels, preposed diacritics, or preposed
left parts (of two-part vowels). So after rendering the first two matras,
there's no space left for the third matra, which will then be rendered
entirely after the cluster, in a separate cluster (missing a base
consonant, so you see the dotted circle glyph in the middle). IE does seem
to do things correctly by supporting more left-side preposed matras or
left-side preposed "half-matras": it first decomposes each two-part matra
into two pseudo-matras, one for each part, and then orders the first
pseudo-matra like other preposed vowels, all by default right-to-left (i.e.
from innermost to outermost when you place the center of view on the base
letter).

But there are no special joiners encoded in Unicode to override the
placement (direction) or relative order of diacritics competing for the
same position. If one were used, it should be encoded just before that
diacritic, but two-part diacritics are even more challenging, as they could
need one or two separate overrides (either for the left part or the right
part, or both!).

However, for the case given above, it makes no sense to use what Google
Chrome currently renders for "കോോോ" (U+0D15, followed by 3 occurrences of
U+0D4B).
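
As a side note, the character properties involved can be checked with a
short Python sketch using the standard unicodedata module. According to
the UCD, U+0D4B has ccc=0 and a canonical decomposition into a left part
(U+0D47) and a right part (U+0D3E), so normalization splits each
occurrence but never reorders them. This only describes the encoding; it
says nothing about how a renderer should place the glyphs.

```python
import unicodedata

KA = "\u0D15"   # MALAYALAM LETTER KA
OO = "\u0D4B"   # MALAYALAM VOWEL SIGN OO, a two-part vowel

print(unicodedata.name(OO))
print(unicodedata.combining(OO))       # canonical combining class
print(unicodedata.decomposition(OO))   # canonical decomposition mapping

# NFD splits every U+0D4B into its two parts, in the encoded order.
seq = KA + OO * 3
nfd = unicodedata.normalize("NFD", seq)
print([f"U+{ord(c):04X}" for c in nfd])
```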

To make it clear, I'll use an ASCII-only notation: <M> for the base letter
(U+0D15), <CD> for the two-part diacritic U+0D4B (C being its left part and
D its right part), and <o> for the dotted circle.
- When we encode <M,CD>, the rendering should be "CMD". It is OK in all
browsers.
- When we encode <M,CD,CD>, we also see "CCMDD" everywhere, including in
Chrome and Firefox.
- Then comes the encoding <M,CD,CD,CD>, which IE correctly renders as
"CCCMDDD", but Chrome and Firefox cannot render it correctly: they first
render <M,CD,CD> as "CCMDD", then comes <CD> left alone without a base
consonant, so a dotted circle is inserted and we see "CoD" as a glued (but
now separate) cluster. The final result is "CCMDDCoD" (which is still not
breakable when trying to select it with keyboard/mouse/touch).
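
The two behaviours in the list above can be mimicked with a toy model in
Python. This is only a sketch of the placement rule argued for here and
of the suspected two-slot limit; it is not how any browser's shaping
engine is actually implemented, and the names place_all, place_limited
and max_left are hypothetical. Each matra is written as a two-letter
string, left part then right part.

```python
DOTTED_CIRCLE = "o"

def place_all(base, matras):
    """Innermost-to-outermost placement with no limit on preposed parts.

    Left parts end up right-to-left (the first matra is closest to the
    base), right parts left-to-right.
    """
    left = "".join(m[0] for m in reversed(matras))
    right = "".join(m[1] for m in matras)
    return left + base + right

def place_limited(base, matras, max_left=2):
    """Same rule, but a cluster only holds max_left (>= 1) preposed parts.

    Assumption: the overflow starts a new cluster on a dotted circle.
    """
    out = place_all(base, matras[:max_left])
    rest = matras[max_left:]
    while rest:
        out += place_all(DOTTED_CIRCLE, rest[:max_left])
        rest = rest[max_left:]
    return out

print(place_all("M", ["CD"] * 3))      # unlimited preposed slots
print(place_limited("M", ["CD"] * 3))  # at most two preposed slots
```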

I think this is caused by the algorithm used in the Chrome and Firefox
renderers, which offers at most two positions for preposed parts when
computing the reordered layout of glyphs. IE does this correctly, either by
not limiting the number of preposed glyphs or by using a higher limit (I
did not test arbitrarily long sequences of preposed vowels or two-part
vowels, i.e. 4 or more of them).

I know that IE/Edge is now capable of rendering very high stacks of
diacritics (this was probably implemented for the Tibetan script, or for
supporting mathematical notations).

But still, overriding the default direction of stacking is unspecified in
Unicode, except for a few documented cases where some joiner controls are
used (for the "liquid" vowels that we consider as consonants in Latin, and
that will be present in words borrowed to Indic languages in their script
using matras) to alter the restation of stacking (but without complex glyph
reordering).

Consider also the case of the acute accent in Greek, whose default position
is altered when it occurs contextually with capital letters, from
above to the left. So <CAPITAL ALPHA, ACUTE> is reordered as
<PREPOSED-ACUTE, CAPITAL ALPHA>, but most Greek fonts will render it like
the precomposed equivalent, using a single assigned glyph, without needing
any reordering. Now consider <CAPITAL ALPHA, MACRON ABOVE, ACUTE>. As the
diacritics have the same combining class, placing them by default above,
they are not freely reorderable in the encoding. But does the ACUTE still
inherit the altered placement after the capital? If so, it would be
reordered too, as <PREPOSED-ACUTE, ALPHA+MACRON>, without stacking; but if
not, where will the ACUTE be? It will likely not be preposed but will stack
vertically centered above the macron. And there's no way to indicate that it
would stack vertically in the other direction, except by using some joiner
and encoding <CAPITAL ALPHA,CGJ,ACUTE,MACRON>, the CGJ before ACUTE
blocking the reordering, to render it in the proposed above-left position.
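
The encoding side of this example can be checked with a short Python
sketch using the standard unicodedata module. It only shows what
normalization does with these sequences: because MACRON and ACUTE share
combining class 230, normalization keeps them in their encoded order, and
<ALPHA, ACUTE> composes to the precomposed U+0386. It says nothing about
where a font or renderer places the glyphs, which is the open question
here.

```python
import unicodedata

ALPHA = "\u0391"   # GREEK CAPITAL LETTER ALPHA
MACRON = "\u0304"  # COMBINING MACRON
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT

def nfd(s):
    return unicodedata.normalize("NFD", s)

def nfc(s):
    return unicodedata.normalize("NFC", s)

# Same combining class, so neither order is rewritten into the other.
print(unicodedata.combining(MACRON), unicodedata.combining(ACUTE))
print(nfd(ALPHA + MACRON + ACUTE) == nfd(ALPHA + ACUTE + MACRON))

# The single-accent case has a precomposed form.
print(f"U+{ord(nfc(ALPHA + ACUTE)):04X}")
```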

The same CGJ could be used to prohibit the default altered placement (and
changed glyph form) of the CEDILLA, which occurs for some Latin letters.

We had the case in Latin of the "double acute" accent, for which the
solution was not to encode a second acute accent preceded by a CGJ, but
to encode a separate double acute instead, so that the accents won't stack
vertically on top of each other. But we have no clear solution to indicate
the correct placement of ACUTE+GRAVE or GRAVE+ACUTE diacritics (should they
stack vertically or horizontally?). Here again we are in a borderline case
where standard orthographies do not provide a "default" best solution, so
we don't know if we can use joiner controls between diacritics, and which
ones. If these diacritics are used in romanizations to mark tones, we could
have multiple tones over the same (long) vowel, which could play a long
"melody".

Another problem came later with the proliferation of letters converted to
diacritics (and possibly needing their own diacritics!). The
question remains open: are the encoded diacritics sufficient to represent
complex layouts? Is the Unicode "standard character model" really correct
and sufficient for all cases?

I'd like to see these problems find a clean solution: it's probably more
important than the active encoding of many emojis (now with very long
sequences for groups of people, which also include their own complex
placement rules).


From unicode at unicode.org  Thu Aug 31 08:07:07 2017
From: unicode at unicode.org (Anshuman Pandey via Unicode)
Date: Thu, 31 Aug 2017 08:07:07 -0500
Subject: The need for a basic register of emoji submissions
Message-ID: <CAOcx4CTdkijR_3cKi63avR-k08GUYVUGzMYX2uG-Ad7v57BQ-g@mail.gmail.com>

There is a need for a basic register of proposals that have been
submitted to the Emoji Subcommittee. Currently, emoji proposals are
posted to the UTC register after they have been reviewed by the ESC as
being actionable by the UTC. For proposals that make the cut, some
time can pass between the date of submission and the date they are
posted. For proposals that are deemed unsuitable, there is simply no
public record.

Consequently, there is no way to know if a particular emoji has been
proposed, either while a submitted proposal is being reviewed or if a
proposal has been rejected. The "Submitting Emoji Proposals" page at
http://unicode.org/emoji/selection.html quixotically notifies the
reader using bold face to "check the Emoji List to make sure your
proposal is new": this list contains emoji that have already been
encoded.

This is a problem. There have been three instances where I have worked
on emoji proposals only to later learn that they were already proposed
earlier. And I learned that only because I check the UTC register
frequently for my script encoding efforts. If there were a basic
register of emoji submissions, I could have easily checked it and
saved the hours I spent in drawing up documents.

The de facto rationale for not posting emoji proposals to the UTC
register right away is that 'there are too many proposals that are
unactionable or of insufficient quality'. But, I think this rationale
does not hold water too well. A basic task of a standards subcommittee
is to maintain a list of artifacts that pertain to its function. For
the ESC, these artifacts include all emoji submissions. And a list of
these artifacts can easily be made available at
http://unicode.org/emoji. That way, instead of pointing prospective
emoji proposal authors to a list of already encoded emoji, the page can
point them to a list of emoji submissions.

This basic register can be as simple as a list of names. If the ESC
wishes to not post other details, that is fine. I am not asking for a
Roadmap.

I see from the announcement made yesterday that the ESC now has (at
least) four members. Congratulations to the new members, whom I believe
to be highly capable of maintaining a simple public list of emoji
submissions in short order.

All my best,
Anshu