From js_choi at icloud.com  Tue Nov  3 12:59:40 2015
From: js_choi at icloud.com (J. S. Choi)
Date: Tue, 03 Nov 2015 12:59:40 -0600
Subject: On emoji and the two rightwards black arrows
In-Reply-To: <CAGa7JC3GDP+O3EXu3Q=7R8yFXqyN5gPAmumt1K3DSgFK0_zdTA@mail.gmail.com>
References: <E3F66161-86EF-4013-A8C6-CD962BEFD2F2@icloud.com>
 <CAGa7JC3GDP+O3EXu3Q=7R8yFXqyN5gPAmumt1K3DSgFK0_zdTA@mail.gmail.com>
Message-ID: <119D9C7A-D475-4BA1-BCBC-871AC66649AE@icloud.com>

Thanks for the reply!

> IMHO, all mappings from other encodings are just best efforts but not normative. In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today…

> …these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).


If the compatibility mappings are not normative or guaranteed to be stable, then that would weaken one of the two objections to the changes proposed in my questions 1 and 2. The compatibility-mapping and IDNA issues are merely supplemental to my main questions, though.

> Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.


Perhaps this is true, but regardless of whether the disunification in 2014 (of the Zapf Dingbat U+27A1 from the DPRK/Wingdings arrows U+2B05–U+2B07) was justified, or whether the creation in 2014 of U+2B95 was justified, they happened nonetheless; the opportunity to object to them seems to have already passed.

U+2B95 now exists, and it exists with the express purpose of completing U+2B05–U+2B07, based on Michel Suignard's new representative glyphs and Mark Davis' comments from earlier this year. However, U+2B95's current absence from UTR #51 and emoji-data.txt, and its lack of text/emoji standardized variation sequences, are perhaps inconsistent with that purpose. The three questions remain:

1. Should U+2B95 be added to the set of emoji characters as given by UTR #51 and emoji-data.txt, in order to complete the harmonization with U+2B05–U+2B07 from 2014?

2. If question 1's answer is yes, then should U+2B95 be given text/emoji standardized variation sequences, just as U+2B05–U+2B07 already have?

3. Regardless of the answers to the above, should notes clarifying the differences in intended usage between U+2B95 (the right black arrow completing U+2B05–U+2B07) and U+27A1 (the Zapf Dingbat) be added to their entries in the Standard's code charts? This might clear up a lot of confusion among users and font creators, and would only make clearer what has already been made explicit by 7.0's glyph changes.
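
For concreteness, the two characters and the kind of variation sequences at issue can be sketched in Python. This is only an illustration under stated assumptions: the helper name is made up, and the sequences it builds for U+2B95 are hypothetical, since question 2 asks whether they should exist at all. U+FE0E and U+FE0F are the existing text and emoji variation selectors.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation

def presentation_sequences(cp):
    """Return the (text, emoji) variation sequences for a code point.

    Hypothetical helper: it only concatenates the selector and does not
    check whether the sequence is actually standardized.
    """
    base = chr(cp)
    return base + VS15, base + VS16

# The two rightwards arrows under discussion, with their character names.
for cp in (0x27A1, 0x2B95):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

# The sequences built here for U+2B95 are hypothetical and are not part
# of the Standard; they show only what question 2 would add.
text_seq, emoji_seq = presentation_sequences(0x2B95)
```

Whether a renderer honours such a sequence depends on the font and platform; the sketch says nothing about display.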

I'm also uncertain as to how I'd even initiate a formal process on this. This isn't even a proposal for a new character; it's a proposal for the inclusion of an already added character and for the addition of clarifying information in the code charts. The forms at http://www.unicode.org/L2/summary.html wouldn't seem to fit this kind of change.

J. S. Choi

> On Oct 30, 2015, at 7:19 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> IMHO, all mappings from other encodings are just best efforts but not normative. In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today (such as CP437):
> 
> The reason is that the old registrations for legacy 8-bit charsets only showed charts of approximate glyphs (often of poor quality, rendered at low resolution on printed paper with stray dots of ink, and later scanned at poor resolution), but no actual properties (and often without even listing any names for them). For a long time those charts were interpreted differently by different vendors (such as printer or screen manufacturers, at a time when dot-matrix printers and displays had poor resolution), and sometimes glyphs changed slightly between device models or versions from the same vendor.
> 
> So characters in those mapping tables were widely used to mean different variants of characters that are now distinguished in the UCS (e.g. in CP437, the symbol that looks like either a big epsilon or an "is member of" math symbol; the mappings to the UCS for other symbols that look like Greek letters in CP437 and similar charsets are not really set in stone, and it is not even clear whether they should map to UCS symbols or to UCS Greek letters; the same applies to various geometric symbols, including arrows and bullets).
> 
> Those mappings are just there to help converting some old documents to the UCS, but the choice is sometimes questionable and some corrections may need to be done to select another character, depending on the context of use. Unfortunately, the existing mappings only document mappings of legacy code positions to a single suggested codepoint, and not their other possible alternatives.
> 
> Then we fall into the category of characters that are easily confusable: maybe these mapping tables do not need to be changed, but should be used together with the data files related to confusable characters (the list was initiated during the development of IDNA). There are other data available (visible in the Unicode charts) that also indicate a few related/similar characters, but these are mostly notes that are not engraved in stone, and this data is difficult to use.
> 
> So in summary, those mapping tables are just suggestions and implementers may still map legacy encodings to different subsets of the UCS. But we should be concerned with the conversion in the other direction, from the UCS to legacy encodings: all candidate UCS code points should be reverse-mapped to the same legacy code position (as much as possible). Those mapping tables are then not part of the stable standard and there's no stability policy about them (IMHO, such a policy should not be adopted). They are just contributions to help the transition to the UCS, and they are also subject to updates when better mappings are developed later; some applications or vendors will still develop their own preferences.
> 
> If you consider the two UCS characters in question, my opinion is that they are basically the same and mappings from Zapf Dingbats or DPRK or Wingdings/Webdings are just kept for historical reasons, but are not necessarily the best ones. And I would see no violation of the standard if a font was made that mapped both UCS characters to exactly the same glyph, using metrics that create a coherent set of black arrows using either the DPRK metrics for all 4 arrows, or the Zapf Dingbats metrics for all 4 arrows. Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.
> 
> But because they are not canonically equivalent, these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).
> 
> 2015-10-30 19:51 GMT+01:00 J.S. Choi <js_choi at icloud.com>:
> # On emoji and the two rightwards black arrows
> 
> (…) The post is about two encoded characters:
> U+27A1 Black Rightwards Arrow <http://www.unicode.org/charts/PDF/U2700.pdf>
> and U+2B95 Rightwards Black Arrow <http://www.unicode.org/charts/PDF/U2B00.pdf>.
> 
> (…)
> In any case, I might make a formal proposal in the future, but I first want to determine here how probable it is that such a proposal would be discussed. What would you say the answers to those three questions are?

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151103/b7f32333/attachment.html>

From markus.icu at gmail.com  Tue Nov  3 15:35:54 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 3 Nov 2015 13:35:54 -0800
Subject: Emoji data in UCD xml ?
In-Reply-To: <CAJ2xs_EyMyhAc-_Ce5QzGv9_5FTXi73vAYLMWRVkRk+dZ=JMRw@mail.gmail.com>
References: <2121253DDDDD4862B07E9F7B762FA59A@erratique.ch>
 <563245DA.2060207@att.net>
 <CAJ2xs_EyMyhAc-_Ce5QzGv9_5FTXi73vAYLMWRVkRk+dZ=JMRw@mail.gmail.com>
Message-ID: <CAN49p6rjOu_=Wz-3s8OjwBuBr3SjGgocMrNRGa7kJBrUX8xBsw@mail.gmail.com>

About http://www.unicode.org/L2/L2015/15299-ucd-emoji-props.pdf
which has

Emoji_Presentation (EP)
• Non_Emoji (NE)
• Default_Text (DT)
• Default_Emoji (DE)
• NA


Why do we need both Non_Emoji and NA? Can't Non_Emoji be the default for
all code points that are not mentioned in the data?
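
The suggestion amounts to an ordinary default lookup. A minimal sketch, assuming the data is held as a mapping from code point to property value; the function name and the sample entries are made up for illustration and are not taken from any Unicode data file:

```python
# Property value names as listed above; Non_Emoji serves as the default.
NON_EMOJI = "Non_Emoji"

# Hypothetical sample entries, for illustration only. They do not claim
# to reflect the real assignment of any code point.
SAMPLE_DATA = {
    0x1F600: "Default_Emoji",
    0x2B05: "Default_Text",
}

def emoji_presentation(cp, data=SAMPLE_DATA):
    """Return the listed value, or Non_Emoji for any unlisted code point."""
    return data.get(cp, NON_EMOJI)
```

With a default like this, a separate NA value would only be needed if "not listed" had to be distinguished from "listed as Non_Emoji".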

markus

From mark at macchiato.com  Tue Nov  3 16:34:34 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 3 Nov 2015 14:34:34 -0800
Subject: Emoji data in UCD xml ?
In-Reply-To: <CAN49p6rjOu_=Wz-3s8OjwBuBr3SjGgocMrNRGa7kJBrUX8xBsw@mail.gmail.com>
References: <2121253DDDDD4862B07E9F7B762FA59A@erratique.ch>
 <563245DA.2060207@att.net>
 <CAJ2xs_EyMyhAc-_Ce5QzGv9_5FTXi73vAYLMWRVkRk+dZ=JMRw@mail.gmail.com>
 <CAN49p6rjOu_=Wz-3s8OjwBuBr3SjGgocMrNRGa7kJBrUX8xBsw@mail.gmail.com>
Message-ID: <CAJ2xs_GFtizq3XJZV1Fy1isowOrJHbceJfv4Uhmd3z8opm84bg@mail.gmail.com>

We have revised this completely; see the R2 version.

Mark

On Tue, Nov 3, 2015 at 1:35 PM, Markus Scherer <markus.icu at gmail.com> wrote:

> About http://www.unicode.org/L2/L2015/15299-ucd-emoji-props.pdf
> which has
>
> Emoji_Presentation (EP)
> • Non_Emoji (NE)
> • Default_Text (DT)
> • Default_Emoji (DE)
> • NA
>
>
> Why do we need both Non_Emoji and NA? Can't Non_Emoji be the default for
> all code points that are not mentioned in the data?
>
> markus
>

From public at khwilliamson.com  Thu Nov  5 09:57:16 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 5 Nov 2015 08:57:16 -0700
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20150327180725.GA9968@math.berkeley.edu>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu>
Message-ID: <563B7C5C.4000209@khwilliamson.com>

Hi,

Several of us are wondering about the reason for reserving bits for the 
extended UTF-8 in perl5.  I'm asking you because you are the apparent 
author of the commits that did this.

To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the 
length of the sequence of bytes that comprise a single character to be 
13 bytes.  This allows code points up to 2**72 - 1 to be represented. 
If the length had been instead 12 bytes, code points up to 2**66 - 1 
could be represented, which is enough to represent any code point 
possible in a 64-bit word.
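
The arithmetic can be checked with a short sketch. It assumes that each continuation byte carries 6 payload bits, as in standard UTF-8, and that the 0xFF start byte itself carries none; the function name is made up for illustration.

```python
PAYLOAD_BITS_PER_CONTINUATION = 6  # continuation bytes have the form 10xxxxxx

def max_code_point(sequence_length):
    """Largest value a sequence of this many bytes can hold.

    Assumption: the start byte (0xFF) contributes no payload bits, so
    only the continuation bytes count.
    """
    continuation_bytes = sequence_length - 1
    return 2 ** (PAYLOAD_BITS_PER_CONTINUATION * continuation_bytes) - 1

assert max_code_point(13) == 2**72 - 1   # 12 continuation bytes, 72 bits
assert max_code_point(12) == 2**66 - 1   # 11 continuation bytes, 66 bits
assert max_code_point(12) >= 2**64 - 1   # already covers any 64-bit word
```

So the thirteenth byte buys 6 bits beyond what a 12-byte form would offer, and the 12-byte form already has 2 bits to spare over a 64-bit word.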

The comments indicate that these extra bits are "reserved".  So we're 
wondering what potential use you had thought of for these bits.

Thanks

Karl Williamson

From unicode at mxmerz.de  Thu Nov  5 11:10:45 2015
From: unicode at mxmerz.de (Maximilian Merz)
Date: Thu, 5 Nov 2015 18:10:45 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <C5977D24-A508-4C5D-80ED-D49880D2F145@mxmerz.de>
References: <C5977D24-A508-4C5D-80ED-D49880D2F145@mxmerz.de>
Message-ID: <AB7B4415-AA15-42B0-9B05-47953119EDE8@mxmerz.de>

Hello,

I did not receive any feedback on my last email, but chose to finalize my proposal anyway; you can download the PDF (673 KB) here: [1].

I would appreciate feedback of any kind.

Best regards,

Max Merz

PS: Is a "computerized font (True Type or PostScript)" also necessary for emoji characters or do SVG/PDF/PNG images suffice here?

[1]: http://mxmerz.de/unicode/Face_with_One_Eyebrow_Raised.pdf

> On 27.10.2015, at 22:03, Max Merz <unicode at mxmerz.de> wrote:
> 
> Hello,
> 
> I would like to submit a proposal to encode an emoji depicting a "face with one eyebrow raised", to indicate scepticism, surprise, concern, or disagreement.
> 
> The "Submitting Character Proposals" page on unicode.org recommends discussing preliminary proposals on this mailing list. I am currently working on my proposal, but I would appreciate general feedback about whether this idea is doomed from the start, has already been discussed, comes at a bad time, etc.
> 
> Best regards,
> 
> Max Merz


From verdy_p at wanadoo.fr  Thu Nov  5 11:25:05 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Nov 2015 18:25:05 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <563B7C5C.4000209@khwilliamson.com>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID: <CAGa7JC380F2A-ZhSRP-QQ9FsPX_oG_RMZs0wR25U64LQQLYWQg@mail.gmail.com>

It won't represent any valid Unicode codepoint (no standard scalar value
defined), so if you use those leading bytes, don't pretend it is for
"UTF-8" (not even "modified UTF-8" which is the variant created in Java for
its internal serialization of unrestricted 16-bit strings, including for
lone surrogates, and modified also in its representation of U+0000 as
<0xC0,0x80> instead of <0x00> in standard UTF-8). You'll have to create
your own charset identifier (e.g. "perl5-UTF-8-extended" or some name
derived from your Perl5 library) and say it is not for use in interchange
of standard text.

The extra code points you'll get are then necessarily for private use (but
still not part of the standard PUA set), and have absolutely no defined
properties from the standard. They should not be used to represent any
Unicode character or character sequence. In any API taking some text input,
those code points will never be decoded and will behave on input like
encoding errors.
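
This behaviour can be observed with a strict decoder. The sketch below uses Python's built-in UTF-8 codec as one example of such an API; the helper name is made up, and the 13-byte sample is an arbitrary byte pattern chosen for illustration, not a real perl5 encoding of any value.

```python
def decodes_as_utf8(data):
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "U+0000 in standard UTF-8": b"\x00",
    "U+0000 in Java modified UTF-8": b"\xc0\x80",
    # Arbitrary illustrative pattern: 0xFF followed by 12 continuation bytes.
    "13-byte sequence starting with 0xFF": b"\xff" + b"\x80" * 12,
}

for label, data in samples.items():
    print(label, decodes_as_utf8(data))
```

Only the first sample is accepted; the other two are reported as decoding errors.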

But these extra code points could be used to represent something else, such
as unique object identifiers for internal use in your application, virtual
object pointers, shared memory block handles, file/pipe/stream I/O handles,
service/API handles, user ids, security tokens, 64-bit content hashes plus
some binary flags, placeholders/references for members in an external
unencoded collection or for URIs, internal glyph ids when converting text
for rendering with one or more fonts, or some internal serialization of
geometric shapes/colors/styles/visual effects...

In the standard UTF-8 those extra byte values are not "reserved" but
permanently assigned to be "invalid", and there are no valid encoded
sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC
version of UTF-8 when it allowed code points up to 31 bits, but even this
RFC is obsolete and should no longer be used and it has never been approved
by Unicode).


2015-11-05 16:57 GMT+01:00 Karl Williamson <public at khwilliamson.com>:

> Hi,
>
> Several of us are wondering about the reason for reserving bits for the
> extended UTF-8 in perl5.  I'm asking you because you are the apparent
> author of the commits that did this.
>
> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the
> length of the sequence of bytes that comprise a single character to be 13
> bytes.  This allows code points up to 2**72 - 1 to be represented. If the
> length had been instead 12 bytes, code points up to 2**66 - 1 could be
> represented, which is enough to represent any code point possible in a
> 64-bit word.
>
> The comments indicate that these extra bits are "reserved".  So we're
> wondering what potential use you had thought of for these bits.
>
> Thanks
>
> Karl Williamson
>

From markus.icu at gmail.com  Thu Nov  5 12:15:28 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 5 Nov 2015 10:15:28 -0800
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <CAGa7JC380F2A-ZhSRP-QQ9FsPX_oG_RMZs0wR25U64LQQLYWQg@mail.gmail.com>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu>
 <563B7C5C.4000209@khwilliamson.com>
 <CAGa7JC380F2A-ZhSRP-QQ9FsPX_oG_RMZs0wR25U64LQQLYWQg@mail.gmail.com>
Message-ID: <CAN49p6rrWmNUPAWp+vRaRTxCq3OurPWPQsGk=y58xZEB=FLWNA@mail.gmail.com>

On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> (0xFF was reserved only in the old RFC version of UTF-8 when it allowed
> code points up to 31 bits, but even this RFC is obsolete and should no
> longer be used and it has never been approved by Unicode).
>

No, even in the original UTF-8 definition, "The octet values FE and FF
never appear." https://tools.ietf.org/html/rfc2279
The highest lead byte was 0xFD.

(For the "really original" version see
http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf)

In the current definition, "The octet values C0, C1, F5 to FF never
appear." https://tools.ietf.org/html/rfc3629 =
https://tools.ietf.org/html/std63
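
The current rule can be checked by brute force. The sketch below encodes every Unicode scalar value with Python's built-in UTF-8 codec (used here as a stand-in for the RFC 3629 definition) and lists the byte values that never occur; the function name is made up for illustration.

```python
def bytes_used_by_utf8():
    """Collect every byte value that occurs in the UTF-8 encoding of some
    Unicode scalar value (surrogates D800..DFFF are not scalar values)."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        used.update(chr(cp).encode("utf-8"))
    return used

# Byte values that never appear in any well-formed UTF-8 sequence.
never_appear = sorted(set(range(256)) - bytes_used_by_utf8())
print(" ".join(f"{b:02X}" for b in never_appear))
```

The loop visits about 1.1 million code points, so it takes a moment to run.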

markus

From richard.wordingham at ntlworld.com  Thu Nov  5 13:19:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 5 Nov 2015 19:19:10 +0000
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <CAGa7JC380F2A-ZhSRP-QQ9FsPX_oG_RMZs0wR25U64LQQLYWQg@mail.gmail.com>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu>
 <563B7C5C.4000209@khwilliamson.com>
 <CAGa7JC380F2A-ZhSRP-QQ9FsPX_oG_RMZs0wR25U64LQQLYWQg@mail.gmail.com>
Message-ID: <20151105191910.317175e6@JRWUBU2>

On Thu, 5 Nov 2015 18:25:05 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> But these extra code points could be used to represent something else,
> such as unique object identifiers for internal use in your
> application, virtual object pointers, shared memory block handles,
> file/pipe/stream I/O handles, service/API handles, user ids,
> security tokens, 64-bit content hashes plus some binary flags,
> placeholders/references for members in an external unencoded
> collection or for URIs, internal glyph ids when converting text
> for rendering with one or more fonts, or some internal serialization
> of geometric shapes/colors/styles/visual effects...

No-one's claiming it is for a Unicode Transformation Format (UTF).  A
possibly relevant example of a something else is a non-precomposed
grapheme cluster, as in Perl6's NFG.  (This isn't a PUA encoding, as
the precomposed characters are created on the fly.)

Richard.

From mark at macchiato.com  Thu Nov  5 14:25:46 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 5 Nov 2015 12:25:46 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <AB7B4415-AA15-42B0-9B05-47953119EDE8@mxmerz.de>
References: <C5977D24-A508-4C5D-80ED-D49880D2F145@mxmerz.de>
 <AB7B4415-AA15-42B0-9B05-47953119EDE8@mxmerz.de>
Message-ID: <CAJ2xs_ENaim5WTCv5FoP_9=+dLa9ZLFgq_u9wsf-AX-FK1FtPA@mail.gmail.com>

The unicode at unicode.org mailing list isn't the right place for submitting
proposals; see the top of
http://www.unicode.org/emoji/selection.html#submission under "submit
as per Document
Submission Details <http://www.unicode.org/pending/docsubmit.html>."

As for the images, that's also discussed there; they should be PNGs of the
specified format.

(And by the way, a very nicely documented proposal!)

Mark

On Thu, Nov 5, 2015 at 9:10 AM, Maximilian Merz <unicode at mxmerz.de> wrote:

> Hello,
>
> I did not receive any feedback on my last email, but chose to finalize my
> proposal anyway; you can download the PDF (673 KB) here: [1].
>
> I would appreciate feedback of any kind.
>
> Best regards,
>
> Max Merz
>
> PS: Is a "computerized font (True Type or PostScript)" also necessary for
> emoji characters or do SVG/PDF/PNG images suffice here?
>
> [1]: http://mxmerz.de/unicode/Face_with_One_Eyebrow_Raised.pdf
>
> > On 27.10.2015, at 22:03, Max Merz <unicode at mxmerz.de> wrote:
> >
> > Hello,
> >
> > I would like to submit a proposal to encode an emoji depicting a "face
> with one eyebrow raised", to indicate scepticism, surprise, concern, or
> disagreement.
> >
> > The "Submitting Character Proposals" page on unicode.org recommends
> discussing preliminary proposals on this mailing list. I am currently
> working on my proposal, but I would appreciate general feedback about
> whether this idea is doomed from the start, has already been discussed,
> comes at a bad time, etc.
> >
> > Best regards,
> >
> > Max Merz
>
>
>

From doug at ewellic.org  Thu Nov  5 14:41:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 05 Nov 2015 13:41:42 -0700
Subject: Question about Perl5 extended UTF-8 design
Message-ID: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> No-one's claiming it is for a Unicode Transformation Format (UTF).

Then they ought not to call it "UTF-8" or "extended" or "modified"
UTF-8, or anything of the sort, even if the bit-shifting algorithm is
based on UTF-8.

"UTF-8 encoding form" is defined as a mapping of Unicode scalar values
-- not arbitrary integers -- onto byte sequences. [D92]
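
The distinction can be made concrete with a small sketch. It assumes the usual definition of a scalar value (any code point from 0 to 10FFFF except the surrogates D800 to DFFF) and delegates the actual encoding to Python's built-in codec; both function names are made up for illustration.

```python
def is_unicode_scalar_value(n):
    """True for code points 0..10FFFF, excluding surrogates D800..DFFF."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)

def utf8_encode(n):
    """Encode a scalar value as UTF-8; refuse any other integer."""
    if not is_unicode_scalar_value(n):
        raise ValueError(f"{n:#x} is not a Unicode scalar value")
    return chr(n).encode("utf-8")
```

Under this definition a value such as 2**66 - 1 has no UTF-8 encoding at all, whatever byte sequence a UTF-8-like bit-shifting scheme might produce for it.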

--
Doug Ewell | http://ewellic.org | Thornton, CO


From doug at ewellic.org  Thu Nov  5 14:47:12 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 05 Nov 2015 13:47:12 -0700
Subject: Emoji Proposal: Face With One Eyebrow Raised
Message-ID: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>

Mark Davis wrote:

> The unicode_at_unicode.org mailing list isn't the right place for
> submitting proposals; see the top of
> http://www.unicode.org/emoji/selection.html#submission under "submit
> as per Document Submission Details
> <http://www.unicode.org/pending/docsubmit.html>."

To be fair, Max did cite his reason for doing so:

> The "Submitting Character Proposals" page on unicode.org recommends
> to discuss preliminary proposals on this mailing list

That page says:

"Experience has shown that it is often helpful to discuss preliminary
proposals before submitting a detailed proposal. One open forum for such
discussion is the Unicode mail list. (See Public Email Distribution
Lists for subscription instructions.)  Sponsors are urged to send a
message of inquiry or a preliminary proposal there before formal
submission. Many problems and questions can be dealt with there,
minimizing the severity of later revisions."

--
Doug Ewell | http://ewellic.org | Thornton, CO


From steve at swales.us  Thu Nov  5 14:54:22 2015
From: steve at swales.us (Steve Swales)
Date: Thu, 5 Nov 2015 12:54:22 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
Message-ID: <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>

Idly wondering if we should have an EMOJI_VARIANT_VULCAN variant selector as well.

-steve

> On Nov 5, 2015, at 12:47 PM, Doug Ewell <doug at ewellic.org> wrote:
> 
> Mark Davis wrote:
> 
>> The unicode_at_unicode.org mailing list isn't the right place for
>> submitting proposals; see the top of
>> http://www.unicode.org/emoji/selection.html#submission under "submit
>> as per Document Submission Details
>> <http://www.unicode.org/pending/docsubmit.html>."
> 
> To be fair, Max did cite his reason for doing so:
> 
>> The "Submitting Character Proposals" page on unicode.org recommends
>> to discuss preliminary proposals on this mailing list
> 
> That page says:
> 
> "Experience has shown that it is often helpful to discuss preliminary
> proposals before submitting a detailed proposal. One open forum for such
> discussion is the Unicode mail list. (See Public Email Distribution
> Lists for subscription instructions.)  Sponsors are urged to send a
> message of inquiry or a preliminary proposal there before formal
> submission. Many problems and questions can be dealt with there,
> minimizing the severity of later revisions."
> 
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
> 
> 


From steve at swales.us  Thu Nov  5 15:05:07 2015
From: steve at swales.us (Steve Swales)
Date: Thu, 5 Nov 2015 13:05:07 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
 <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>
Message-ID: <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>

Or perhaps a slightly greenish skin-tone.  This would be useful for depicting dark-net hackers and such as well.

-steve

> On Nov 5, 2015, at 12:54 PM, Steve Swales <steve at swales.us> wrote:
> 
> Idly wondering if we should have an EMOJI_VARIANT_VULCAN variant selector as well.
> 
> -steve
> 
>> On Nov 5, 2015, at 12:47 PM, Doug Ewell <doug at ewellic.org> wrote:
>> 
>> Mark Davis wrote:
>> 
>>> The unicode_at_unicode.org mailing list isn't the right place for
>>> submitting proposals; see the top of
>>> http://www.unicode.org/emoji/selection.html#submission under "submit
>>> as per Document Submission Details
>>> <http://www.unicode.org/pending/docsubmit.html>."
>> 
>> To be fair, Max did cite his reason for doing so:
>> 
>>> The "Submitting Character Proposals" page on unicode.org recommends
>>> to discuss preliminary proposals on this mailing list
>> 
>> That page says:
>> 
>> "Experience has shown that it is often helpful to discuss preliminary
>> proposals before submitting a detailed proposal. One open forum for such
>> discussion is the Unicode mail list. (See Public Email Distribution
>> Lists for subscription instructions.)  Sponsors are urged to send a
>> message of inquiry or a preliminary proposal there before formal
>> submission. Many problems and questions can be dealt with there,
>> minimizing the severity of later revisions."
>> 
>> --
>> Doug Ewell | http://ewellic.org | Thornton, CO
>> 
>> 
> 
> 


From verdy_p at wanadoo.fr  Thu Nov  5 15:55:03 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Nov 2015 22:55:03 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
 <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>
 <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>
Message-ID: <CAGa7JC2nH_R8NjnUK5h7y9LbnVBpAZCsOZf1BnkC_jorKGqzGQ@mail.gmail.com>

And blue? For Martians or Schtroumpfs (original French name of Peyo's
comics characters; their names vary across languages: los Pitufos,
Smurflars, die Schlümpfe, els Barrufets, the Smurfs)... However there are
also black and green Schtroumpfs.

2015-11-05 22:05 GMT+01:00 Steve Swales <steve at swales.us>:

> Or perhaps a slightly greenish skin-tone.  This would be useful for
> depicting dark-net hackers and such as well.
>
> -steve
>
> > On Nov 5, 2015, at 12:54 PM, Steve Swales <steve at swales.us> wrote:
> >
> > Idly wondering if we should have an EMOJI_VARIANT_VULCAN variant selector
> as well.
> >
> > -steve
> >
> >> On Nov 5, 2015, at 12:47 PM, Doug Ewell <doug at ewellic.org> wrote:
> >>
> >> Mark Davis wrote:
> >>
> >>> The unicode_at_unicode.org mailing list isn't the right place for
> >>> submitting proposals; see the top of
> >>> http://www.unicode.org/emoji/selection.html#submission under "submit
> >>> as per Document Submission Details
> >>> <http://www.unicode.org/pending/docsubmit.html>."
> >>
> >> To be fair, Max did cite his reason for doing so:
> >>
> >>> The "Submitting Character Proposals" page on unicode.org recommends
> >>> to discuss preliminary proposals on this mailing list
> >>
> >> That page says:
> >>
> >> "Experience has shown that it is often helpful to discuss preliminary
> >> proposals before submitting a detailed proposal. One open forum for such
> >> discussion is the Unicode mail list. (See Public Email Distribution
> >> Lists for subscription instructions.)  Sponsors are urged to send a
> >> message of inquiry or a preliminary proposal there before formal
> >> submission. Many problems and questions can be dealt with there,
> >> minimizing the severity of later revisions."
> >>
> >> --
> >> Doug Ewell | http://ewellic.org | Thornton, CO
> >>
> >>
> >
> >
>
>
>

From mark at macchiato.com  Thu Nov  5 16:11:16 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 5 Nov 2015 14:11:16 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
Message-ID: <CAJ2xs_EsSncw0aXUAyHwwBuTTay6uo03PXVsBE68TDVSFvb61Q@mail.gmail.com>

While it is always good to get feedback, I think the advice on that page is
outdated. In practice, most proposals to the UTC are not floated on the
public discussion list.

One certainly can float it, but it shouldn't be "urged". We also should
make clear that the discussions on this list are purely personal opinions,
and predominantly from people who are not actually involved in the encoding
process.

Mark

On Thu, Nov 5, 2015 at 12:47 PM, Doug Ewell <doug at ewellic.org> wrote:

> Mark Davis wrote:
>
> > The unicode_at_unicode.org mailing list isn't the right place for
> > submitting proposals; see the top of
> > http://www.unicode.org/emoji/selection.html#submission under "submit
> > as per Document Submission Details
> > <http://www.unicode.org/pending/docsubmit.html>."
>
> To be fair, Max did cite his reason for doing so:
>
> > The “Submitting Character Proposals” page on unicode.org recommends
> > to discuss preliminary proposals on this mailing list
>
> That page says:
>
> "Experience has shown that it is often helpful to discuss preliminary
> proposals before submitting a detailed proposal. One open forum for such
> discussion is the Unicode mail list. (See Public Email Distribution
> Lists for subscription instructions.)  Sponsors are urged to send a
> message of inquiry or a preliminary proposal there before formal
> submission. Many problems and questions can be dealt with there,
> minimizing the severity of later revisions."
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
>

From nospam-abuse at ilyaz.org  Thu Nov  5 16:11:37 2015
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Thu, 5 Nov 2015 14:11:37 -0800
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <563B7C5C.4000209@khwilliamson.com>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID: <20151105221137.GA5796@math.berkeley.edu>

On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5.  I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl’s strings is the “utf8”
format (not “UTF-8”, “extended” or not).  [I see that this misprint
caused a lot of stir here!]

However, outside of a few contexts, this internal representation
should not be visible.  (However, some of these contexts are close to
the default, like read/write in Unicode mode, with -C switch.)

Perl’s string is just a sequence of Perl’s unsigned integers.
[Depending on the build, this may be, currently, 32-bit or 64-bit.]
By convention, the ?meaning? of small integers coincides with what
Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes.  This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
> 
> The comments indicate that these extra bits are "reserved".  So
> we're wondering what potential use you had thought of for these
> bits.
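
The arithmetic in the quoted question can be checked with a short sketch. It assumes, as the question implies, that the 0xFF start byte carries no payload and that every following byte contributes 6 payload bits, like a standard UTF-8 continuation byte. The function names are illustrative and are not taken from the Perl source.

```python
CONTINUATION_PAYLOAD_BITS = 6  # assumption: same as a standard UTF-8 continuation byte


def payload_bits(sequence_length: int) -> int:
    """Payload bits of a sequence whose start byte (0xFF) carries none."""
    return (sequence_length - 1) * CONTINUATION_PAYLOAD_BITS


def max_code_point(sequence_length: int) -> int:
    """Largest value representable with that many payload bits."""
    return 2 ** payload_bits(sequence_length) - 1


for length in (12, 13):
    bits = payload_bits(length)
    print(f"{length} bytes: {bits} payload bits, covers a 64-bit word: {bits >= 64}")
```

Under these assumptions a 12-byte sequence already has room for any 64-bit value, which is why the extra byte prompts the question about the "reserved" bits.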

First of all, “reserved” means that they have no meaning.  Right?

Second, there are 2 ways in which one may need this INTERNAL format to
be extended:
  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string
(or its part) is removed.  So, while a pointer can fit into a Perl
integer, one needs to specify what to do: call DESTROY, or free(), or
a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed with
“slots” in Perl strings.
  • Integer (≤64 bits)
  • Integer (≥65 bits)
  • Pointer to a Perl object
  • Pointer to a malloc()ed memory
  • Pointer to a struct which knows how to destroy itself.
      struct self_destroy { void *content; void (*destroy)(struct self_destroy *); };
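
The figure of 3 bits follows from having to tell five kinds of slot apart. The sketch below only illustrates that counting; the enum and its member names are hypothetical labels for the five cases listed above, not anything taken from the Perl source.

```python
from enum import IntEnum


class SlotKind(IntEnum):
    """Hypothetical tags for the five kinds of slot listed above."""
    SMALL_INTEGER = 0    # integer fitting in 64 bits
    BIG_INTEGER = 1      # integer needing more than 64 bits
    PERL_OBJECT = 2      # pointer to a Perl object (cleanup via DESTROY)
    MALLOCED_MEMORY = 3  # pointer to malloc()ed memory (cleanup via free())
    SELF_DESTROYING = 4  # pointer to a struct carrying its own destructor


def tag_bits(kind_count: int) -> int:
    """Smallest number of bits that can distinguish kind_count values."""
    return max(kind_count - 1, 0).bit_length()


print(tag_bits(len(SlotKind)))
```

Four kinds would fit in 2 bits; the fifth is what pushes the tag to 3 bits.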

Why might one need objects embedded into strings?  I explained it in
   http://ilyaz.org/interview
(look for “Emacs” near the middle).

Hope this helps,
Ilya

From verdy_p at wanadoo.fr  Thu Nov  5 19:00:54 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 6 Nov 2015 02:00:54 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105221137.GA5796@math.berkeley.edu>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu>
 <563B7C5C.4000209@khwilliamson.com>
 <20151105221137.GA5796@math.berkeley.edu>
Message-ID: <CAGa7JC2gmwAqjQfRq6hSLd30U=iB5kCjheJ1M_znxy-UxftH8Q@mail.gmail.com>

2015-11-05 23:11 GMT+01:00 Ilya Zakharevich <nospam-abuse at ilyaz.org> wrote
>
>   • 128-bit architectures may be at hand (sooner or later).

This is speculation about something that is still not envisioned: a global
worldwide working space where users and applications would interoperate
transparently in a giant virtualized environment. However, this virtualized
environment will be supported by 64-bit OSes that will never need native
support for pointers wider than 64 bits. The 128-bit entities needed for
addressing will not be used to work on units of data but to address some
small selection of remote entities.

Software that required completely parsing chunks of memory data larger
than a 64-bit space would be extremely inefficient. Instead, this data will
be internally structured/paged, and only virtually mapped to some 128-bit
global reference (such as GUIDs/UUIDs) used only to select smaller chunks
within the structure. In most cases those chunks will remain in a 32-bit
space: even in today's 64-bit OSes, the largest pages are 20-bit wide, but
typically 9-bit wide (512-byte sectors) to 12-bit wide (standard VMM and
I/O page sizes, networking MTUs), or about 16-bit wide (such as the
transmission window for TCP). This will not evolve significantly before a
major evolution in the worldwide Internet backbones requiring more than
about 1 Gigabit/s (a speed not even needed for 4K HD video, but needed only
in massive computing grids, still built with a complex mesh of much slower
data links).

With 64-bit we already reach the physical limits of networking links, and
higher speeds using large buses are only for extremely local links whose
lengths are well below a few millimeters, within the chips themselves.

128-bit is possible, however, but not for working spaces (or document
sizes): it is very unlikely that the ANSI C/C++ "size_t" type will ever be
wider than 64 bits (except for a few experiments, which will fail to be
more efficient).

What is more realistic is that internal buses and caches will be 128 bits or
even larger (this is already true for GPU memory), only to support more
parallelism or massive parallelism (typically by using vector
instructions working on sets of smaller values).

And some data need 128-bit values for their numerical ranges (ALUs in
CPU/GPU/APU are already 128-bit, as well as common floating point types)
where extra precision is necessary.

I doubt we'll ever see any true native 128-bit architecture at any time in
our remaining lifetimes. We are still very far from the limit of the 64-bit
architecture, and it won't happen before the next century (if the current
sequential binary model for computing is still used at that time; maybe
computing will use predictive technologies returning only heuristic results
with a very high probability of giving a good solution to the problems
we'll need to solve extremely rapidly, and those solutions will then be
validated using today's binary logic with 64-bit computing).

Even in the case where a global 128-bit networking space would appear,
users will never be exposed to all that; most of this content will be
inaccessible to them (restricted by security or privacy concerns) and
simply unmanageable by them: no one on earth has any idea of what
2^64 bits of global data represents, and no one will ever need it in their
whole life. That amount of data will only be partly implemented by large
organisations trying to build a giant cloud and wishing to interoperate by
coordinating their addressing spaces (for that we now have IPv6).

So your "sooner or later" is very optimistic.

IMHO we'll stay with 64-bit architectures for a very long time, up to the
time when our sequential computing model is deprecated and the concept of
native integer sizes is obsoleted and replaced by other kinds of
computing "units" (notably parallel vectors, distributed computing, and
heuristic computing, or maybe optical computing based on Fourier
transforms on analog signals, or quantum computing, where our simple notion
of "integers" or even "bits" will not even map onto individual
physically placed units; their persistence will not even be localized, and
there will be redundant/fault-tolerant placements).

In fact our computing limits will no longer be in terms of storage space,
but in terms of access time, distance and predictability of results.

The next technologies for faster computing will certainly be
predictive/probabilistic rather than affirmative (as with today's Turing/Von
Neumann machines). "Algorithms" for working with them will be completely
different. Fuzzy logic will be everywhere and we'll need binary logic even
less, except for small problems. We'll have to live with the
possibility of errors, but we already have to live with them even
with our binary logic (due to human bugs, hardware faults, accidents, and
so on). For most problems we don't even need 100% proven solutions.
For example, when viewing a high-quality video, we already accept the
possibility of some "quirks" occurring, and we already accept some minor
alteration of the exact pixel colors where we can't even notice any visible
difference from the original. Another example is what we call a
"scientific proof", which is in fact only a solution with the highest
probability of being correct in almost all known contexts, because we can
never reproduce exactly the same experimental environment: even a basic
binary NAND gate cannot be guaranteed 100% to always return a "0" state
after a defined delay when its inputs are all "1". We can certainly produce
results with the same (or better) probability of giving the expected result
using fuzzy logic (or quantum logic) rather than existing binary logic, and
certainly with smaller computing delays (and better throughput and better
fault tolerance, including after hardware faults or damage, and even
with better security).

From otto.stolz at uni-konstanz.de  Fri Nov  6 05:48:10 2015
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Fri, 6 Nov 2015 12:48:10 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105221137.GA5796@math.berkeley.edu>
References: <55136F5F.6080808@kli.org>
 <CAGJ7U-Urc7m4PafJNg_QDqNjZfvHu7f+G4io9=3WBzkOMFPwWg@mail.gmail.com>
 <551373BB.70200@kli.org>
 <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
 <CAFmvRse43C1nUJ=MHN7xLA36mGDtHBKtCAotSV9h9CfN2uB2CQ@mail.gmail.com>
 <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
 <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
 <20151105221137.GA5796@math.berkeley.edu>
Message-ID: <563C937A.6020806@uni-konstanz.de>

Am 05.11.2015 um 23:11 schrieb Ilya Zakharevich:
> First of all, “reserved” means that they have no meaning.  Right?

Almost.

“Reserved” means that they currently have no meaning
but may be assigned a meaning later; hence you ought
not to use them lest your programs, or data, be invalidated
by later amendments of the pertinent specification.

In contrast, “invalid”, or “ill-formed” (Unicode term),
means that the particular bit pattern may never be used
in a sequence that purports to represent Unicode characters.
In practice, that means that no program is allowed to
send those ill-formed patterns in Unicode-based data exchange,
and every program should refuse to accept those ill-formed
patterns in Unicode-based data exchange.
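
As a small illustration of refusing ill-formed patterns at an interchange boundary, the sketch below relies on Python's strict UTF-8 decoder. The sample byte strings and the helper names are chosen here for illustration; they are not taken from this thread.

```python
# Byte sequences that are ill-formed as UTF-8, with the reason for each.
ILL_FORMED_SAMPLES = {
    "byte 0xFF never appears in UTF-8": b"\xff",
    "overlong encoding of U+0000": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
}


def accept_interchange_data(data: bytes) -> str:
    """Decode strictly; ill-formed input raises UnicodeDecodeError."""
    return data.decode("utf-8", errors="strict")


def is_well_formed(data: bytes) -> bool:
    """Report whether the strict decoder accepts the bytes."""
    try:
        accept_interchange_data(data)
    except UnicodeDecodeError:
        return False
    return True


for reason, sample in ILL_FORMED_SAMPLES.items():
    print(f"{sample!r}: rejected={not is_well_formed(sample)} ({reason})")
```

A program remains free to use a wider internal format; the point of the sketch is only that such patterns are refused when data crosses the boundary.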

What a program does internally is at the discretion (or should
I say: “whim”?) of its author, of course, as long as the
overall effect of the program complies with the standard.

Best wishes,
   Otto Stolz


From richard.wordingham at ntlworld.com  Fri Nov  6 14:32:20 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 6 Nov 2015 20:32:20 +0000
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>
References: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>
Message-ID: <20151106203220.2b2fd15c@JRWUBU2>

On Thu, 05 Nov 2015 13:41:42 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> Richard Wordingham wrote:
> 
> > No-one's claiming it is for a Unicode Transformation Format (UTF).
> 
> Then they ought not to call it "UTF-8" or "extended" or "modified"
> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
> based on UTF-8.

> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
> -- not arbitrary integers -- onto byte sequences. [D92]

If it extends the mapping of Unicode scalar values *into* byte
sequences, then it's an extension.  A non-trivial extension of a
mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.

From public at khwilliamson.com  Fri Nov  6 22:50:04 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Fri, 6 Nov 2015 21:50:04 -0700
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151106203220.2b2fd15c@JRWUBU2>
References: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>
 <20151106203220.2b2fd15c@JRWUBU2>
Message-ID: <563D82FC.5060509@khwilliamson.com>

On 11/06/2015 01:32 PM, Richard Wordingham wrote:
> On Thu, 05 Nov 2015 13:41:42 -0700
> "Doug Ewell" <doug at ewellic.org> wrote:
>
>> Richard Wordingham wrote:
>>
>>> No-one's claiming it is for a Unicode Transformation Format (UTF).
>>
>> Then they ought not to call it "UTF-8" or "extended" or "modified"
>> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
>> based on UTF-8.
>
>> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
>> -- not arbitrary integers -- onto byte sequences. [D92]
>
> If it extends the mapping of Unicode scalar values *into* byte
> sequences, then it's an extension.  A non-trivial extension of a
> mapping of scalar values has to have a larger domain.
>
> I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.
>
> Richard.
>

I have no idea how my original message ended up being marked to send to 
this list.  I'm sorry.  It was meant to be a personal message for 
someone who I believe was involved in the original design.

From karl-pentzlin at acssoft.de  Sat Nov  7 04:38:39 2015
From: karl-pentzlin at acssoft.de (Karl Pentzlin)
Date: Sat, 7 Nov 2015 11:38:39 +0100
Subject: Finnish emoji
Message-ID: <1802721603.20151107113839@acssoft.de>

Just FYI (without any claim of relevance by myself),
this site "produced by the [Finnish] Ministry for Foreign Affairs,
Department for Communication" about an "own set of country themed emoji":
http://finland.fi/life-society/the-headbanger-throws-his-phone-away-and-goes-to-sauna/
- Karl Pentzlin


From wjgo_10009 at btinternet.com  Sat Nov  7 09:00:41 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 7 Nov 2015 15:00:41 +0000 (GMT)
Subject: Finnish emoji (offlist)
In-Reply-To: <1802721603.20151107113839@acssoft.de>
References: <1802721603.20151107113839@acssoft.de>
Message-ID: <20180342.28942.1446908441289.JavaMail.defaultUser@defaultHost>

Hi

Thank you for sharing the link.

This is an interesting development.

Best regards,

William Overington

7 November 2015


----Original message----
>From : karl-pentzlin at acssoft.de
Date : 07/11/2015 - 10:38 (GMTST)
To : unicode at unicode.org
Subject : Finnish emoji

Just FYI (without any claim of relevance by myself),
this site "produced by the [Finnish] Ministry for Foreign Affairs,
Department for Communication" about an "own set of country themed emoji":
http://finland.fi/life-society/the-headbanger-throws-his-phone-away-and-goes-to-sauna/
- Karl Pentzlin


From peroyomaslists at gmail.com  Mon Nov  9 13:32:15 2015
From: peroyomaslists at gmail.com (=?UTF-8?Q?Andr=C3=A9s_Sanhueza?=)
Date: Mon, 9 Nov 2015 16:32:15 -0300
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
Message-ID: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>

Hello. I was looking for information in Spanish about some rare punctuation
symbols and found one in some 19th-century Spanish books (via Google Books)
that I haven't seen referenced anywhere. It was called "millar", which
translates roughly as "thousand". It seems that it had at least four
glyph variants, but the quality of the scans makes it a bit difficult to
reproduce them exactly.

[image: Inline images 1]

A sample from "Manual del cajista" by José María Palacios (1845). It says
(poorly translated):

> The millar ([symbol], or millaron as it is commonly called) is the
> abbreviation for the zeros when one sets amounts in the thousands: so, with
> a single numeral and one of these signs, it can be read as thousands.


The description is not very clear, but I understand that the sign is an
abbreviation of the three zeros that appear in one thousand. So, instead of
writing 40.000, one can write 40[symbol].

In the text the sign is given the look of a turned C with a lightning bolt
in it, but I may be wrong.

[image: Inline images 2]

Another sample, from "Gramática castellana fundada sobre principios
filosóficos" by Francesc Pons i Argentó (1850), with a more
straightforward description.

> Among counters the same name is given to each of these signs [symbol1],
> [symbol2], [symbol3] to denote thousand. So 20[symbol1] is read twenty
> thousand, 30[symbol2], thirty thousand, 40[symbol3], forty thousand.


Now there are three glyph variants. One is a stand-alone turned C. Another
is a turned C with two bars as an overlay. The third looks like two f's
turned 180°, or two j's with a small bar.

Another sample, from "Manual de la tipografia española, ó sea, El Arte de la
imprenta" by Antoni Serra i Oliveres (1852).

[image: Inline images 4]
In this one, the millar looks like an upright C with two overlay bars. The
other symbols mentioned look like familiar ones (the "sueldos" (salaries)
one looks like a small superscript s; I guess it is just an abbreviation.
I'm a bit confused by the letters with diacritics, but they don't seem to
be anything unknown).

Does anyone have more insight into this?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 13411 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20151109/abb1409a/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 35960 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20151109/abb1409a/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 19520 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20151109/abb1409a/attachment-0002.png>

From ken.shirriff at gmail.com  Mon Nov  9 15:32:17 2015
From: ken.shirriff at gmail.com (Ken Shirriff)
Date: Mon, 9 Nov 2015 13:32:17 -0800
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>
References: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>
Message-ID: <CALBHtZz5SiFKbgU4bVDQzCS-yPc-7LUDUj6HnJD3j-0H-2L2aQ@mail.gmail.com>

I took a quick look to see if I could find any other examples, which
probably confuses things more.

Take a look at this book, which describes millar symbols: 20ɔ and 40JJ
(approximately) for 20 thousand and 40 thousand.

https://books.google.com/books?id=FBEMAQAAIAAJ&dq=%22denotar%20el%20millar%22&pg=PA161#v=onepage&q=%22denotar%20el%20millar%22&f=false

Another document says the calderón (i.e. pilcrow ¶) can be used for
thousands.
https://books.google.com/books?id=MtxGAAAAIAAJ&dq=%22denotar%20el%20millar%22&pg=PA214#v=onepage&q=%22denotar%20el%20millar%22&f=false

Ken


From charupdate at orange.fr  Sun Nov 15 07:58:55 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 15 Nov 2015 14:58:55 +0100 (CET)
Subject: Latin glottal stop in ID in NWT, Canada
Message-ID: <1203727072.8694.1447595935864.JavaMail.www@wwinf1p10>

Dear Leo,

Thank you for your kind reply. I hope it will meet anticipated expectations. 

It’s shocking, however, that when traditional languages require supplemental means for a performative orthography, this is referred to as “proliferating arbitrary characters defined as "latin letter" in Unicode”.

I sympathize with your concern about people’s inertia and their eagerness to cut short, root out, and throw away nature’s beauties. Applied to aboriginal languages and to proper names, this practice isn’t like pruning; it’s like spraying herbicide, and even worse: upon flowers.

One often forgets that the ancient Romans themselves added “new characters” in order to be able to spell foreign names efficiently. We find these additional letters at the very end of the “Roman alphabet”, which turns out to be already a kind of “extended Latin.” Today, Arabic and IPA obviously take over the role that Greek or Phoenician played at the time.

The idea that everything must be spelt in US-ASCII, or that everything must be written in Latin-1 or in CP-1252, or that everything must at least be encoded in one single byte, couldn’t have arisen before the computer age. Today, our mission, if we accept it, is to help Unicode extend the invitation to make smarter use of the tool.

IMHO, “ease of data interchange” is meant to be ensured by using UTF-8. And even in plain ASCII, non-ASCII characters can be represented as HTML entities. The problem clearly is not interchange; it’s storage and local processing, and thus an issue of software and related hardware.
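
Both interchange routes can be sketched briefly. The example below uses U+0294 LATIN LETTER GLOTTAL STOP as the sample character (a choice made here because of the thread's subject) and Python's standard codecs; the ASCII route produces a numeric character reference, which is one form of HTML entity.

```python
import html

glottal_stop = "\u0294"  # LATIN LETTER GLOTTAL STOP, picked as the sample character

# Route 1: interchange as UTF-8 bytes.
utf8_bytes = glottal_stop.encode("utf-8")

# Route 2: stay within plain ASCII by using a numeric character reference.
ascii_bytes = glottal_stop.encode("ascii", errors="xmlcharrefreplace")

print(utf8_bytes)
print(ascii_bytes)

# Both routes round-trip to the original character.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_ascii = html.unescape(ascii_bytes.decode("ascii"))
```

Neither route loses the character in transit, which supports the point that the remaining difficulty lies in storage and local processing.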

Here are the means to implement respectfulness towards *all* individuals who aim at respecting their language, their traditions, and the values of faithfulness, democracy, and humanity. To win the actual war, we had best first stop withholding support from our aboriginal and other official languages. French, too, must stop being threatened in Canada. Unity in diversity is part of our strength.

I believe that, once this thread had been launched on the Unicode Mailing List, this had to be added.

Today is likely to be the right time.

Best regards,

Marcel 

On Thu, 29 Oct 2015 10:20:35 -0700, Leo Broukhis  wrote:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0225.html

[Link provided instead of quotation in conformance to List policies.]

From charupdate at orange.fr  Mon Nov 16 06:38:48 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 16 Nov 2015 13:38:48 +0100 (CET)
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>
References: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>
Message-ID: <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>

On Mon, 9 Nov 2015 16:32:15 -0300, Andrés Sanhueza
wrote:

> Hello. I was looking for info in Spanish about some rare punctuation symbols and found one in some Spaniard XIX century books (vía Google books) I haven't seen referenced anywhere. It was called "millar", which translates somewhat like "thousand". It seems that it had at least four glyph variants, yet the quality of the scans make it a bit difficult to reproduce exactly.

> A sample from "Manual del cajista" by José María Palacios (1845).
> In the text the sign is given the look of a turned C with a lightning bolt in it

Upscaled, it looks like a reversed C with two little overlaid solidi. I can't meet the challenge of representing it exactly in Unicode. As an approximation, one might suggest a turned C with (one) small solidus overlay:

U+0186 LATIN CAPITAL LETTER OPEN O, U+0337 COMBINING SHORT SOLIDUS OVERLAY.

Reversed C is available in lowercase only (U+2184 LATIN SMALL LETTER REVERSED C).

> Another sample from "Gramática castellana fundada sobre principios filosóficos" by Francesc Pons i Argentó (1850)
> Now there's three glyphs variants. One is an stand-alone turned C. Other is a turned C with two bars as an overlay. The other looks like two f's turned 180°, or two j's with an small bar.

In digital typography, these turned characters could IMO be raised to sit on the baseline, as is usual in Unicode. The second is in fact a turned colon sign. This can be represented fairly well (provided that overlay combining diacritics are properly implemented):

U+0186 LATIN CAPITAL LETTER OPEN O, U+20E6 COMBINING DOUBLE VERTICAL STROKE OVERLAY

The third looks like a turned small ligature ff. I see no other way than using two turned f's (possibly with reduced letter spacing):

U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE, U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE

> Another sample from "Manual de la tipografia española, ó sea, El Arte de la imprenta" by Antoni Serra i Oliveres (1852).
> In this one, the millar looks like an straight C with two overlay bars.

Since this shape is now U+20A1 COLON SIGN, using it as a thousand sign would be biased by that existing meaning.

On Mon, 9 Nov 2015 13:32:17 -0800, Ken Shirriff wrote:

> Take a look at this book, which describes millar symbols: 20ɔ and 40JJ (approximately)

U+0254 LATIN SMALL LETTER OPEN O as a thousands sign is straightforward, especially with Elzevirian digits, as quoted from

"Critica de lenguaje" by Féliz Ramos i Duarte (1896).

If uppercase is preferred for the double J, I suppose that's to have it dotless. This is available in lowercase:

U+0237 LATIN SMALL LETTER DOTLESS J, U+0237 LATIN SMALL LETTER DOTLESS J.
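
The sequences proposed in this message can be assembled and checked against the Unicode Character Database with Python's `unicodedata` module. The character names are copied from the message; the dictionary keys and helper names are illustrative labels added here. How any of these sequences renders depends on the font and shaping engine, which this sketch does not test.

```python
import unicodedata

# Character names copied from the message; the keys are informal labels.
PROPOSED_SEQUENCES = {
    "turned C with one short solidus": (
        "LATIN CAPITAL LETTER OPEN O",
        "COMBINING SHORT SOLIDUS OVERLAY",
    ),
    "turned C with two bars": (
        "LATIN CAPITAL LETTER OPEN O",
        "COMBINING DOUBLE VERTICAL STROKE OVERLAY",
    ),
    "two turned f's": ("LATIN SMALL LETTER DOTLESS J WITH STROKE",) * 2,
    "open o": ("LATIN SMALL LETTER OPEN O",),
    "double dotless j": ("LATIN SMALL LETTER DOTLESS J",) * 2,
}


def build_sequence(names):
    """Turn a tuple of character names into the corresponding string."""
    return "".join(unicodedata.lookup(name) for name in names)


def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, names in PROPOSED_SEQUENCES.items():
    print(f"{label}: {code_points(build_sequence(names))}")
```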

I'm not sure whether I've answered what Andrés really wanted to learn by launching the thread. In any case I took it as a touchstone for Unicode completeness.

Best regards,

Marcel 

From verdy_p at wanadoo.fr  Mon Nov 16 10:51:11 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 Nov 2015 17:51:11 +0100
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To: <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>
References: <CAPnRZcSA-ncFBjXGy_Q7L0eHhqZ7ipdK3R3kyHSoA5-RvPjYjQ@mail.gmail.com>
 <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>
Message-ID: <CAGa7JC3UNjrurbCdbzTF-DkFcjdXGFDsMQ6ceRhJ11YYw=sgwQ@mail.gmail.com>

Le 16 nov. 2015 13:56, "Marcel Schneider" <charupdate at orange.fr> a ?crit :
> The third looks like a turned small ligature ff. I see no other way than
using two turned f's (eventually with reduced letter spacing):
>
> U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE, U+025F LATIN SMALL
LETTER DOTLESS J WITH STROKE

If this is a ligature of two letters, they really should be joined with
ZWJ. Otherwise the two letters will not have any significance and won't be
associated with the millar; they'll just read as two strange letters which
are not even an abbreviation of the word "millar".
Such a hint is needed in this case, semantically, even if the ligature is
not necessarily rendered.
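
As a rough illustration of the encoding suggested here (a sketch only, not
something from the original message), the sequence could be built in Python
as follows. The variable names are made up for this example, and whether
any font actually ligates the pair is a separate matter.

```python
import unicodedata

ZWJ = "\u200d"       # ZERO WIDTH JOINER
TURNED_F = "\u025f"  # LATIN SMALL LETTER DOTLESS J WITH STROKE

# Two turned f's joined by ZWJ, as a hint that they form one sign.
millar = TURNED_F + ZWJ + TURNED_F

# List the code points so the encoded sequence can be checked by eye.
for ch in millar:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Without the ZWJ the string would be just two independent letters; with it,
the request to ligate stays in the plain text even when a renderer cannot
honor it.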

From jknappen at web.de  Thu Nov 26 02:10:36 2015
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Thu, 26 Nov 2015 09:10:36 +0100
Subject: Aw: New Character Property for Prepended Concatenation Marks
In-Reply-To: <56563758.1040906@unicode.org>
References: <56563758.1040906@unicode.org>
Message-ID: <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/dc0af7f7/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 1743 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/dc0af7f7/attachment.png>

From verdy_p at wanadoo.fr  Thu Nov 26 04:41:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 11:41:47 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
Message-ID: <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>

The root sign is much more complex than just prepending a mark to specific
sequences of characters (from a limited set): when it embeds some "text", it
can do so recursively and, unless you use additional parentheses for the
linear presentation, it depends heavily on the 2D layout of its operand
(additionally, it could itself be prefixed by a superscripted radix value).
Leave it alone: the 2D layout (even in the linear presentation, using
parentheses where needed) will be handled by an additional mathematical
presentation layer and notation.
In basic plain text, the root sign will just stand alone without any
complex layout, and its operand will simply follow it (using parentheses
where needed) without specific rendering.

----

However, the proposal for these prepended concatenation marks does not give
any hint about how to compute the extent of the following clusters
above/over/below/around which they will apply (do they extend over only
letters/digits, but not whitespace or punctuation signs, including
abbreviation marks?).

For me, this kind of visual interaction should be more explicitly delimited
using special marks (working like invisible parentheses): the absence of
these special marks immediately after the prepended concatenation mark
should mean that the mark will not extend beyond the next (non-whitespace)
cluster.


So:

- <ARABIC NUMBER SIGN, SPACE, ARABIC DIGIT ONE> will display the isolated
number sign WITHOUT extending to the following space and digit

- <ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO> will apply the
number sign ONLY to the first digit

- <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, ARABIC DIGIT
TWO, END OF SEQUENCE> will apply the number sign to the two digits

- <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, FULL STOP, ARABIC
DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits
and the separating full stop

- <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, SPACE, ARABIC
DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits
and the separating space

- <ARABIC NUMBER SIGN, START OF SEQUENCE, ARABIC DIGIT ONE, NEWLINE, ARABIC
DIGIT TWO, END OF SEQUENCE> will apply the number sign to the first digit
only, before the newline control; the second digit will appear on the next
line outside the number sign's complex cluster, and the second control will
be ignored (or would display with a "visible control glyph").

Without the <START OF SEQUENCE> and <END OF SEQUENCE> special controls, it
will be necessary anyway to define specific enumerations of characters that
can be part of the sequence on which the prepended mark will apply.
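
The cases listed above can be sketched in Python roughly as follows. This
is only an illustration under loud assumptions: START OF SEQUENCE and END
OF SEQUENCE do not exist in Unicode, so two private-use code points stand
in for them; "ARABIC DIGIT ONE" is taken to be U+0661; and a "cluster" is
approximated as one code point plus any combining marks that follow it.

```python
import unicodedata

NUMBER_SIGN = "\u0600"  # ARABIC NUMBER SIGN
SOS = "\ue000"  # hypothetical START OF SEQUENCE (private-use placeholder)
EOS = "\ue001"  # hypothetical END OF SEQUENCE (private-use placeholder)


def mark_scope(text):
    """Return the characters that a leading number sign would span."""
    if not text.startswith(NUMBER_SIGN):
        raise ValueError("text must start with ARABIC NUMBER SIGN")
    rest = text[1:]
    if rest.startswith(SOS):
        body = rest[1:]
        # The span ends at END OF SEQUENCE or at a newline, whichever
        # comes first.
        for stop in (EOS, "\n"):
            pos = body.find(stop)
            if pos != -1:
                body = body[:pos]
        return body
    # No delimiter: at most the next cluster, and nothing across whitespace.
    if not rest or rest[0].isspace():
        return ""
    end = 1
    while end < len(rest) and unicodedata.category(rest[end]).startswith("M"):
        end += 1
    return rest[:end]


ONE, TWO = "\u0661", "\u0662"  # ARABIC-INDIC DIGIT ONE and TWO
print(repr(mark_scope(NUMBER_SIGN + ONE + TWO)))
print(repr(mark_scope(NUMBER_SIGN + SOS + ONE + "." + TWO + EOS)))
```

The point of the explicit delimiters is visible in the code: with them the
extent is found by scanning for a terminator, and no per-script list of
allowed characters is needed.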

Another complication: when such prepended sequences are recognized, there
are specific tunings to apply in line-breaking algorithms.

Word breaking algorithms may not need specific changes if the enumerations
of characters that can be part of the prepended sequence cannot contain any
word-breaking character. That's why I suggested that, by default, such
enumerations should include only letters and digits but not whitespace (and
probably not punctuation signs such as the dot), plus their additional
combining marks.

- For Arabic U+0600, U+0601 and U+0605 (TUS-9.2, page 374), the enumeration
is supposed to contain only Arabic-Indic or extended Arabic-Indic digits,
but I wonder if it should not include as well number separators, or even
Arabic-European digits.
- Same remark for the Kaithi number sign U+110BD.
- For Syriac U+070F (TUS-9.3, pages 390-391), the enumeration is not so
obvious (all Syriac "letter-numbers"?)

There are also similar characters in other scripts that are not listed: one
example is the Cyrillic hundred-thousands/millions marks U+0488..U+0489,
which possibly enclose more than one digit (currently encoded as combining
marks applicable to only one digit?); another is the Egyptian Hieroglyph
honorific "Cartouche", which encloses the name of a king; other examples
are possible as well in other Asian scripts for honorific marks.

The system using explicitly delimited sequences would work as well with the
Latin script for some honorific "decorators" which are not just ligatures,
e.g. for the name of God or Jesus-Christ (which may also be themselves
abbreviated), including for Quranic transcriptions.

-- Philippe.


2015-11-26 9:10 GMT+01:00 "Jörg Knappen" <jknappen at web.de>:

> I wonder how this concept relates to mathematical notation, especially the
> root sign.
>
> --Jörg Knappen
>
> *Sent:* Wednesday, 25 November 2015 at 23:34
> *From:* announcements at unicode.org
> *To:* announcements at unicode.org
> *Subject:* New Character Property for Prepended Concatenation Marks
>
> The Unicode Technical Committee is seeking feedback on a proposal to
> define a new character property for the class of *prepended concatenation
> marks*, also referred to as *prefixed format control characters* or, more
> generically, as subtending marks. Characters in that class include U+0600
> ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH. The new property, named
> Prepended_Concatenation_Mark and targeted for Unicode 9.0, would provide a
> mechanism to handle subtending marks collectively via properties rather
> than by hardcoded enumeration. A detailed description of the issue and how
> to provide feedback are given in Public Review Issue #310
> <http://www.unicode.org/review/pri310/>.
>
> http://blog.unicode.org/2015/11/new-character-property-for-prepended.html
>
>

From asmus-inc at ix.netcom.com  Thu Nov 26 04:50:51 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 02:50:51 -0800
Subject: Aw: New Character Property for Prepended Concatenation Marks
In-Reply-To: <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
Message-ID: <5656E40B.2050905@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/0cc2180c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/png
Size: 1743 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/0cc2180c/attachment.png>

From asmus-inc at ix.netcom.com  Thu Nov 26 04:56:44 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 02:56:44 -0800
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
Message-ID: <5656E56C.3090205@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/1a8e03cf/attachment.html>

From verdy_p at wanadoo.fr  Thu Nov 26 05:08:43 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 12:08:43 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
Message-ID: <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>

The related definition for extended grapheme clusters says:

( CRLF
| Prepend* ( RI-sequence | Hangul-Syllable | !Control )
           ( Grapheme_Extend | SpacingMark )*
| . )

However, I do not understand why it may include only one Hangul-Syllable
when applying prepended concatenation marks. And if the definition excludes
whitespace, nothing prevents it from extending to arbitrary sequences of
letters/digits/symbols/punctuation (this could span very long sequences of
sinograms, or of other letters from scripts that do not use whitespace as a
word separator). Even in the Latin script it would extend to the
punctuation signs that may follow any word, or to an entire mathematical
formula such as "1+2*3" but not "sin x"...



From asmus-inc at ix.netcom.com  Thu Nov 26 05:38:13 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 03:38:13 -0800
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
 <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>
Message-ID: <5656EF25.9080903@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151126/042cc02e/attachment.html>

From verdy_p at wanadoo.fr  Thu Nov 26 06:29:41 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 13:29:41 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <5656EF25.9080903@ix.netcom.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
 <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>
 <5656EF25.9080903@ix.netcom.com>
Message-ID: <CAGa7JC0qadX_RON1RrZ5boTF4rKu0P0E4bUcFx3=n5keX-S5pQ@mail.gmail.com>

2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-inc at ix.netcom.com>:

> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>
> The related definition for extended grapheme clusters says:
>
> ( CRLF
> | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>            ( Grapheme_Extend | SpacingMark )*
> | . )
>
> However, I do not understand why it may include only one Hangul-Syllable
> when applying prepended concatenation marks. And if the definition excludes
> whitespace, nothing prevents it from extending to arbitrary sequences of
> letters/digits/symbols/punctuation (this could span very long sequences of
> sinograms, or of other letters from scripts that do not use whitespace as a
> word separator). Even in the Latin script it would extend to the
> punctuation signs that may follow any word, or to an entire mathematical
> formula such as "1+2*3" but not "sin x"...
>
>
> White space is clearly NOT part of a grapheme cluster, so I don't see what
> your issue is?
>

No, whitespace is a grapheme cluster on its own, matching (.)

An overlong extended grapheme cluster after a Prepend can occur only
because of ( Grapheme_Extend | SpacingMark )*.
But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore
the rare RI-sequences, which are still short) and will not match the
sequences of digits or letters intended by the prepended concatenation
marks, but only one of them.


> BTW, if after careful analysis you think there is a mistake, you should
> probably raise a bug on this.
>

For now, the proposal only speaks about listing the prepended characters
by means of a newly defined property; it still does not address which
sequences of graphemes they apply to. As these sequences are specific to
each prepended character, I don't see how the new property will help if we
need to specialize each one of these characters: we still need a custom
algorithm (possibly tailored by locale) for breaking clusters that use
them.

With the definition given above, the extended grapheme clusters will break
after each letter/digit/punctuation and
 <ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO>
will still break into
  <ARABIC NUMBER SIGN, ARABIC DIGIT ONE> separated from <ARABIC DIGIT TWO>
The proposed new property does not change this: how can we really extend
the sequence of digits so that the number sign will span all of them? Use
CGJ or explicit sequence delimiters?
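
The break described above can be reproduced with a simplified segmenter.
This is only a sketch under stated assumptions, not a conformant
implementation: Prepend is limited to the marks named in this thread,
( Grapheme_Extend | SpacingMark ) is approximated by the combining-mark
general categories, Control by Cc/Zl/Zp, RI-sequence and Hangul-Syllable
handling is left out, and "ARABIC DIGIT ONE" is taken to be U+0661.

```python
import unicodedata

# Assumption: only the marks named in this thread are treated as Prepend.
PREPEND = {0x0600, 0x0601, 0x0605, 0x06DD, 0x070F, 0x110BD}


def is_extend(ch):
    # Rough stand-in for ( Grapheme_Extend | SpacingMark ).
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")


def is_control(ch):
    # Rough stand-in for the Control class.
    return unicodedata.category(ch) in ("Cc", "Zl", "Zp")


def clusters(text):
    """Split text following the quoted rule, in simplified form."""
    out, i, n = [], 0, len(text)
    while i < n:
        if text.startswith("\r\n", i):  # CRLF
            j = i + 2
        else:
            j = i
            while j < n and ord(text[j]) in PREPEND:  # Prepend*
                j += 1
            if j < n and not is_control(text[j]):
                j += 1  # exactly ONE base character
                while j < n and is_extend(text[j]):
                    j += 1
            elif j == i:
                j = i + 1  # '.' : a lone control character
        out.append(text[i:j])
        i = j
    return out


parts = clusters("\u0600\u0661\u0662")  # number sign, digit one, digit two
print([[f"U+{ord(c):04X}" for c in p] for p in parts])
# → [['U+0600', 'U+0661'], ['U+0662']]
```

Only one base character is consumed after the Prepend run, which is why the
second digit always starts a new cluster.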

From asmus-inc at ix.netcom.com  Thu Nov 26 06:58:44 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 04:58:44 -0800
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <CAGa7JC0qadX_RON1RrZ5boTF4rKu0P0E4bUcFx3=n5keX-S5pQ@mail.gmail.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
 <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>
 <5656EF25.9080903@ix.netcom.com>
 <CAGa7JC0qadX_RON1RrZ5boTF4rKu0P0E4bUcFx3=n5keX-S5pQ@mail.gmail.com>
Message-ID: <56570204.8090209@ix.netcom.com>

On 11/26/2015 4:29 AM, Philippe Verdy wrote:
> 2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) <asmus-inc at ix.netcom.com>:
>
>     On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>>     The related definition for extended grapheme clusters says:
>>
>>     ( CRLF
>>     | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>>                ( Grapheme_Extend | SpacingMark )*
>>     | . )
>>
>>     However, I do not understand why it may include only one
>>     Hangul-Syllable when applying prepended concatenation marks. And
>>     if the definition excludes whitespace, nothing prevents it from
>>     extending to arbitrary sequences of
>>     letters/digits/symbols/punctuation (this could span very long
>>     sequences of sinograms, or of other letters from scripts that do
>>     not use whitespace as a word separator). Even in the Latin script
>>     it would extend to the punctuation signs that may follow any word,
>>     or to an entire mathematical formula such as "1+2*3" but not "sin
>>     x"...
>
>     White space is clearly NOT part of a grapheme cluster, so I don't see
>     what your issue is?
>
>
> No, whitespace is a grapheme cluster on its own, matching (.)
>
> An overlong extended grapheme cluster after a Prepend can occur only
> because of ( Grapheme_Extend | SpacingMark )*.
> But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we
> ignore the rare RI-sequences, which are still short) and will
> not match the sequences of digits or letters intended by the prepended
> concatenation marks, but only one of them.

Prepend in front of an RI-Sequence is really a "defective" cluster in 
terms of the Arabic number sign's definition. So, one thing the Grapheme 
cluster specification should be clear about is that it does not describe 
the breaks in formatting runs needed to implement these characters.

Also, for editing (a common use of grapheme clusters) running these 
together with any following characters is not very useful in my opinion. 
So, perhaps much of the "Prepend" is a bug after all?
>
>     BTW, if after careful analysis you think there is a mistake, you
>     should probably raise a bug on this.
>
>
> For now, the proposal only speaks about listing the prepended
> characters by means of a newly defined property; it still does not
> address which sequences of graphemes they apply to. As
> these sequences are specific to each prepended character, I don't see
> how the new property will help if we need to specialize each one of
> these characters: we still need a custom algorithm (possibly tailored by
> locale) for breaking clusters that use them.

Correct. I wouldn't call that an "algorithm", though: it's the formatting
behavior for that code point. Some of them are similar; as I said, I see
three patterns: following digit, digit run and word run.
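
The three patterns named here could be sketched roughly as below. Which
mark follows which pattern is not stated in this message, so the pattern is
passed in explicitly; the function and pattern names are invented for the
example, and "word run" is approximated as a run of alphanumeric
characters.

```python
def extent(following, pattern):
    """Return the leading part of `following` that the mark would span."""
    if pattern == "digit":
        # Only the single digit right after the mark.
        return following[:1] if following[:1].isdecimal() else ""
    if pattern == "digit_run":
        test = str.isdecimal
    elif pattern == "word_run":
        test = str.isalnum
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    end = 0
    while end < len(following) and test(following[end]):
        end += 1
    return following[:end]


digits = "\u0661\u0662 \u0663"  # two Arabic-Indic digits, a space, one more
print(repr(extent(digits, "digit")))
print(repr(extent(digits, "digit_run")))
```

A real specification of format runs would presumably also have to say what
happens with combining marks and separators inside a run, which this
sketch ignores.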
>
> With the definition given above, the extended grapheme clusters will
> break after each letter/digit/punctuation and
>  <ARABIC NUMBER SIGN, ARABIC DIGIT ONE, ARABIC DIGIT TWO>
> will still break into
> <ARABIC NUMBER SIGN, ARABIC DIGIT ONE> separated from <ARABIC DIGIT TWO>
> The proposed new property does not change this: how can we really
> extend the sequence of digits so that the number sign will span all of
> them? Use CGJ or explicit sequence delimiters?
>
Correct, this gives an incorrect specification; we need an actual
specification for the format runs.

A./


From verdy_p at wanadoo.fr  Thu Nov 26 07:04:27 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 14:04:27 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <CAGa7JC0qadX_RON1RrZ5boTF4rKu0P0E4bUcFx3=n5keX-S5pQ@mail.gmail.com>
References: <56563758.1040906@unicode.org>
 <trinity-6dc8128e-505f-4186-ac12-2feba9147eb6-1448525436930@3capp-webde-bs24>
 <CAGa7JC0JUobPFY8pkxnR2RVSqiTGmqZu_3UZn3j8dgr-UFuFeQ@mail.gmail.com>
 <CAGa7JC1J_fdDoM1Wxp5pWiuyxD_Zm2UfhyWT9Yk+cEgMpTz9Cg@mail.gmail.com>
 <5656EF25.9080903@ix.netcom.com>
 <CAGa7JC0qadX_RON1RrZ5boTF4rKu0P0E4bUcFx3=n5keX-S5pQ@mail.gmail.com>
Message-ID: <CAGa7JC0xjFWKqjNK7xO5d=O_WoGGxQ_0sKyDbcGN+svadKjjmA@mail.gmail.com>

Also, for Kaithi (TUS-15.2 pages 570-571) I note this paragraph:

The character U+110BD kaithi number sign is a format control character that
interacts with digits, occurring either above or below a digit. The
position of the kaithi number sign indicates its usage: when the mark
occurs above a digit, it indicates
a number in an itemized list, similar to U+2116 numero sign. If it occurs
below a digit, it indicates a numerical reference. Like U+0600 arabic
number sign and the other Arabic signs that span numbers (see Section 9.2,
Arabic), the kaithi number sign precedes the numbers they graphically
interact with, rather than following them, as would combining characters.
The U+110BC kaithi enumeration sign is the spacing version of the kaithi
number sign, and is used for inline usage.

However, there's absolutely no indication of how to disambiguate the two
usages and presentations if these are unified within the same U+110BD
character. In both cases it will be encoded before the Kaithi digits. Note
that U+110BC is for a separate standalone usage (as a symbol without any
number), which is a priori much more limited. Possibly something was
forgotten there:
- add an additional (joiner) control between it and the digits for the
numeric reference (e.g. note calls), and none for itemized lists (including
when numbering section headings)?
- or encode a separate character for its usage in numeric references (below
numbers)?

In the Latin script, both usages are generally distinguished but no
specific mark is used (with the exception of the legacy Numero symbol), and
there's no need to tweak the default presentation of clusters:
- the "numero" symbol or abbreviation (N or n + superscript o) is used for
references, or the number itself is put in superscript or between
[brackets],
- but for itemized lists, the indicator is typically a suffix after the
number (e.g. a dot or hyphen punctuation before the item itself, or within
the item itself a superscript "o" or "a", or a superscripted final
abbreviation, such as "e", "er" in French, "st", "nd", "rd" in English...)



From plug.gulp at gmail.com  Fri Nov 27 13:55:55 2015
From: plug.gulp at gmail.com (Plug Gulp)
Date: Fri, 27 Nov 2015 19:55:55 +0000
Subject: ZWJ, ZWNJ and Markup languages.
Message-ID: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>

Hi,

The Unicode Standard 8.0 states in chapter 23, in the section titled
"Cursive Connection and Ligatures" (printed page #814, PDF page #850), that:

"The zero width joiner and non-joiner characters are designed for use
in plain text; they should not be used where higher-level ligation and
cursive control is available. (See Unicode Technical Report #20,
“Unicode in XML and Other Markup Languages,” for more information.)"

I went through TR#20 and did not find any mention that ZWJ and ZWNJ
are not suitable for use with markup languages. On the contrary, ZWJ
and ZWNJ are listed in TR#20 under section 4 titled "Format Characters
Suitable for Use with Markup".

So are ZWJ and ZWNJ characters suitable for use with markup languages
such as HTML and XML?

Thanks and kind regards,

~Plug


From duerst at it.aoyama.ac.jp  Fri Nov 27 19:42:15 2015
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Sat, 28 Nov 2015 10:42:15 +0900
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
References: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
Message-ID: <56590677.1020204@it.aoyama.ac.jp>

On 2015/11/28 04:55, Plug Gulp wrote:

> The Unicode standard 8.0 states in chapter 23, section titled "Cursive
> Connection and Ligatures"(printed page #814, PDF page #850) that:
>
> "The zero width joiner and non-joiner characters are designed for use
> in plain text; they should not be used where higher-level ligation and
> cursive control is available. (See Unicode Technical Report #20,
> “Unicode in XML and Other Markup Languages,” for more information.)"
>
> I went through TR#20 and did not find any mention that ZWJ and ZWNJ
> are not suitable for use with markup languages. On the contrary, ZWJ
> and ZWNJ are listed in TR#20 under section 4 titled "Format Characters
> Suitable for Use with Markup".
>
> So are ZWJ and ZWNJ characters suitable for use with markup languages
> such as HTML and XML?

They are indeed suitable for use with markup languages. They are so 
suitable that they are already provided as entities in RFC 2070, which 
is now historic, and from there on through HTML 4.0 and onwards. Please 
see http://tools.ietf.org/html/rfc2070#section-4.2.

I'm not sure why Unicode 8.0 has the text it has; at the least, this 
should be toned down somewhat to say "they may be replaced by 
higher-level ligation and cursive control mechanisms if available".
Thanks for finding this!

The main reason they remain suitable in markup is that these characters
apply at a single point; creating markup such as <zwj/> and <zwnj/> would
not give any advantages over &zwj;/&zwnj;.
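
To illustrate that the entities are nothing more than the plain characters
at a single point in the text, here is a minimal Python sketch. It assumes
Python's standard html module as the entity decoder; the sample string and
the variable names are made up for illustration and do not come from any
of the specifications mentioned above.

```python
import html
import unicodedata

ZWJ = "\u200d"   # U+200D ZERO WIDTH JOINER
ZWNJ = "\u200c"  # U+200C ZERO WIDTH NON-JOINER

# Hypothetical markup fragment using the named entities.
marked_up = "a&zwj;b&zwnj;c"

# html.unescape resolves named and numeric character references.
decoded = html.unescape(marked_up)

# Collect the Unicode names of the joiner characters that were decoded.
names = [unicodedata.name(ch) for ch in decoded if ch in (ZWJ, ZWNJ)]
print(names)  # ['ZERO WIDTH JOINER', 'ZERO WIDTH NON-JOINER']

# A numeric character reference yields the same character.
same_as_numeric = html.unescape("&#x200D;") == ZWJ
```

After decoding, the text contains the same code points a plain-text
document would contain, so an element such as <zwj/> would add nothing.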

Markup is at its best when it can be applied to nested spans of text. It
is not inconceivable that something like <do_not_ligate_inside>...
</do_not_ligate_inside> could occasionally be useful, but I have
difficulty imagining a use case off the top of my head.

I'll file a bug report with the content of this email.

Regards,   Martin.

From asmus-inc at ix.netcom.com  Fri Nov 27 20:49:31 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 27 Nov 2015 18:49:31 -0800
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
References: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
Message-ID: <5659163B.1080708@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151127/3297ad38/attachment.html>

From asmus-inc at ix.netcom.com  Fri Nov 27 22:14:40 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 27 Nov 2015 20:14:40 -0800
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <56590677.1020204@it.aoyama.ac.jp>
References: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
 <56590677.1020204@it.aoyama.ac.jp>
Message-ID: <56592A30.4070606@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20151127/55aabc52/attachment.html>

From plug.gulp at gmail.com  Sun Nov 29 20:58:18 2015
From: plug.gulp at gmail.com (Plug Gulp)
Date: Mon, 30 Nov 2015 02:58:18 +0000
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <56590677.1020204@it.aoyama.ac.jp>
References: <CAL01L+1OPEzZKGD-iYm8Q4JeVO1mrtq0Zn0cXPOQBor9PzisDQ@mail.gmail.com>
 <56590677.1020204@it.aoyama.ac.jp>
Message-ID: <CAL01L+0Pn0nXrJwyEqPSn2SPo66OT_KaQQ4-oRU1yGPfX_=VDA@mail.gmail.com>

On Sat, Nov 28, 2015 at 1:42 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
>
> They are indeed suitable for use with markup languages. They are so suitable
> that they are already provided as entities in RFC 2070, which is now
> historic, and from there on through HTML 4.0 and onwards. Please see
> http://tools.ietf.org/html/rfc2070#section-4.2.
>

Thank you, Martin, for the information! Yes, I now see that the entities
are indeed specified in the HTML 4 spec here:
http://www.w3.org/TR/html4/sgml/entities.html#h-24.4

Thanks once again for the help!

Kind regards,

~Plug


> I'm not sure why Unicode 8.0 has the text it has; at the least, this should
> be toned down somewhat to say "they may be replaced by higher-level ligation
> and cursive control mechanisms if available".
> Thanks for finding this!
>
> The main reason they remain suitable in markup is that these characters
> apply at a single point; creating markup such as <zwj/> and <zwnj/> would
> not give any advantages over &zwj;/&zwnj;.
>
> Markup is at its best when it can be applied to nested spans of text. It is
> not inconceivable that something like <do_not_ligate_inside>...
> </do_not_ligate_inside> could occasionally be useful, but I have
> difficulty imagining a use case off the top of my head.
>
> I'll file a bug report with the content of this email.
>
> Regards,   Martin.