From nobody_uses at outlook.com  Thu Sep  1 23:39:53 2016
From: nobody_uses at outlook.com (eduardo marin)
Date: Fri, 2 Sep 2016 04:39:53 +0000
Subject: Missing block element characters
Message-ID: <MWHPR2001MB10533F95FD6FEDF582A4E18282E50@MWHPR2001MB1053.namprd20.prod.outlook.com>

It has come to my attention that there are four semi-graphics characters from the ZX-81 character set that are currently un-encoded: https://en.wikipedia.org/wiki/File:ZX81.chars.00-0A.80-8A.png

The last four characters on the right of that chart are the ones in question, for which I propose the following names: UPPER HALF BLOCK MEDIUM SHADE, LOWER HALF BLOCK MEDIUM SHADE, FULL BLOCK-UPPER HALF MEDIUM SHADE and FULL BLOCK-LOWER HALF MEDIUM SHADE. I recommend encoding them in the Miscellaneous Symbols and Arrows block or in the Geometric Shapes Extended block.
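
For comparison with the proposed names, the block element characters that are already encoded can be listed from whatever Unicode data a runtime ships with. The following is only a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `block_element_names` is made up for illustration, and the range U+2580..U+259F is the existing Block Elements block.

```python
import unicodedata


def block_element_names(first=0x2580, last=0x259F):
    """Return {code point: formal name} for the Block Elements range."""
    names = {}
    for cp in range(first, last + 1):
        # unicodedata.name() raises ValueError for code points without a
        # name, so pass a default and skip anything that is unassigned.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            names[cp] = name
    return names


if __name__ == "__main__":
    for cp, name in block_element_names().items():
        print(f"U+{cp:04X} {chr(cp)} {name}")
```

The listing includes UPPER HALF BLOCK, LOWER HALF BLOCK, FULL BLOCK and MEDIUM SHADE, which are the existing building blocks the proposed names combine.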

While one compelling reason for encoding is completing this obsolete character set (allowing for emulation), a much more convincing case (in my opinion) is that it allows greater artistic freedom for anybody decorating their text with Unicode or even creating illustrations: https://www.google.com.mx/search?q=utf-8+art&source=lnms&tbm=isch&sa=X&ved=0ahUKEwi406KI4-_OAhVL7SYKHdxfAUsQ_AUICCgB#imgrc=X9T-ssHdaBoJoM%3A

The second argument implies also encoding the vertical counterparts of these characters and many more variants of block elements with half shading, but a proposal just for these four is a great start before considering and measuring the artistic implications.

Another set of missing characters, for Atari ST emulation, are ATARI LOGO LEFT HALF, ATARI LOGO RIGHT HALF, SMILING MAN WITH PIPE UPPER LEFT, SMILING MAN WITH PIPE UPPER RIGHT, SMILING MAN WITH PIPE LOWER LEFT and SMILING MAN WITH PIPE LOWER RIGHT. These characters are much more objectionable due to their specificity and possible (although unlikely) trademark issues. The set of digits represented as if they were on a seven-segment display could be considered a font variation. There is also what appears to be a negative diagonal and a lozenge, and those are even weirder: https://en.wikipedia.org/wiki/Atari_ST_character_set


From christoph.paeper at crissov.de  Fri Sep  2 05:08:09 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 2 Sep 2016 12:08:09 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
Message-ID: <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>

Christoph Päper <christoph.paeper at crissov.de>:
> 
> I just learned that recent Samsung phones already contain emoji representations for many of these symbols.

And I've finally seen <http://www.unicode.org/L2/L2016/16087-provisional-value-for-emoji.pdf>. A pointer would have been nice.

From mark at macchiato.com  Fri Sep  2 05:42:44 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Fri, 2 Sep 2016 12:42:44 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
Message-ID: <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>

In order to understand the status of any document in the registry, you need
to also look at the minutes of the meeting where they are discussed, in
this case: http://www.unicode.org/L2/L2016/16121.htm

What you see there is:

B.14.3 Provisional value for Emoji property [Emoji SC/Edberg, L2/16-087
<http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/16-087>]

B.14.3.1 Characters Proposed for Emoji=Provisional [Emoji SC/Edberg,
L2/16-088 <http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/16-088>]

Discussion. UTC took no action at this time.

"Took no action" generally means "rejected".


Mark

On Fri, Sep 2, 2016 at 12:08 PM, Christoph Päper <
christoph.paeper at crissov.de> wrote:

> Christoph Päper <christoph.paeper at crissov.de>:
> >
> > I just learned that recent Samsung phones already contain emoji
> > representations for many of these symbols.
>
> And I've finally seen
> <http://www.unicode.org/L2/L2016/16087-provisional-value-for-emoji.pdf>.
> A pointer would have been nice.
>

From christoph.paeper at crissov.de  Fri Sep  2 06:01:46 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 2 Sep 2016 13:01:46 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <20160831092504.665a7a7059d7ee80bb4d670165c8327d.2c28aa3363.wbe@email03.godaddy.com>
References: <20160831092504.665a7a7059d7ee80bb4d670165c8327d.2c28aa3363.wbe@email03.godaddy.com>
Message-ID: <DF6257B5-A868-4920-86D9-A537BD2E6AB2@crissov.de>

Doug Ewell <doug at ewellic.org>:
> [FAZ:]
>> ("The Unicode Consortium appears like a reissue of Orwell's Ministry
>> of Truth, which replaced the English language by a new one, sweeped
>> clean from harmful terms, and which removed "unorthodox" connotations
>> from the rest of the words.")
> 
> I can imagine people with time on their hands criticizing Apple for
> changing the glyph, but how did the Unicode Consortium itself get
> dragged into this? What obvious thing am I missing?

From the FAZ article:

>> Doch nach Darstellung des Portals „Buzzfeed“ intervenierten Apple und Microsoft bei dem für die Standardisierung von Emojis zuständigen Konsortium und verhinderten so die Aufnahme des Gewehrsymbols in den Unicode. Waffenkontrolle funktioniert heute also mit ein paar Code-Änderungen.


My translation:

>> According to the news portal "Buzzfeed", Apple and Microsoft intervened at the consortium responsible for standardizing emojis and thereby prevented the addition of the Rifle symbol to Unicode. Gun control hence works by changing a bit of code nowadays.


Newspapers, especially big and traditional ones, notoriously don't hyperlink their sources. The author, Adrian Lobe, probably references the first of the following Buzzfeed articles by Charlie Warzel (I remembered the Ars Technica article, which is similar in tone):

- <https://www.buzzfeed.com/charliewarzel/thanks-to-apples-influence-youre-not-getting-a-rifle-emoji>
- <https://www.buzzfeed.com/charliewarzel/inside-emojigeddon-the-fight-over-the-future-of-the-unicode>
- <http://arstechnica.com/apple/2016/06/apple-and-microsoft-pushed-to-remove-the-rifle-emoji-from-unicode-9-0/>

The finer details of inclusion in Unicode, `Emoji=Yes`, `Emoji_Presentation=Yes`, inclusion in emoji picker GUIs or default fonts don't really matter to the general public. Two characters were added in TUS9 to a block otherwise only used for (non-compatibility) emojis without making them proper emojis. That's fishy, even to the halfwits. UTC would have been better off not encoding U+1F946 Rifle and U+1F93B Modern Pentathlon at all. The current situation is a half-assed compromise that pleases nobody, and it's really no wonder it gets mixed up with the more recent controversy about the default glyph for U+1F52B Pistol in current beta versions of Apple's OSs.

From leoboiko at namakajiri.net  Fri Sep  2 08:58:49 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Fri, 2 Sep 2016 10:58:49 -0300
Subject: Emoji semantic drift
Message-ID: <CAJ6uix4_T35kVcbSkCFxZFgwMyYJw8KRhfiZcAFv_mi1BaGXrg@mail.gmail.com>

This isn't news, but I find it interesting how some emoji are being used in
ways that differ from their Unicode names, reflecting alternative
interpretations of common glyphs. I'll compare data from the Unicode chart
with interpretations taken from Emojipedia, which I think do reflect
real-world usage:

U+1F617 KISSING FACE 😗
Current keywords: face|kiss
? whistle (= nonchalance ; happiness)
http://emojipedia.org/kissing-face/

U+1F481 INFORMATION DESK PERSON 💁
? person tipping hand
Keywords: hand | help | information | sassy | tipping
? sassy ; hair flick
http://emojipedia.org/information-desk-person/

U+1F601 GRINNING FACE WITH SMILING EYES 😁
Keywords: face | grin
? grimace (discomfort, pain)
http://emojipedia.org/grinning-face-with-smiling-eyes/

U+1F624 FACE WITH LOOK OF TRIUMPH 😤
? face with steam from nose
Keywords: face | triumph| won
? angry; frustration; contemptuous
http://emojipedia.org/face-with-look-of-triumph/
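
The formal names in the list above can be checked against the Unicode data a runtime ships with. This is only a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `formal_name` and the constant `CODE_POINTS` are made up for illustration.

```python
import unicodedata

# Code points taken from the list above.
CODE_POINTS = [0x1F617, 0x1F481, 0x1F601, 0x1F624]


def formal_name(cp):
    """Return the formal Unicode character name for a code point."""
    # The second argument is a fallback for code points that the
    # runtime's Unicode data does not know about.
    return unicodedata.name(chr(cp), "<no name in this Unicode version>")


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(f"U+{cp:04X} {formal_name(cp)}")
```

Character names are stable identifiers rather than descriptions of usage, which is why they can diverge from how a glyph is actually read.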

I see that *some* of those alternative readings are registered in the
Unicode table as ? , while others are present in keywords, and still others
are absent.  Are there any criteria for that? Is someone trying to keep
track of emoji in use?

I think distributional methods are promising, as shown by Thomas Dimson:
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

By this we find that, for example, U+1F46F WOMAN WITH BUNNY EARS, also
marked as *? people partying*, has additional connotations of
sisterhood, specifically female friendship and loyalty ( #sistasista,
#sistersforlife, #sistersister, #bestiesforlife, yearsoffriendship,
#sisterfromanothermister, #morelikesisters, #bffl, #bestiesfortheresties,
#bestfriendsforever ). U+1F647 PERSON BOWING DEEPLY 🙇 is seeing use as a
marker of worry or shame ("late night thoughts", "deleting later", "in my
feelings", "laughing but very serious"); probably due to the emotion lines
drawn on most fontsets.
<http://unicode.org/emoji/charts/emoji-annotations.html#hand>

From irgendeinbenutzername at gmail.com  Fri Sep  2 08:53:38 2016
From: irgendeinbenutzername at gmail.com (Charlotte Buff)
Date: Fri, 2 Sep 2016 15:53:38 +0200
Subject: Missing block element characters
Message-ID: <CAKLR3AqC1RRk-_WrqbVX24h5vf11YhGomvBnX+ch0wp8JiRY0Q@mail.gmail.com>

I'd argue there's a fifth character missing from the ZX-81 set. Notice how
there are two separate MEDIUM SHADEs, one the inverse of the other. For
complete compatibility Unicode would also somehow need a second one of
those.

From christoph.paeper at crissov.de  Fri Sep  2 14:03:03 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 2 Sep 2016 21:03:03 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
Message-ID: <A50CA8C2-ADB5-4210-A3BF-69383CD48764@crissov.de>

Mark Davis ☕️ <mark at macchiato.com>:
> 
> In order to understand the status of any document in the registry, you need to also look at the minutes of the meeting where they are discussed,

I've said it before: Unicode is lacking a proper public issue tracker.


From christoph.paeper at crissov.de  Sat Sep  3 00:08:39 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sat, 3 Sep 2016 07:08:39 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <CAGa7JC2aGkBjL4x7YvUrav9pu5uLUEg_FZU5oAfS3FudjWy67A@mail.gmail.com>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <CAGa7JC2aGkBjL4x7YvUrav9pu5uLUEg_FZU5oAfS3FudjWy67A@mail.gmail.com>
Message-ID: <CD45A6BF-EF68-4EA2-9809-01620130FDFF@crissov.de>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160903/bbbdf8be/attachment.html>

From mark at macchiato.com  Sat Sep  3 00:29:42 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 3 Sep 2016 07:29:42 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <A50CA8C2-ADB5-4210-A3BF-69383CD48764@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
 <A50CA8C2-ADB5-4210-A3BF-69383CD48764@crissov.de>
Message-ID: <CAJ2xs_ExcVJ1cOtXNVzmqetgPeuBMMYoNx5=yKWwXkq1Gse3sw@mail.gmail.com>

There are three main Unicode technical committees: we use an issue tracker
for two of them, but not the UTC (which established its processes earlier).
Personally, I'd favor an issue tracker for the UTC as well, but decisions
as to the process are determined by the committee. (Note again that
nothing said on this list will be taken up by the UTC unless someone
submits a proposal to the UTC.)

Mark

On Fri, Sep 2, 2016 at 9:03 PM, Christoph Päper
<christoph.paeper at crissov.de> wrote:

> Mark Davis ☕️ <mark at macchiato.com>:
> >
> > In order to understand the status of any document in the registry, you
> > need to also look at the minutes of the meeting where they are discussed,
>
> I've said it before: Unicode is lacking a proper public issue tracker.
>
>
>

From mark at macchiato.com  Sat Sep  3 01:17:08 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 3 Sep 2016 08:17:08 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and
 Male/Mars Signs
In-Reply-To: <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
Message-ID: <CAJ2xs_F95xWr2Y6BvF78FG1cQUrAch5Hf6cgRq8XYTy70ExxnQ@mail.gmail.com>

As to your points below.
There is great demand for the choice between female and male, and there is
a specific proposal in E4.0. I have no doubt that it will be accepted.
The addition of emoji is iterative: the acceptance of male and female
forms doesn't preclude a neutral option in a later version. As I've said,
people are actively working on that, and we'll see what they come up with.

I strongly doubt that anyone would be receptive to your ~3,000
characters. Nor do I think that that many characters are required to
"satisfy user expectations".
There were earlier proposals to look at somewhat broadening the number of
existing emoji. There is a very real cost to supporting more emoji
characters, and the committee is being prudent about the number of new
emoji it supports. Some vendors already give additional Unicode characters
an emoji appearance; that is not forbidden.

You aren't wasting your time if you present a grounded proposal for
reasonably-sized additional sets of characters, based on expected usage and
other criteria we've outlined, not just "is related". This is whether
adding new characters, or making existing characters be Emoji: for example,
a set of body parts as you mention. If you see successful proposals from
the past, such as for additional sports symbols, those did not try to
propose the addition of all possible emoji of that type (eg all human
sports activities, or all species of animals), but rather looked at the
most popular sports.

The UTC is not trying to scare away input. We have accepted many proposals
from "small and independent parties". It *does* mean that such independent
parties have to provide a good argument for their proposals, based on
usage and other criteria.


Mark

On Thu, Aug 25, 2016 at 4:52 PM, Christoph Päper <
christoph.paeper at crissov.de> wrote:

> TL;DR: Unicode properties should reflect user expectations, not vendor
> choices.
>
> Mark Davis ☕️ <mark at macchiato.com>:
> > On Mon, Aug 22, 2016 at 11:26 PM, Christoph Päper <
> > christoph.paeper at crissov.de> wrote:
> >> 1. it's incomplete without an explicit neutral/ambiguous alternative and
> >
> > As I said, people are actively investigating what to do about such
> > cases. It may be that the solution is to add ⚲ U+26B2 Neuter, but maybe
> > not. We'll see as they develop further.
>
> Natively speaking a language which can explicitly mark any actor noun with
> a morpheme as female/feminine, but neither as neutral nor as male/masculine
> (a generic version of English "actor/actress", "waiter/waitress",
> "prince/princess"), and having intensely dealt with guidelines for
> corporate languages and public speech, I'll assure you that a feminism/LGBT
> shitstorm will be heading for UTC and vendors if binary gender became
> mandatory for profession emojis. You should not approve Google's and
> Apple's ZWJ sequences without a neutral option.


[snip]


>
> Sorry, this got long.
>

yes. shorter is better

From charupdate at orange.fr  Sat Sep  3 11:30:38 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 3 Sep 2016 18:30:38 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
Message-ID: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>

On Wed, 31 Aug 2016 09:25:05 -0700, Doug Ewell wrote:

>> [...]
> 
> So I took another look and saw that:
> 
> (1) U+1F946 RIFLE has the following cross-reference in NamesList.txt:
> 
> = marksmanship, shooting, hunting
> 
> which does not include any mention of squirt guns or water pistols, or
> generally bowdlerizing the image or changing the intent of this code
> point;
> 
> (2) Section 22.9 "Miscellaneous Symbols" in TUS 9.0 does not make any
> mention of modifying the RIFLE glyph, or symbol glyphs in general, so as
> to alter their meaning;
> 
> (3) the code chart at http://www.unicode.org/charts/PDF/U1F900.pdf
> clearly shows a rifle, and not any other type of gun or non-gun. 
> 
> I can imagine people with time on their hands criticizing Apple for
> changing the glyph, but how did the Unicode Consortium itself get
> dragged into this? What obvious thing am I missing?

Don't mind. That's just another example of "Unicode bashing" that
is sometimes found in European papers since the beginning of the
public existence of the Standard. That is typically issued by people
who didn't learn much about the topic they're writing on (well, like
I didn't when I started mailing here...). The underlying spirit is IMHO
found also in the first ISO 10646 chief editor's attitude when he
enforced bad (wrong, inconsistent or useless) character names just
to make for Europe's superiority (forgetting BTW that "CARON" was
originally US-American internal use standardese).

Now that Christoph Päper found out [1] that the FAZ authors most probably
*did* read the BuzzFeed article I found with Bing and posted on this
Mailing List [2], the issue is complicated in that there is obviously
some dishonest handling of core information by the FAZ authors, except
in the case that they were unable to understand the difference between
a character encoding refusal and an emoji property value change, or, as
of the PISTOL emoji, the difference between a character and a glyph.

Apple could have made use of the possibility to shift the meaning of an
emoji, a not uncommon phenomenon, according to Leonardo Boiko's last
findings [3]. Actually they didn't have much choice, being urged to hide
from the public area as far as possible the meaning of a fire weapon.
The really troublesome thing is that German newspaper journalists are
eager to promote guns, rifles and other pistols for interpersonal
messaging. As I already said, much of the latter is performed by children.
That's the biggest reason why I find it OK that no effective pistols be
provided in image. It seems that this FAZ article was written by some
unmarried, unresponsive beginners.

However, since they talk of the RIFLE character as if it didn't exist in
Unicode (and not only were "missing" amidst the iOS emoji), it's hard
for me to make any sense except by considering those utterings as a kind
of neonazi propaganda junk (despite of the renown of the newspaper itself)
due most probably to the fact that the responsible chief editor was on
holidays.

So as I said: Don't mind.

Marcel

[1] http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0003.html
[2] http://www.unicode.org/mail-arch/unicode-ml/y2016-m08/0091.html
[3] http://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0004.html


From asmusf at ix.netcom.com  Sat Sep  3 16:21:22 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sat, 3 Sep 2016 14:21:22 -0700
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
Message-ID: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>

I don't think that there should be a place on this list for accusing 
people of dishonesty and / or spreading "neo-nazi junk"; and I don't 
know what the marriage status of the editors has to do with anything.

The central concern of the FAZ article appears to be the role that 
private entities play as gate-keepers of modern communication. That's 
actually a valid concern (see issues like net-neutrality, algorithm 
based search returns and news-feeds and the like). The fact that fine 
distinctions of a technical nature may have been handled with less 
precision than insiders would prefer, is perhaps sloppy, but pretty 
typical for journalism in general.

None of that warrants the kind of loaded language used here.

A./

PS: must admit, I haven't followed the FAZ in a while, so I have no 
personal knowledge of any changes that may have happened in recent 
years, but in earlier times the Feuilleton (the section that this 
article appeared in) used to be fairly liberal in outlook, certainly not 
given to the extremist views that they are accused of here. And I can  
detect no evidence that the charges below have any merit.

On 9/3/2016 9:30 AM, Marcel Schneider wrote:

> ...t there is obviously
> some dishonest handling of core information by the FAZ authors, except
> in the case that they were unable to understand the difference between
> a character encoding refusal and an emoji property value change, or, as
> of the PISTOL emoji, the difference between a character and a glyph.
> ... It seems that this FAZ article was written by some
> unmarried, unresponsive beginners.
>
> However, since they talk of the RIFLE character as if it didn't exist in
> Unicode (and not only were "missing" amidst the iOS emoji), it's hard
> for me to make any sense except by considering those utterings as a kind
> of neonazi propaganda junk (despite of the renown of the newspaper itself)
> due most probably to the fact that the responsible chief editor was on
> holidays.
>
>


From charupdate at orange.fr  Sat Sep  3 20:10:58 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 4 Sep 2016 03:10:58 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
 <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
Message-ID: <112979824.3.1472951458962.JavaMail.www@wwinf1c02>

On Sat, 3 Sep 2016 14:21:22 -0700, Asmus Freytag (c) wrote:

> I don't think that there should be a place on this list for accusing
> people of dishonesty and / or spreading "neo-nazi junk"; and I don't
> know what the marriage status of the editors has to do with anything.
> 
> The central concern of the FAZ article appears to be the role that
> private entities play as gate-keepers of modern communication. That's
> actually a valid concern (see issues like net-neutrality, algorithm
> based search returns and news-feeds and the like). The fact that fine
> distinctions of a technical nature may have been handled with less
> precision than insiders would prefer, is perhaps sloppy, but pretty
> typical for journalism in general.
> 
> None of that warrants the kind of loaded language used here. 
> 
> A./
> 
> PS: must admit, I haven't followed the FAZ in a while, so I have no
> personal knowledge of any changes that may have happened in recent
> years, but in earlier times the Feuilleton (the section that this
> article appeared in) used to be fairly liberal in outlook, certainly not
> given to the extremist views that they are accused of here. And I can
> detect no evidence that the charges below have any merit. 

I admit that I mistook my language level. In front of the long discussion on
the Unicode List triggered by that FAZ paper, I've ended up bursting out.

The main concern, as Doug Ewell's last question underscores, is whether the
attack against the Unicode Consortium is justified in any way, or is mere calumny.

Further research brings up that the author of the paper is a very young freelance
journalist.[1] That confirms my suspicion that, having no children in age of using
an iPhone, he doesn't feel concerned with Apple's choice. However, he is right in
that, at the very end of his article, he points out the risk of the water pistol emoji
being intended as such and received on Android:

„Die in Codes formulierte Entwaffnungspolitik kehrt sich in ihr Gegenteil,
wenn ein iPhone-Nutzer seine Freunde zu einer Wasserschlacht einlädt und
ihnen per SMS ein Wasserpistolen-Emoji schickt: Dann erscheint auf dem
Samsung-Gerät keine Wasserpistole, sondern ein Revolver. Und das könnten
die Empfänger womöglich missverstehen.“

[revised Google translation: "The disarmament policy that is formulated in codes
is reversed into its opposite when an iPhone user invites his friends to a water
fight and sends them a water pistol emoji by SMS: Then the Samsung device does not
display a water pistol, but a revolver. Something that the receivers could possibly
misunderstand."]

Arguing by this very rare case is consistent with the facts-twisting used in other
parts of the article. This casts a crude twilight on the author's approach. Harsh
wording such as "doppelzüngig" (deceitful, speaking of Apple and Microsoft);
"schleift das Recht auf freie Meinungsäußerung" (grinds the right on Free Speech,
as of Apple), pointed as a "Skandal"; "zeugt von einer verqueren Sicht der Dinge"
(brings evidence of an awry/askew/screwy point of view) contrasts with an obvious
lack of knowledge when talking of Unicode as both proposing and accepting emoji...

As is outlined by a reader's comment, while emoji are formally in the first place,
the demonstration is biased with a mix-up involving speech, then applied to emoji
to make the reader believe that the Orwell-reminiscence is really well-placed.

Unicode and big tech companies are always patient targets for attacks 
of that kind. As pointed by another commenting reader: A tempest in a teapot.

I remember the FAZ feuilleton over a decade ago too, appearing to me always as 
high-quality journalism. A quick look at the last article from the same author[2]
makes me believe that truth and accuracy still conform to the standard.

In return, I'm left back with the troublesome question: Why do they hate Unicode,
Apple and Microsoft?

Marcel

[1] https://www.linkedin.com/in/adrian-lobe-aa3057b7
[2] http://www.faz.net/aktuell/feuilleton/debatten/wer-haftet-fuer-luegen-die-algorithmen-verbreiten-14416032.html


From christoph.paeper at crissov.de  Sun Sep  4 01:08:55 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sun, 4 Sep 2016 08:08:55 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
 <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
Message-ID: <1877D788-19A5-4ABD-B594-35E62245FE04@crissov.de>

Asmus Freytag (c) <asmusf at ix.netcom.com>:
> 
> The central concern of the FAZ article appears to be the role that private entities play as gate-keepers of modern communication. That's actually a valid concern (...). The fact that fine distinctions of a technical nature may have been handled with less precision than insiders would prefer, is perhaps sloppy, but pretty typical for journalism in general.

Exactly. Anybody who becomes aware of being considered a gatekeeper (i.e. a mild version of a "censor") by the general public should not react by dismissal but reflection!

The FAZ is generally considered conservative by German/European standards, but would still be considered rather liberal in the US. Be assured that it takes a rather restrictive stand when it comes to *actual* gun control (at least by international standards). Within the spectrum of traditional German newspapers, it usually is quite on the pro side of capitalism and trans-Atlantic friendship (i.e. its policy is not "anti-American"). The fear of being controlled or restricted by big (US-based) corporations or faceless bureaucrats, however, is shared by many left and right-wing authors. The state itself (unlike in "1984" or NRA propaganda) is generally not seen as the enemy in German media. The reference to Orwell's dystopia was hence badly chosen, but it is probably the one best known (besides "Brave New World") among the readership.

From c933103 at gmail.com  Sun Sep  4 05:48:04 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sun, 4 Sep 2016 18:48:04 +0800
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <112979824.3.1472951458962.JavaMail.www@wwinf1c02>
References: <1408976176.8391.1472920238977.JavaMail.www@wwinf1c26>
 <59355a68-f064-bd3a-a800-d6c81f90f972@ix.netcom.com>
 <112979824.3.1472951458962.JavaMail.www@wwinf1c02>
Message-ID: <CAGHjPP+T=qk=-95HXd0Jy5UxDmZL7+a8K-jP3_fvk=-VzR4kQg@mail.gmail.com>

Even after admitting that you mistook your language level, your reply is
still full of ad hominem. Being very young does not mean having no children
of iPhone-using age, and even if he has no children of that age, that does
not mean he is unconcerned about children possibly being exposed to hate
comments. In my opinion, concerns about what children might be exposed to
should be dealt with through parental controls, not through something that
affects every user. In some war-torn countries, guns are unfortunately still
part of daily life for some people, and they are Unicode users too. They use
emoji too. Please at least try to pretend to be less US-centric.

Calling the interoperability problem a very rare case is just pretending the
problem does not exist; think about what the yellow heart looks like in
Android 4.4.

The words used in the article are, in my opinion, fairly mild and indirect;
at least it is not written like some other reports I have read earlier,
which claim that this is yet another example of how America dictates the
online world and uses its technological advantage to force the rest of the
world to make sacrifices for it.

In the Google-translated text of the report, I don't see any particularly
big problem with the author's understanding of Unicode's proposal and
acceptance procedure, and I am afraid you might have misunderstood something
between the lines. Also, the author's point is not about problems in the
proposal/acceptance procedure; instead he is talking about how such a use of
the existing procedure could lead to undesirable effects.

You cannot deny that emoji are used as part of speech, even if mostly
informally. For instance, in an election taking place in my city today, some
candidates have been forced to use emoji in their campaign material to
avoid censorship. It is likely that if people can't find their desired
emoji in the emoji list, they will turn to other, less fitting emoji and
thus avoid the expression they originally intended to use.

Determining what does or does not appear in the emoji list for non-technical
reasons surely opens up a new path for governmental bodies around the world
to control people's expression. For instance, if you set your locale to
China, the flag of Taiwan disappears from your keyboard if you're using the
native keyboard on phones from some brands. Or think about a hypothetical
situation in which a certain Islamic country requires that all phones sold
there show a hijab on all the women emoji.

If you think such a rational article was written with hate, then maybe you
have already assumed that any negative opinion equals hate.

Ckyu.

On 4 Sep 2016 at 09:13, "Marcel Schneider" <charupdate at orange.fr> wrote:

> On Sat, 3 Sep 2016 14:21:22 -0700, Asmus Freytag (c) wrote:
>
> > I don't think that there should be a place on this list for accusing
> > people of dishonesty and / or spreading "neo-nazi junk"; and I don't
> > know what the marriage status of the editors has to do with anything.
> >
> > The central concern of the FAZ article appears to be the role that
> > private entities play as gate-keepers of modern communication. That's
> > actually a valid concern (see issues like net-neutrality, algorithm
> > based search returns and news-feeds and the like). The fact that fine
> > distinctions of a technical nature may have been handled with less
> > precision than insiders would prefer, is perhaps sloppy, but pretty
> > typical for journalism in general.
> >
> > None of that warrants the kind of loaded language used here.
> >
> > A./
> >
> > PS: must admit, I haven't followed the FAZ in a while, so I have no
> > personal knowledge of any changes that may have happened in recent
> > years, but in earlier times the Feuilleton (the section that this
> > article appeared in) used to be fairly liberal in outlook, certainly not
> > given to the extremist views that they are accused of here. And I can
> > detect no evidence that the charges below have any merit.
>
> I admit that I mistook my language level. In front of the long discussion on
> the Unicode List triggered by that FAZ paper, I've ended up bursting out.
>
> The main concern as Doug Ewell's last question underscores it, is whether the
> attack against the Unicode Consortium is justified in any way, or is mere
> calumny.
>
> Further research brings up that the author of the paper is a very young
> freelance journalist.[1] That confirms my suspicion that having no children
> in age of using an iPhone, he doesn't feel concerned with Apple's choice.
> However he is right in that, at the very end of his article he points out
> the risk of the waterpistol emoji being intended as such and received on
> Android:
>
> "Die in Codes formulierte Entwaffnungspolitik kehrt sich in ihr Gegenteil,
> wenn ein iPhone-Nutzer seine Freunde zu einer Wasserschlacht einlädt und
> ihnen per SMS ein Wasserpistolen-Emoji schickt: Dann erscheint auf dem
> Samsung-Gerät keine Wasserpistole, sondern ein Revolver. Und das könnten
> die Empfänger womöglich missverstehen."
>
> [revised Google translation: "The disarmament policy that is formulated in
> codes is reversed into its opposite when an iPhone user invites his friends
> to a water fight and sends them a water pistol emoji by SMS: Then the
> Samsung device does not display a water pistol, but a revolver. Something
> that the receivers could possibly misunderstand."]
>
> Arguing by this very rare case is consistent with the facts-twisting used in
> other parts of the article. This casts a crude twilight on the author's
> approach. Harsh wording such as "doppelzüngig" (deceitful, speaking of Apple
> and Microsoft); "schleift das Recht auf freie Meinungsäußerung" (grinds the
> right on Free Speech, as of Apple), pointed as a "Skandal"; "zeugt von einer
> verqueren Sicht der Dinge" (brings evidence of an awry/askew/screwy point of
> view) contrasts with an obvious lack of knowledge when talking of Unicode as
> both proposing and accepting emoji...
>
> As is outlined by a reader's comment, while emoji are formally in the first
> place, the demonstration is biased with a mix-up involving speech, then
> applied to emoji to make the reader believe that the Orwell-reminiscence is
> really well-placed.
>
> Unicode and big tech companies are always patient targets for attacks
> of that kind. As pointed by another commenting reader: A tempest in a
> teapot.
>
> I remember the FAZ feuilleton over a decade ago too, appearing to me always
> as high-quality journalism. A quick look at the last article from the same
> author[2] makes me believe that truth and accuracy still conform to the
> standard.
>
> In return, I'm left back with the troublesome question: Why do they hate
> Unicode, Apple and Microsoft?
>
> Marcel
>
> [1] https://www.linkedin.com/in/adrian-lobe-aa3057b7
> [2] http://www.faz.net/aktuell/feuilleton/debatten/wer-haftet-fuer-luegen-die-algorithmen-verbreiten-14416032.html
>
>

From doug at ewellic.org  Mon Sep  5 11:51:03 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 5 Sep 2016 10:51:03 -0600
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <mailman.1.1473008401.16186.unicode@unicode.org>
References: <mailman.1.1473008401.16186.unicode@unicode.org>
Message-ID: <84059D99CFD34648A4A4FC17D74047A4@DougEwell>

Marcel Schneider wrote:

> The main concern as Doug Ewell's last question underscores it, is
> whether the attack against the Unicode Consortium is justified in any
> way, or is mere calumny.

I didn't accuse FAZ or anyone else of calumny, or any sort of malicious 
intent. Hanlon's Razor applies here.

--
Doug Ewell | Thornton, CO, US | ewellic.org 


From charupdate at orange.fr  Mon Sep  5 16:49:41 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 5 Sep 2016 23:49:41 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and
 Apple handle Emoji as an attack on Free Speech
In-Reply-To: <84059D99CFD34648A4A4FC17D74047A4@DougEwell>
References: <mailman.1.1473008401.16186.unicode@unicode.org>
 <84059D99CFD34648A4A4FC17D74047A4@DougEwell>
Message-ID: <931602732.19273.1473112181379.JavaMail.www@wwinf1e21>

On Mon, 5 Sep 2016 10:51:03 -0600, Doug Ewell wrote:

> Marcel Schneider wrote:
> 
>> The main concern as Doug Ewell's last question underscores it, is
>> whether the attack against the Unicode Consortium is justified in any
>> way, or is mere calumny.
> 
> I didn't accuse FAZ or anyone else of calumny, or any sort of malicious
> intent. Hanlon's Razor applies here.

Yep, I wasn't aware! With this in mind, there will be far fewer hassles.
Thanks!

Marcel


From nobody_uses at outlook.com  Mon Sep  5 18:53:50 2016
From: nobody_uses at outlook.com (eduardo marin)
Date: Mon, 5 Sep 2016 23:53:50 +0000
Subject: named character sequences foor tally marks
Message-ID: <MWHPR2001MB105388E81AA83890B731AF8482E60@MWHPR2001MB1053.namprd20.prod.outlook.com>

I love the proposal to add western tally marks because it only occupies two code points for a technically equivalent solution: http://www.unicode.org/L2/L2016/16065-tally-marks.pdf

However, this proposal isn't complete unless we can easily identify tally marks 2, 3 and 4; the simplest way is to add named character sequences in which tally mark one is simply repeated n times.
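
As a rough illustration of that idea (a sketch only, not part of L2/16-065): the Python below builds entries in the "NAME;code points" layout used by NamedSequences.txt. The sequence names and the private-use code points are hypothetical placeholders, since no code points had been assigned to the proposed tally marks at the time of writing.

```python
# Hypothetical placeholders: private-use code points stand in for the
# proposed (not yet assigned) tally mark characters.
TALLY_MARK_ONE = "\uE000"
TALLY_MARK_FIVE = "\uE001"

# Hypothetical sequence names, chosen only for this illustration.
SEQUENCE_NAMES = {2: "TALLY MARK TWO", 3: "TALLY MARK THREE", 4: "TALLY MARK FOUR"}


def named_sequence_entry(n):
    """Return a 'NAME;CP CP ...' line for n repetitions of tally mark one."""
    if n not in SEQUENCE_NAMES:
        raise ValueError("only 2, 3 and 4 need a named sequence")
    code_points = " ".join(f"{ord(TALLY_MARK_ONE):04X}" for _ in range(n))
    return f"{SEQUENCE_NAMES[n]};{code_points}"


def tally(count):
    """Render a count as groups of five followed by the remaining ones."""
    fives, ones = divmod(count, 5)
    return TALLY_MARK_FIVE * fives + TALLY_MARK_ONE * ones


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(named_sequence_entry(n))
```

With this scheme the two characters remain the only encoded units; the named sequences merely give stable names to the strings that `tally(2)`, `tally(3)` and `tally(4)` would produce.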

From irgendeinbenutzername at gmail.com  Mon Sep  5 19:34:38 2016
From: irgendeinbenutzername at gmail.com (Charlotte Buff)
Date: Tue, 6 Sep 2016 02:34:38 +0200
Subject: Why isn't MUSICAL SYMBOL NULL NOTEHEAD default ignorable?
Message-ID: <CAKLR3Ap2qyht4yhEXU24hUh82YMpZRgk+CWF8XfHoUk-QOUGhA@mail.gmail.com>

It has just come to my attention that U+1D159 MUSICAL SYMBOL NULL NOTEHEAD
is not default ignorable, even though it has no visible glyph appearance
and no advance width in text, just like the various Hangul jamo fillers
that *are* default ignorable. Is there a technical reason for this or is it
just an oversight?
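
For anyone who wants to check this against the data: the property is listed in DerivedCoreProperties.txt in the Unicode Character Database. Below is a minimal Python sketch, assuming the usual "range ; property # comment" layout of that file. The two sample lines are abbreviated stand-ins for the real file, and the function names are made up for illustration.

```python
def parse_property_ranges(lines, prop):
    """Collect (first, last) code point ranges listed for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def in_ranges(code_point, ranges):
    """True if the code point falls inside any of the given ranges."""
    return any(first <= code_point <= last for first, last in ranges)


# Abbreviated sample in the file's layout; a real check would read the
# complete DerivedCoreProperties.txt instead.
SAMPLE = [
    "115F..1160    ; Default_Ignorable_Code_Point # HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER",
    "3164          ; Default_Ignorable_Code_Point # HANGUL FILLER",
]

if __name__ == "__main__":
    ranges = parse_property_ranges(SAMPLE, "Default_Ignorable_Code_Point")
    for cp in (0x115F, 0x3164, 0x1D159):
        print(f"U+{cp:04X}", in_ranges(cp, ranges))
```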

From nobody_uses at outlook.com  Tue Sep  6 00:18:35 2016
From: nobody_uses at outlook.com (eduardo marin)
Date: Tue, 6 Sep 2016 05:18:35 +0000
Subject: Enconding a flammable sign and others
Message-ID: <MWHPR2001MB1053071EE534D8B700CF37E982F90@MWHPR2001MB1053.namprd20.prod.outlook.com>

I'm really surprised this isn't already encoded, but while we are at it let's also encode symbols for non-ionizing radiation, laser hazard, explosion hazard, strong magnetic field, choking hazard, corrosion, slippery floor, oxidising, carcinogen and chemical weapons, just to name the most relevant. Other ones would include UV light, freezing hazard, hand in the middle of cogs, foot or hand under machinery, battery hazard, washing hands, fragile symbol, crane hazard, suffocation, high temperatures and probably some I missed.

From verdy_p at wanadoo.fr  Tue Sep  6 07:03:28 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 6 Sep 2016 14:03:28 +0200
Subject: named character sequences foor tally marks
In-Reply-To: <MWHPR2001MB105388E81AA83890B731AF8482E60@MWHPR2001MB1053.namprd20.prod.outlook.com>
References: <MWHPR2001MB105388E81AA83890B731AF8482E60@MWHPR2001MB1053.namprd20.prod.outlook.com>
Message-ID: <CAGa7JC2AOmX1gQj9ufw3Pk5=TqAyDGgz6u=_yYvmsDn8wEE-mQ@mail.gmail.com>

Doesn't the proposal show only the significant cases, for the numbers 1-5?
(The others repeat the glyphs, with their separation produced by the side
bearings.)

Digits 1-4 (using the first variant, with an overstriking slash for 5) are
also highly confusable with the existing vertical bars and Devanagari
punctuation; the significant difference is their side bearings, which may
not be distinguishable in monospaced fonts.

The second variant (alternating horizontals and verticals) has the same
confusable glyph for 1, and its glyph for 2 may be confusable with some
Hangul.

However, for a coherent presentation (to avoid mixing fonts with different
metrics), it still seems sensible to encode 1-5 as a single set. (The
addition of 10-20 does not seem necessary, unless there are other variants
of tally marks.)

Other variants:

- 1. Many people simply extend the number of vertical bars and do not use
any slanted slash (5 is represented just as " ||||| ").

- 2. Another common variant successively draws each side of a square and
adds a diagonal for 5.

- 3. I've seen at least one variation on the first variant that uses a
horizontal bar for 6 (two for 7, three for 8, four for 9) and represents
10 with two orthogonal slashes (or, when counting directly by tens, with a
square or just an X-shaped cross, similar to the Roman numeral X).

Tally marks are basically used to count events or objects progressively,
as they come, by adding a single stroke to the counter without erasing
anything. These marks are not necessarily drawn with a pen: they may be
engraved with some cutting tool, or they may be holes through a thin
surface. Their presentation depends on the material used and on whether
they must resist mechanically for extended periods of time (ink on paper is
not always usable, notably when exposed directly to water or humidity, or
when the marks are on objects that are handled frequently).

Their presentation therefore varies widely, but for use in plain-text
documents that will be printed on paper or displayed on screen (possibly
along with other text), the first variant (vertical bars and an
overstriking slash for 5) seems to be the most representative.

I don't think this proposal really justifies the addition of the numbers
6-20.

In fact, for digits 1-5 it also looks very similar to variants of the Roman
numerals (where 4 may also be represented by " IIII " instead of " IV ").
Roman numerals were initially simple tally marks (as in the older Greek and
Phoenician scripts and many other scripts of the world), but they later came
to reuse the Latin letters of the alphabet (with additional letters borrowed
for multiples of powers of ten), and later still came to use lowercase
letters and ligated glyphs as well.

The proposal you cite suggests reusing the same code for two very
different variants, but I wonder why. The first variant is the most common
and matches what is used in many scripts. The second variant is probably
very specific to use with Asian scripts. The variant with the sides of a
square is probably better known and is used across cultures. The order in
which the sides are drawn is not significant, even though they are generally
drawn by going around the square from one side to the next. Which diagonal
is drawn is usually not significant either: right-handed people usually use
a "/" slash, left-handed people frequently use a "\" backslash, and people
may also alternate the slashes between the first group of five and the
second group that completes ten, to make the total easier to read.

Another common usage is to add a second horizontal stroke over two
successive groups of five tally marks, to create complete groups of 10.

If the surface is easily erasable or discardable, another tally line may
be used to count complete groups of ten (or some other significant group
such as 12, depending on the context, notably in games), erasing or
discarding the first line for units as soon as it is complete.

There are also games using groups of 3 units, where the tally marks are the
sides of a triangle, or groups of 6 units, where the marks are the sides of
a square and its two diagonals.

Finally, there are also well-known games where the tally marks draw a
hanged man, usually with 10 strokes (including straight strokes for the
horizontal base, the vertical mast, the horizontal support at the top and
the hanging cord, a circle for the head, and single strokes for the body and
neck, the arms, and the legs; the exact layout may vary, and I've also seen
games drawing a basic house, with a triangular roof and a door).
However, these tally schemes in games are not perceived as digits or numbers.

---

Notes about variant (1.) above:

   This variant, using only vertical bars, is well known in France, where it
is used when opening and counting the votes in all official elections and
referendums (similar methods may be used for other elections and polls in
organizations, such as professional elections, when there are large numbers
of voters with secret ballots and no agreed electronic voting).
  Attempts to use electronic voting for official elections have always been
opposed (and they bring no advantage for the secrecy of the vote or in
terms of cost and speed of operations).

  The position at which to draw these vertical bars is preprinted on the
tally sheets as a small dot or horizontal dash.
  The counting is done publicly, immediately after the close of the vote, by
volunteers (assessors) who vote in that bureau and whose identity is
checked.
  A sworn public officer may also be present to check and certify the
regularity of the operations (when the empty urn is sealed before the vote
opens, and from the breaking of the seals until the votes are fully
counted); this happens in some municipalities where the candidates are in
strong opposition and the majority is likely to be contested. Most
frequently, however, candidates have their own representative present in
each bureau (among the public, which is still kept separate from the tally
tables).
  Generally there are 3-4 tables per bureau for this operation (there may be
only one table if not enough volunteers are present), plus the president of
the bureau (an elected member of the municipality) and one or two
representatives of the opposition, or some administrative personnel of the
municipality; the police may also be present to control the public and
secure the operations. As long as there are not enough people present to
start the count, the sealed transparent urns cannot be opened (they may be
taken by the police to another bureau that will open them publicly, but
until then they must remain visible to everyone). The actual tallying
process takes place after the urns are opened and the individual envelopes
are counted into groups of 100: these groups are put in sealed envelopes
that will be opened separately at the tally tables.
  At each tally table, two assessors count in parallel on separate sheets,
another assessor opens the envelopes, and a fourth announces each vote aloud
and sorts the ballots (all 4 assessors check the regularity of the vote and
sign the null votes and empty envelopes). It may happen that a group of 100
has one envelope too many or one too few: this is reported but does not
cancel the group, and the tally sheet records the actual number of valid and
invalid votes.
  All valid, null/canceled and blank votes are counted with these tally
marks. Only the empty envelopes that contained the secret votes need not be
kept with the tally sheets (but envelopes containing invalid/null/blank
votes are signed by the assessors and kept for later checking, if needed).
  At the end of counting a group of 100 votes, the total is also written in
a dedicated column, using the standard digits "0-9", and a new tally sheet
is used for the next group (each tally sheet is then signed by each of the
4 people). If the two tally sheets do not have matching totals at this step,
then (because all opened envelopes are kept on the table) the whole group is
recounted on new sheets and the assessors sign the cancellation of the old
tally sheet. If the public sees irregularities in the operations at a tally
table, the group of 100 is recounted at another table (groups of 100 votes
are never mixed on the same table).
  At the end of operations, another sheet is used by the bureau to total all
the votes for the bureau, and the results are immediately announced publicly
in the bureau and displayed outside for several days. This totalling sheet
uses only the normal digits "0-9". All the signed sheets of the bureau are
then sent to the central bureau of the municipality, where they are totaled
and announced publicly and then sent to the prefecture (representing the
national authority), immediately by electronic means and also by secure
postal mail.
  Finally, these totals are compared with the register of participants (each
voter signs this register before inserting their secret vote in the urn),
which is checked separately and publicly, with its own totals, while the
votes from the urns are being counted by the assessors.

  The whole process lasts about one hour (more or less, depending on the
number of tables). These well-known public operations in the bureau are
rarely contested (and most people feel that they are more secure than any
form of electronic voting, which is also felt to be intrusive on the secrecy
of the vote).

  Contestations generally arise from what happens outside the bureau (such
as illegal campaigning on the day of the vote, irregularities in the
register of voters, or security problems affecting voters' access to the
bureau), or from opening the vote before the officially scheduled time
(before the public can officially be present) or keeping it open too late
(when no more voters who arrived at the bureau in regular time were waiting
their turn to access the voting booths, sign the register and insert their
vote in the urn). There is a small tolerance for closing the vote one or two
minutes late, but by then the public is generally present. (At that time,
however, results or estimates may already have been published and could
influence last-minute voters, and this could be reported as an irregularity,
possibly leading a court to invalidate all the results of the bureau; if too
many results are canceled by a court, changing the final results
significantly, a new vote has to be organized later, and this has a
significant cost for municipalities.)

  So the vertical tally marks on the sheets are used only temporarily (they
are really needed for a few minutes) but may still be checked later (along
with all the envelopes and ballots, which are kept together in a large
sealed envelope containing the 100 votes, closed and signed by the assessors
and the president of the bureau).

  As this tallying method is well known, it is also used informally (notably
by children playing games). But tally marks using the sides of a square and
a diagonal are also common in popular games.



From doug at ewellic.org  Tue Sep  6 09:56:27 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 06 Sep 2016 07:56:27 -0700
Subject: Enconding a flammable sign and others (et al.)
Message-ID: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com>

As a reminder, "let's encode X" on the public mailing list doesn't
constitute a proposal, even for emoji. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From asmusf at ix.netcom.com  Tue Sep  6 11:12:48 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 6 Sep 2016 09:12:48 -0700
Subject: Enconding a flammable sign and others (et al.)
In-Reply-To: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com>
References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com>
Message-ID: <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160906/6ef79d2f/attachment.html>

From everson at evertype.com  Tue Sep  6 11:17:44 2016
From: everson at evertype.com (Michael Everson)
Date: Tue, 6 Sep 2016 17:17:44 +0100
Subject: Enconding a flammable sign and others (et al.)
In-Reply-To: <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com>
References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com>
 <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com>
Message-ID: <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com>

On 6 Sep 2016, at 17:12, Asmus Freytag (c) <asmusf at ix.netcom.com> wrote:

> As a reminder, "X" is already encoded, so no proposal for encoding it would be accepted.

My Latin chi was accepted ;-)

M

From doug at ewellic.org  Tue Sep  6 12:01:12 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 06 Sep 2016 10:01:12 -0700
Subject: Enconding a flammable sign and others (et al.)
Message-ID: <20160906100112.665a7a7059d7ee80bb4d670165c8327d.2bdbbb8be8.wbe@email03.godaddy.com>

Asmus Freytag (c) wrote:

>> As a reminder, "let's encode X" on the public mailing list doesn't
>> constitute a proposal, even for emoji.
>
> As a reminder, "X" is already encoded, so no proposal for encoding it
> would be accepted.

Ha ha! It is to laugh! Let me try again.

Suggestions on the public mailing list to encode new characters,
including but not limited to the recent ones quoted below, do not
constitute proposals, even for emoji.

> It is well known that the Southern Song style of counting rods had
> different forms for the digits 4, 5 and 9
> (https://en.wikipedia.org/wiki/Counting_rods); however, there is
> currently no way to represent such forms. A proposal to add them would
> occupy only five code points, since the number four is identical in its
> vertical and horizontal forms.

> the last four characters on the right of which I propose the following
> names: UPPER HALF BLOCK MEDIUM SHADE, LOWER HALF BLOCK MEDIUM SHADE,
> FULL BLOCK-UPPER HALF MEDIUM SHADE and FULL BLOCK-LOWER HALF MEDIUM
> SHADE. I recommend encoding them in the Miscellaneous Symbols and
> Arrows block or in the Geometric Shapes Extended block.

> I'd argue there's a fifth character missing from the ZX-81 set. Notice
> how there are two separate MEDIUM SHADEs, one the inverse of the
> other. For complete compatibility Unicode would also somehow need a
> second one of those.

> However, this proposal isn't complete unless we can easily identify tally
> marks 2, 3 and 4; the simplest way is to add named character
> sequences in which tally mark one is simply repeated n times.

> I'm really surprised this isn't already encoded, but while we are at it
> let's also encode symbols for non-ionizing radiation, laser hazard,
> explosion hazard, strong magnetic field, choking hazard, corrosion,
> slippery floor, oxidising, carcinogen and chemical weapons, just
> to name the most relevant. Other ones would include UV light, freezing
> hazard, hand in the middle of cogs, foot or hand under machinery,
> battery hazard, washing hands, fragile symbol, crane hazard,
> suffocation, high temperatures and probably some I missed.

 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From asmusf at ix.netcom.com  Tue Sep  6 14:12:20 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 6 Sep 2016 12:12:20 -0700
Subject: Enconding a flammable sign and others (et al.)
In-Reply-To: <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com>
References: <20160906075627.665a7a7059d7ee80bb4d670165c8327d.21ef170ccd.wbe@email03.godaddy.com>
 <4a986c1e-1460-ecab-0427-0fe1b8c50fb9@ix.netcom.com>
 <2F64DAE3-5983-4A36-8E4D-744CBFC0CFB1@evertype.com>
Message-ID: <1e79542b-d7f4-17d8-c432-595da6e9b774@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160906/ac0ea69b/attachment.html>

From mpsuzuki at hiroshima-u.ac.jp  Fri Sep  9 08:58:24 2016
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Fri, 09 Sep 2016 22:58:24 +0900
Subject: how to evaluate the "emoji support level" in given font?
Message-ID: <57D2C000.5070902@hiroshima-u.ac.jp>

Hi,

Recently, the fontconfig developers have been discussing how to evaluate
whether a font supports the emoji set sufficiently. Is it possible
to define a subset of emoji that serves the common uses of emoji?

For details of the fontconfig developers' discussion, please
refer to the thread starting at:
https://lists.freedesktop.org/archives/fontconfig/2016-September/005830.html

* about fontconfig
fontconfig is a library widely used on Unix-like operating
systems to locate a font file (its pathname) from a query giving a
typographic category (serif/sans-serif/monospace etc.), a script, and a
language. fontconfig crawls the font files on the system and builds
a database to answer such queries. To guess the supported scripts and
languages, fontconfig basically checks the coverage of the code points
that have glyph data. The coverage is compared with an orthography
database: in the case of CJK scripts, the coverage is compared with
GB 2312, Big5, HKSCS, JIS X 0208, KS X 1001, etc.

* emoji and fontconfig
At present, the fontconfig developers are wondering how to list the
code points needed to answer the query "does this font support emoji?".
The stable subset of emoji would be the repertoire used by Japanese legacy
cellular phones, but (personally) I don't think emoji fonts are still
designed around that repertoire, except where the developer specifically
cares about legacy cellular phone users.

Is it possible to define a subset of emoji that serves the common uses of
emoji? Or, if such an approach (evaluating the support level of emoji by
checking some code points) is wrong, is there a good method for evaluating
the support level of emoji in a given font?

Regards,
mpsuzuki


From mpsuzuki at hiroshima-u.ac.jp  Fri Sep  9 09:20:43 2016
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Fri, 09 Sep 2016 23:20:43 +0900
Subject: [Unicode] how to evaluate the "emoji support level" in given font?
In-Reply-To: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com>
References: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com>
Message-ID: <57D2C53B.1050007@hiroshima-u.ac.jp>

Oh, I should explain further why I wrote "subset". There is a full
list of emoji defined by Unicode:
http://unicode.org/Public/emoji/3.0/emoji-data.txt
But I doubt that most emoji font developers are
trying to cover all of this list.

For example, to check the support level for zh-CN, fontconfig does
not check all G-source characters of the CJK Unified Ideographs, because
there are so many Chinese fonts that cover GB 2312 but not
GB 18030. I guess the situation is similar for emoji fonts...
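
For reference, a minimal sketch of how the full list could be read into a
set of codepoints before choosing a subset from it. It assumes the usual
"codepoint-or-range ; property # comment" layout of emoji-data.txt; the
sample text below is an illustrative excerpt in that layout, not the real
file, and parse_emoji_data is a hypothetical helper name.

```python
def parse_emoji_data(text, wanted_property="Emoji"):
    """Collect codepoints carrying `wanted_property` from emoji-data.txt-style text."""
    codepoints = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2 or fields[1] != wanted_property:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        codepoints.update(range(start, end + 1))
    return codepoints


sample = """\
# Illustrative excerpt in the emoji-data.txt layout (not the real file).
0023          ; Emoji                # NUMBER SIGN
1F600..1F603  ; Emoji                # four faces
1F3FB..1F3FF  ; Emoji_Modifier       # skin tone modifiers
"""

emoji = parse_emoji_data(sample)
print(sorted(hex(cp) for cp in emoji))
```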

Regards,
mpsuzuki


From pedberg at apple.com  Fri Sep  9 19:23:55 2016
From: pedberg at apple.com (Peter Edberg)
Date: Fri, 09 Sep 2016 17:23:55 -0700
Subject: CLDR Version 30 alpha available
Message-ID: <5D20A154-190C-48DD-A92D-041E9009D6A5@apple.com>

Dear Unicode list members,

The alpha draft version of Unicode CLDR v30 is available for testing. The main improvements include:
• New format and preference structure has been added to support week designations such as “the week of August 10” or “week 3 of March”.
• New data items have been added to support relative times such as “3 Fridays ago” or “this hour”.
• New <characterLabels> data can be used to generate labels for groups of related characters in character pickers.
• The structure for emoji annotations has been revised, and the data has been significantly updated.
• Unicode support is updated to 9.0, including updated Unihan readings for the pinyin collation and Han-Latin transforms, and support for new script codes and number systems. Support is also added for region codes EZ, UN.
• The set of language codes for translation has been updated, with a significant increase in the total number of translated language names.
• The CLDR 30 Survey Tool data collection and additional bug fixing resulted in a net increase in data items of about 8.6%, with an additional 5.6% of items changed.

Draft release note: http://cldr.unicode.org/index/downloads/cldr-30
Draft charts: http://www.unicode.org/cldr/charts/dev/
Draft data tag: http://www.unicode.org/repos/cldr/tags/release-30-d01

The final release of CLDR 30 is targeted for the end of September. Please provide any feedback on the alpha draft version by filing a ticket as described here: http://cldr.unicode.org/index/bug-reports

Best regards,
Peter Edberg for the CLDR Project


From doug at ewellic.org  Sat Sep 10 17:25:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 10 Sep 2016 16:25:29 -0600
Subject: how to evaluate the "emoji support level" in given font?
Message-ID: <22B1CFFCA9DC423DA4874F4EED0AA274@DougEwell>

suzuki toshiya wrote:

> Is it possible to design a subset of emoji to serve common use of
> emoji? Or, if such attempt (evaluate the support level of emoji by
> checking some codepoints) is wrong, is there any good method to
> evaluate the support level of emoji in given font?

This could be a more complex question for emoji than for writing 
systems. Users might have different expectations of what constitutes 
"support" for emoji as compared to, say, Latin or Kanji. They might 
expect to be able to select text or emoji rendering via U+FE0E and 
U+FE0F, or might expect support for ZWJ sequences. fontconfig may or 
may not be wired to take this into account.
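
To make the distinction concrete, here is a small Python sketch of the
constructs involved: the two variation selectors and an emoji sequence
joined with ZERO WIDTH JOINER (U+200D). The helper codepoints_needed is a
hypothetical name. A codepoint coverage test can only see the base
characters it returns; whether the font actually renders the sequence as
one glyph is a separate question that such a test cannot answer.

```python
VS15 = "\uFE0E"  # VARIATION SELECTOR-15: requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16: requests emoji presentation
ZWJ = "\u200D"   # ZERO WIDTH JOINER: joins emoji into one sequence

heart_text = "\u2764" + VS15
heart_emoji = "\u2764" + VS16
# MAN + ZWJ + WOMAN + ZWJ + GIRL
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467"])


def codepoints_needed(sequence):
    """Base codepoints a font must map before it can attempt the sequence."""
    return {ord(ch) for ch in sequence if ch not in (VS15, VS16, ZWJ)}


print([f"U+{cp:04X}" for cp in sorted(codepoints_needed(family))])
```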

On the other hand, as far as full or less-than-full coverage is 
concerned, does fontconfig currently insist on 100% coverage of all 
characters in a script? Nearly all fonts offer Basic Latin (ASCII) 
support, while relatively few have glyphs for the Latin Extended-E 
block. Is the latter required in order to claim "Latin script support," 
and if not, would similar criteria apply when determining "emoji 
support"?

Not trying to be critical, just trying to understand the expectations.

--
Doug Ewell | Thornton, CO, US | ewellic.org 


From christoph.paeper at crissov.de  Sun Sep 11 07:40:55 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sun, 11 Sep 2016 14:40:55 +0200
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To: <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
Message-ID: <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>

Mark Davis <mark at macchiato.com>:
> 
> In order to understand the status of any document in the registry, you need to also look at the minutes of the meeting where they are discussed, in this case: http://www.unicode.org/L2/L2016/16121.htm
> 
>> B.14.3 Provisional value for Emoji property [Emoji SC/Edberg, L2/16-087]
>> 
>> B.14.3.1 Characters Proposed for Emoji=Provisional [Emoji SC/Edberg, L2/16-088]
>> 
>> Discussion. UTC took no action at this time.
> 
> "Took no action" generally means "rejected". 

Can anyone explain, then, why [L2/16-128] seems to have been “rejected” and still made it into selection.html?

Same minutes as above:

> E.1.11 Additional Emoji selection factor [Emoji SC/Edberg, L2/16-128]
> 
> Discussion. UTC took no action at this time.

[L2/16-128]: http://www.unicode.org/L2/L2016/16128-additional-emoji-selection-factor.pdf

This was the proposed text to be added:

> The following is a criterion for adding characters into a release of Unicode. It is not a selection factor that proposals need to address, but rather a consideration that the UTC takes into account before approving a character as a candidate for inclusion in a future release. 
> 
> Compared to most other characters in Unicode, there is greater public awareness of new emoji characters, and a high expectation of support for them from major vendors. However, the cost to such vendors of supporting new emoji characters is also much higher than for most other Unicode characters, especially on devices with limited memory.
> 
> Thus in addition to these selection factors, before approving a new emoji character the Unicode Technical Committee needs to expect wide deployment: that major vendors would plan to include the proposed emoji character into very widely deployed fonts and input methods (keyboards / palettes / speech).

In the currently public version of “Submitting Emoji Character Proposals” (dated 4 August 2016) we find most of it unchanged.

http://www.unicode.org/emoji/selection.html#selection_factors

> Before approving as candidates or adding to a release of Unicode, other considerations are taken into account. See UTC Consideration.

http://www.unicode.org/emoji/selection.html#utc_consideration

> 1. Compared to most other characters in Unicode, there is greater public awareness of new emoji characters, and a high expectation of support for them from major vendors. However, the cost to such vendors of supporting new emoji characters is also much higher than for most other Unicode characters, especially on devices with limited memory.
> 
> 2. Thus in addition to the selection factors, before approving a new emoji character the Unicode Technical Committee needs to expect wide deployment: that major vendors would plan to include the proposed emoji character into very widely deployed fonts and input methods (keyboards / palettes / speech).
> 
> 3. The committee may balance the choices of emoji in a given set of candidates or release. For example, rather than 15 different breeds of dogs, the committee might choose to have some faces, some clothing, other animals, food items, transport items, and sports.

None of that was present in April 2016. <https://web.archive.org/web/20160427074931/http://www.unicode.org/emoji/selection.html>

I haven’t been able to find out what constitutes a “major vendor”. Apple, Microsoft and Google are certainly ones (and Unicode Full Members), but what about, for instance, Samsung, LG, Sony, Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent ones like Emojione (mostly Associate Members or not Unicode members at all)?

From verdy_p at wanadoo.fr  Sun Sep 11 09:21:27 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 11 Sep 2016 16:21:27 +0200
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To: <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
 <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>
Message-ID: <CAGa7JC0OkNJcaiWnoGdOGbfD=pB1e0W+Kw2K7D60tCFBMgnvcQ@mail.gmail.com>

2016-09-11 14:40 GMT+02:00 Christoph Päper <christoph.paeper at crissov.de>:

> Mark Davis <mark at macchiato.com>:
> >
> > I haven’t been able to find out what constitutes a “major vendor”.
> Apple, Microsoft and Google are certainly ones (and Unicode Full Members),
> but what about, for instance, Samsung, LG, Sony,


They are vendors, yes, but of hardware devices built on common platforms
(most often Windows and Android, or platforms derived directly from Android).
The hardware capabilities of their devices are mostly the same, as they
compete at the same time in the same global market with similar features.

The real question, however, is how they support these devices in the long
term: sales of new mobile devices are slowing down, as more and more people
feel that they are too expensive and prefer to keep their existing
devices working. There is now a lively secondary market for repairing
or upgrading them, or changing their batteries, instead of buying a new
device (in addition, many people already have a second phone, possibly
older, that can be used as a backup when necessary).


> Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent
> ones like Emojione (mostly Associate Members or not Unicode members at all)?
>

These are connected web apps, and web apps have no difficulty supporting
and upgrading their fonts or collections of icons. Users don't
need to upgrade; this happens almost transparently. If these are mobile
apps, they are updated extremely frequently from their app store
(Windows Store, Google Play, Apple's App Store) using simple procedures.
However, mobile apps tend to grow in size with each update, forcing users
to choose which ones to keep, or to uninstall one app to install another
and then switch back, if they have old models with low storage capacity:
8 GB smartphones are now being superseded by 16 GB ones. These app
vendors should offer several versions of their apps for users with
limited storage, yet they fail to monitor the market and to see that this
growth in app size is becoming a problem when many features are added but
not used. Most resident features would be better installed on
demand in a cache that is cleaned up automatically when storage runs
low; ideally most apps should be online and use minimal local code.

Besides these vendors, there remain some niche markets for mobile devices
whose own OS is not compatible with, and not supported by, these well-known
app vendors. But most of these OSes are based on a Linux core (just like
Android itself) and use well-known platforms (Java, .NET, or simply the
HTML/CSS support of the built-in browser), so apps are not complicated to
port.

The real complication is in the default support for the input interfaces
(i.e. virtual keyboards) that these apps need, and in adapting them to local
markets (languages). Emoji input, however, is mostly independent and can be
developed and supported across languages in the same input panels.

The necessary sets of fonts or icons may also be installed transparently
using web queries and the standard browser cache. These should be mostly
independent of the target app platform or OS. Vendors may develop their own
common subsets (with a personalized style), but there will be plenty of
alternative offers (just as there have long been many providers of
emoticons, well before many of them were encoded). I don't think this is a
major problem.

However, the complexity now is not in the encoding of emoji but in the recent
development of complex encodings that require large ligature tables to
work properly, and/or use variation selectors. The most common combinations
should be better documented in the standard (it is possible to encode them
in the list of "named sequences", or in a separate list specifically for
emoji). But I wonder whether it is productive to show all styles for emoji:
why not return to displaying just a single representative glyph, with a basic
"flat" design, but probably with some color to help distinguish them?
But then which version, from which vendor, should the standard use for that
glyph?

Emoji doc pages and reports on the Unicode site are becoming very large,
and it is becoming much harder to start designing new sets in new
styles. Google, Apple, Microsoft and Facebook can develop their own sets, just
as there are standard sets from Japanese telcos.

From christoph.paeper at crissov.de  Sun Sep 11 16:26:47 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sun, 11 Sep 2016 23:26:47 +0200
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To: <CAGa7JC0OkNJcaiWnoGdOGbfD=pB1e0W+Kw2K7D60tCFBMgnvcQ@mail.gmail.com>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
 <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>
 <CAGa7JC0OkNJcaiWnoGdOGbfD=pB1e0W+Kw2K7D60tCFBMgnvcQ@mail.gmail.com>
Message-ID: <F5FD0D8F-3259-464E-BFF6-92E96A994985@crissov.de>

Philippe Verdy <verdy_p at wanadoo.fr>:
> 2016-09-11 14:40 GMT+02:00 Christoph Päper <christoph.paeper at crissov.de>:
> 
>> I haven’t been able to find out what constitutes a “major vendor”. Apple, Microsoft and Google are certainly ones (…), but what about, for instance, Samsung, LG, Sony,
> 
> They are vendors yes, but for hardware devices using common platforms (…).

The important point here is that at least Samsung and LG are selling millions of devices annually with custom emoji fonts installed on them.

> The good question however is how they support these devices for the long term:

Many vendors are indeed really, really bad at developing, maintaining and rolling out OS updates for much of their product line-up. Even Apple supports their iOS devices for only ca. 5 years (from launch, not purchase)^, although the update infrastructure is already set up and new fonts by themselves would be rather small and simple fixes. Current emoji font files for mobile operating systems are somewhere between 3 MB (vector-based) and almost 40 MB (bitmap-based). 

^ Apple released iOS 5, the first version with non-PUA emojis and iMessages (IIRC), and the iPhone 4s in late 2011. The OS update being deployed this week (i.e. iOS 10) will no longer support that device which was sold in some places until earlier this year. That means, there are now iOS devices which were capable of handling Unicode emojis at their launch date, but will be permanently incapable of displaying new ones from Unicode 9 or later.

>> Twitter/Twemoji, Facebook, Whatsapp or widely-used platform-independent ones like Emojione (…)?
> 
> These are connected web apps, and web apps have no difficulties to support and upgrade their supporting fonts or collections of icons. Users don't need to upgrade, this occurs almost transparently. If these are mobile apps, they are updated extremely frequently from their publication store

That’s quite true, but doesn’t say anything about whether they’re “major vendors” when it comes to standardizing new emoji characters in Unicode.

> The real complication is in the default support for input interfaces (i.e. virtual keyboards) that these apps need, and adapting them for the local markets (languages). Emoji input however is mostly independent and can be developed and supported across languages in the same input panels.

That’s a general assumption, but I’m not sure it would hold against a user test. Observe, for instance, how on 11 September (or 4 July or during the Olympics) people complain on Twitter and elsewhere that they cannot find their “American flag emoji” at the top of the list, or how they confuse it with the Liberian or Malaysian regional indicator (cf. Texas vs. Chile). That’s why there are customized panels for frequently or recently used emojis, or auto-replace and suggest-as-you-type algorithms.

> However the complexity now is not in the encoding of emojis but the recent development of complex encodings requiring now large ligature tables to work properly, and/or using variant selectors.

Those large tables can be generated by rather short algorithms, which perhaps could be simpler if emoji properties were more systematic.
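
As an illustration of such a short algorithm, here is a Python sketch that
generates two families of sequences: flags built from pairs of regional
indicator symbols, and base-plus-modifier pairs using the five skin tone
modifiers U+1F3FB..U+1F3FF. The function names are hypothetical, and the
sketch does not check whether a given base character actually accepts a
modifier or whether a given region code has a flag.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(region_code):
    """Map a two-letter region code such as 'US' to its regional indicator pair."""
    code = region_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {region_code!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)


# The five emoji modifiers, U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]


def modifier_sequences(base):
    """All base + modifier pairs for one base character."""
    return [base + modifier for modifier in SKIN_TONE_MODIFIERS]


# A small table generated from codes mentioned in this thread.
flag_table = {code: flag_sequence(code) for code in ("US", "LR", "MY")}
thumbs_up_variants = modifier_sequences("\U0001F44D")
print(len(flag_table), len(thumbs_up_variants))
```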

From duerst at it.aoyama.ac.jp  Tue Sep 13 03:03:24 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 13 Sep 2016 17:03:24 +0900
Subject: [Unicode] how to evaluate the "emoji support level" in given font?
In-Reply-To: <57D2C53B.1050007@hiroshima-u.ac.jp>
References: <6fb50b79eec84f8993fce175508f5f58@KL1PR04MB1637.apcprd04.prod.outlook.com>
 <57D2C53B.1050007@hiroshima-u.ac.jp>
Message-ID: <5fc34456-744b-7e8c-76a0-bd7368c58739@it.aoyama.ac.jp>

I think the first and most obvious way to check would be according to 
Unicode Version, i.e. check for some Emoji introduced in version 6, in 
version 7, and so on. For very old sets, checking for emoji present in 
the NTT Docomo set but not in the Softbank set,... might also make sense.
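
A minimal Python sketch of this idea, with one probe codepoint per Unicode
version. The probe table is an assumption chosen for illustration; the
versions given are to the best of my knowledge and should be verified
against DerivedAge.txt before being relied upon, and a real check would
probably use several probes per version.

```python
# Hypothetical probe table: one emoji codepoint first encoded in each version.
VERSION_PROBES = {
    "6.0": 0x1F601,  # GRINNING FACE WITH SMILING EYES
    "7.0": 0x1F642,  # SLIGHTLY SMILING FACE
    "8.0": 0x1F910,  # ZIPPER-MOUTH FACE
    "9.0": 0x1F923,  # ROLLING ON THE FLOOR LAUGHING
}


def supported_versions(font_codepoints):
    """Versions whose probe codepoint is present in the font's coverage."""
    return [version for version, cp in VERSION_PROBES.items()
            if cp in font_codepoints]


# Hypothetical font covering only the two oldest probes.
old_font = {0x1F601, 0x1F642}
print(supported_versions(old_font))
```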

Regards,    Martin.

-- 
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From costello at mitre.org  Thu Sep 15 06:14:58 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Thu, 15 Sep 2016 11:14:58 +0000
Subject: Default character encoding for each operating system?
Message-ID: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>

Hi Folks,
In a book that I am reading [1] the author mentions "the default character encoding for the operating system." What is the default character encoding of:

-          Windows 10

-          Mac OS

-          Linux

/Roger
[1] Practical Common Lisp by Peter Seibel, p. 165 (footnote 2).

From prosfilaes at gmail.com  Thu Sep 15 06:43:48 2016
From: prosfilaes at gmail.com (David Starner)
Date: Thu, 15 Sep 2016 11:43:48 +0000
Subject: Default character encoding for each operating system?
In-Reply-To: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
Message-ID: <CAMZ=zj4PycdmoAzhbo8+-2RpQFN7P4gVpwGT_6ExAD9d493K5w@mail.gmail.com>

Linux is far less specific than Windows 10. In all recent versions of
Debian GNU/Linux, UTF-8 is the most common character encoding, but using
ISO-8859-x, or I believe even something like EUC-JP, is still supported.
Other distributions may enforce UTF-8 or, in rare cases, ISO 8859-1 or even
something else.


From verdy_p at wanadoo.fr  Thu Sep 15 08:19:38 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 15 Sep 2016 15:19:38 +0200
Subject: Default character encoding for each operating system?
In-Reply-To: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
Message-ID: <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>

A better question is what the default character encoding is for the
**installed** operating system.

Unfortunately there is no single answer, because there are several default
encodings for several parts of the OS. An OS has lots of components, and many
of them are not transparent about the encoding they use.

All three OSes you cite support several default character encodings, and in
addition they support them in several encoding forms. All three support
Unicode internally, but not in all software components, which may run with
one encoding or another.

And the defaults will change according to your distribution or OS
configuration options, and to your own current user settings.
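
One way to see that a single installation has several "defaults" is to ask
a runtime what it believes each of them to be. The Python sketch below uses
only standard library calls; the encoding_report helper and its labels are
assumptions for illustration, and the values printed depend entirely on the
OS, its configuration, and the current user's settings.

```python
import locale
import sys


def encoding_report():
    """Collect the different 'default' encodings visible to this process."""
    return {
        "locale preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
        "Python str default": sys.getdefaultencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```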


From john.w.kennedy at gmail.com  Thu Sep 15 09:36:39 2016
From: john.w.kennedy at gmail.com (John W Kennedy)
Date: Thu, 15 Sep 2016 10:36:39 -0400
Subject: Default character encoding for each operating system?
In-Reply-To: <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
Message-ID: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>

macOS, and its offspring, iOS, watchOS, and tvOS, use UTF-16LE for all internals, but readily import and export all versions of Unicode and a good many historic 8-bit and mixed-length codings. 

In the new Swift programming language, which is white-hot in the Apple community, Apple is moving toward a model of a transparent, generic Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary, but in which a “character” contains however many code points it needs (“e” with a stacked macron, acute accent, and dieresis is algorithmically one “character” in Swift). Moreover, e-with-an-acute-accent and e followed by a combining acute accent, for example, compare as equal. At present, the underlying code is still UTF-16LE.
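
For readers without a Swift toolchain, the same two ideas can be sketched
in Python, with the caveat that Python behaves differently: its strings are
sequences of code points and == compares them literally, so the canonical
equivalence that the message describes for Swift has to be requested
explicitly through normalization. The helper canonically_equal is a
hypothetical name.

```python
import unicodedata

precomposed = "\u00E9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# e + COMBINING MACRON + COMBINING ACUTE ACCENT + COMBINING DIAERESIS:
# one user-perceived character, four code points.
stacked = "e\u0304\u0301\u0308"

print(precomposed == decomposed)
print(canonically_equal(precomposed, decomposed))
print(len(stacked), len(stacked.encode("utf-8")), len(stacked.encode("utf-16-le")))
```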

-- 
SKen Software, LLC
Coming soon to an iPhone near you


From verdy_p at wanadoo.fr  Thu Sep 15 10:25:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 15 Sep 2016 17:25:17 +0200
Subject: Default character encoding for each operating system?
In-Reply-To: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
Message-ID: <CAGa7JC3On2cgusNhZtA3Dp3=EgfaTpA8+3T7TOEk9eVgnBq_vQ@mail.gmail.com>

Not all internals. Many kernel drivers (notably bus drivers) still use an
8-bit OEM encoding in their debugging logs (most often based on a US English
locale, even if the installed version is localized to another language,
but I've seen CP850 still used; you can see some samples in the Event
Viewer). Those messages are in fact not localized at all and are intended only
for debugging or analysis by developers, or for display on a Windows console.

Many console tools on Windows still use the default 8-bit OEM charset and
won't display any Unicode output, even when the console is set to use a
Unicode codepage (I can still see some mojibake, even on Windows 10). When
those output messages are read by other UI tools, they won't be
interpreted in their own codepage but in the default "ANSI" codepage (such as
Windows-1252).
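
The effect is easy to reproduce outside Windows. The Python sketch below
encodes one character in an OEM codepage (CP850 is assumed here as the
example) and then decodes the resulting byte as Windows-1252, which is what
a tool assuming the "ANSI" codepage would do.

```python
text = "é"

# What a console tool using the OEM codepage writes.
oem_bytes = text.encode("cp850")

# What a UI tool sees if it assumes the "ANSI" codepage instead.
misread = oem_bytes.decode("cp1252")

# Decoding with the codepage that was actually used round-trips correctly.
correct = oem_bytes.decode("cp850")

print(oem_bytes, repr(misread), repr(correct))
```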

Filesystems still use legacy charsets in their basic directory structure,
e.g. on a FAT or FAT32 volume formatted without the Windows LFN
extensions (which also store filenames in UTF-16), such as an SD
card formatted in a digital camera. As the directories and filenames created
on those devices only use ASCII and uninformative names such as
IMG00001.JPG, this generally does not cause a problem, but no Unicode name
is stored. I've seen, however, some digital cameras storing some filenames in
a legacy Chinese or Japanese charset, incorrectly rendered when viewing
their content on a non-Japanese/Chinese system.

2016-09-15 16:36 GMT+02:00 John W Kennedy <john.w.kennedy at gmail.com>:

> macOS, and its offspring, iOS, watchOS, and tvOS, use UTF-16LE for all
> internals, but readily import and export all versions of Unicode and a good
> many historic 8-bit and mixed-length codings.
>
> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic Unicode
> that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary, but in which
> a "character" contains however many code points it needs ("e" with a
> stacked macron, acute accent, and dieresis is algorithmically one
> "character" in Swift). Moreover, e-with-an-acute-accent and e followed by a
> combining acute accent, for example, compare as equal. At present, the
> underlying code is still UTF-16LE.
>
> --
> SKen Software, LLC
> Coming soon to an iPhone near you
>
> On Sep 15, 2016, at 9:19 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> A better question is what is the default character encoding for the
> **installed** operating system.
>
> Unfortunately there is no single answer, because there are several default
> encodings for several parts of the OS. An OS has lots of components, and
> many of them are not transparent about the encoding they use.
>
> All three OSes you cite support several default character encodings, and
> in addition they support them in several encoding forms. All three support
> Unicode internally, but not in all software components, which will run
> with one or the other.
>
> And defaults will change according to your distribution or OS
> configuration options, and to your own current user settings.
>
> 2016-09-15 13:14 GMT+02:00 Costello, Roger L. <costello at mitre.org>:
>
>> Hi Folks,
>>
>> In a book that I am reading [1] the author mentions "the default
>> character encoding for the operating system." What is the default character
>> encoding of:
>>
>> - Windows 10
>> - Mac OS
>> - Linux
>>
>>
>> /Roger
>>
>> [1] *Practical Common Lisp* by Peter Seibel, p. 165 (footnote 2).
>>
>
>

From jsbien at mimuw.edu.pl  Thu Sep 15 14:12:53 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Thu, 15 Sep 2016 21:12:53 +0200
Subject: "textels" (was: Default character encoding for each operating system?)
In-Reply-To: <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com> (John
 W. Kennedy's message of "Thu, 15 Sep 2016 10:36:39 -0400")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
Message-ID: <86poo5f03u.fsf_-_@mimuw.edu.pl>

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a "character" contains however many code points it needs
> ("e" with a stacked macron, acute accent, and dieresis is
> algorithmically one "character" in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

For several years I have used the name "textel" (text element, in Polish
"tekstel") for such objects. I do it mostly orally in presentations
for my students, but I have also used it in writing, e.g. in
http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
definition. I provided a rudimentary definition only in my
recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply
(on p. 69) "an elementary text element independently of its Unicode
representation" (meaning in particular composed vs precomposed). I still
hope to formulate sooner or later a more satisfactory definition :-)

I think Swift confirms that such a notion is really needed.

Best regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From eliz at gnu.org  Thu Sep 15 14:27:14 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Thu, 15 Sep 2016 22:27:14 +0300
Subject: "textels" (was: Default character encoding for each operating
 system?)
In-Reply-To: <86poo5f03u.fsf_-_@mimuw.edu.pl> (jsbien@mimuw.edu.pl)
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl>
Message-ID: <83r38l55gt.fsf@gnu.org>

> From: jsbien at mimuw.edu.pl (Janusz S. Bień)
> Date: Thu, 15 Sep 2016 21:12:53 +0200
> Cc: mufi-fonts <mufi-fonts at googlegroups.com>
> 
> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:
> 
> [...]
> 
> > In the new Swift programming language, which is white-hot in the Apple
> > community, Apple is moving toward a model of a transparent, generic
> > Unicode that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary,
> > but in which a "character" contains however many code points it needs
> > ("e" with a stacked macron, acute accent, and dieresis is
> > algorithmically one "character" in Swift). Moreover,
> > e-with-an-acute-accent and e followed by a combining acute accent, for
> > example, compare as equal. At present, the underlying code is still
> > UTF-16LE.
> 
> For several years I have used the name "textel" (text element, in Polish
> "tekstel") for such objects. I do it mostly orally in presentations
> for my students, but I have also used it in writing, e.g. in
> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper
> definition.

Isn't "grapheme cluster" the definition you are looking for?

From jsbien at mimuw.edu.pl  Thu Sep 15 14:56:32 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Thu, 15 Sep 2016 21:56:32 +0200
Subject: "textels"
In-Reply-To: <83r38l55gt.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 15 Sep
 2016 22:27:14 +0300")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
Message-ID: <86r38llyxb.fsf@mimuw.edu.pl>

On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:

[...]

> Isn't "grapheme cluster" the definition you are looking for?

I don't think so.

On Thu, Sep 15 2016 at 21:27 CEST, leoboiko at namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"? (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
>
> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

Perhaps I don't understand properly the rather obscure definitions, like

        An extended grapheme cluster is the same as a legacy grapheme
        cluster, with the addition of some other characters.

However:

1. Graphemes, if I understand correctly, are language dependent, textels
are not.

2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
on a higher abstraction level than a grapheme cluster.
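
To make point 2 concrete, here is a minimal Python sketch using the
standard unicodedata module. It only checks that the two spellings are
different code point sequences yet canonically equivalent. The helper name
canonically_equivalent is an assumption made for the illustration, not
something defined in the thread.

```python
import unicodedata

precomposed = "\u0144"    # LATIN SMALL LETTER N WITH ACUTE
decomposed = "n\u0301"    # LATIN SMALL LETTER N + COMBINING ACUTE ACCENT

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Plain comparison looks at code points; the helper compares up to
# canonical equivalence.
print(precomposed == decomposed)
print(canonically_equivalent(precomposed, decomposed))
```

Under this reading, a textel would be whatever the two strings have in
common once the choice of representation is ignored.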

Moreover I don't want to call <U+006E,U+0301> (LATIN SMALL LETTER N,
COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
reasons:

1. there is nothing extended in it
2. U+0301 is not a grapheme according to Polish linguistics terminology

Regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From kenwhistler at att.net  Thu Sep 15 19:27:24 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 15 Sep 2016 17:27:24 -0700
Subject: Additional Emoji selection factor: Support by "Major Vendors"
In-Reply-To: <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>
References: <FB1F7F97-45D2-49E0-BE95-9487A0AEA5C2@crissov.de>
 <CAJ2xs_FJ9=4=Wbph7yS0Ugd-6JMSYN4Xf=Yq6nbZH6Q_5bH70Q@mail.gmail.com>
 <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de>
 <CAJ2xs_HO0gQNfFmfVduGAA25j50QfaUZGsqVeNjreiD3Njj8Og@mail.gmail.com>
 <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
 <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
 <3FE65C05-E06E-4735-B304-A4628CE10BA5@crissov.de>
 <CAJ2xs_GkZ=_DFGQtMh_XH6E6+FT9_oJbFB8bxgejwDDyiXSs3w@mail.gmail.com>
 <BF7F349B-E03A-49DD-A9D9-372DD81AD499@crissov.de>
Message-ID: <65f8b1a5-0755-c9f5-d917-59e6c17e16a1@att.net>


On 9/11/2016 5:40 AM, Christoph Päper wrote:
>> "Took no action" generally means "rejected".
> Can anyone explain then, why [L2/16-128] seems to have been "rejected" and still made it into selection.html?

Not all documents in the UTC document register are born equal.

If a document in the register is explicitly a *proposal* to encode X at 
code point Y in version Z of the Unicode Standard, then that requires a 
recorded decision by the UTC. If the UTC takes up such a document, and 
the minutes for the agenda item in question note only "UTC took no 
action at this time", that clearly indicates that as of that date the 
UTC had not *accepted* the proposal. It *might* mean that the proposal 
was rejected, but a rejection is often then also indicated with some 
action item to follow up with the proposal author. If the proposal 
author is in the room for the discussion, they might simply take notes 
about some possible future revision of the proposal, and no action need 
be formally minuted. In only a few instances would a rejection be 
minuted as a formal decision -- that case is generally limited to some 
encoding proposals that are objectionable in ways that the UTC 
determines are unlikely to be fixable, and which thus should not be 
re-discussed in future meetings.

Other kinds of documents in the register (and associated agenda items to 
discuss them) may not require minuting of formal decisions by the UTC at 
all.

>   
>
> Same minutes as above:
>
>> >E.1.11 Additional Emoji selection factor [Emoji SC/Edberg, L2/16-128]
>> >
>> >Discussion. UTC took no action at this time.
> [L2/16-128]:http://www.unicode.org/L2/L2016/16128-additional-emoji-selection-factor.pdf
>
> This was the proposed text to be added:
>

In a case like that, the UTC doesn't necessarily control the exact text 
of a web page. The emoji selection factors are not a formal 
specification or a published standard. They are guidelines that the 
Emoji Subcommittee uses to help organize and rationalize its 
consideration of all the various proposals that get submitted for 
encoding more emoji characters. That helps the Emoji Subcommittee 
assemble better summarized proposals to bring to the UTC when it is time 
to standardize some selected set of new emoji and assign code points for 
them for a new version of the Unicode Standard.

L2/16-128 was brought to the attention of the UTC by the Emoji 
Subcommittee to let the UTC know they were considering another selection 
factor, and to allow discussion and let people raise objections or make 
other suggestions. Once the Emoji Subcommittee gets that feedback, they 
could then go back and update the relevant web page regarding selection 
factors. No UTC decision is required for something like that. People who 
have a problem with one or another of the selection criteria that the 
Emoji Subcommittee has been using can always submit feedback, if they 
wish, and I'm sure the Emoji Subcommittee would take such feedback under 
advisement.

In general, I would advise people who are interested in the UTC and UTC 
process to not treat the UTC minutes as legal documents that require 
their wording to be litigated line by line. Minutes of standards 
organizations function primarily as their institutional memory about 
decisions taken and associated actions to follow up on decisions. The 
wording of such minutes tends to be brief and telegraphic, because a lot 
of topics are taken up, and a lot of decisions and actions have to be 
recorded quickly -- and their wording is usually aimed at being clear to 
the people doing the actual maintenance of the standard(s) or other 
specifications. They are not meeting transcripts, and they do not 
attempt to recapitulate discussions nor do they provide detailed 
rationales for every decision taken by the committee.

If something is unclear about some decision taken by the UTC, or the 
outcome of the discussion of some particular topic is unclear and you 
desire elucidation, the best course is often simply to ask somebody who 
attended the meeting about it. Many participants in the UTC meetings 
*do* monitor this discussion list, for example.

--Ken



From kenwhistler at att.net  Thu Sep 15 20:13:48 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 15 Sep 2016 18:13:48 -0700
Subject: Why isn't MUSICAL SYMBOL NULL NOTEHEAD default ignorable?
In-Reply-To: <CAKLR3Ap2qyht4yhEXU24hUh82YMpZRgk+CWF8XfHoUk-QOUGhA@mail.gmail.com>
References: <CAKLR3Ap2qyht4yhEXU24hUh82YMpZRgk+CWF8XfHoUk-QOUGhA@mail.gmail.com>
Message-ID: <36a257b1-2ee7-a972-f0ba-60bfdf92fdda@att.net>


On 9/5/2016 5:34 PM, Charlotte Buff wrote:
> It has just come to my attention that U+1D159 MUSICAL SYMBOL NULL 
> NOTEHEAD is not default ignorable, even though it has no visible glyph 
> appearance and no advance width in text, just like the various Hangul 
> jamo fillers that *are* default ignorable. Is there a technical reason 
> for this or is it just an oversight?

Well, the proximate reason is that it is General_Category=So, so that 
unless it were special-cased for the derivation of the Default_Ignorable 
property, it will end up Default_Ignorable=No in the UCD.
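
The General_Category value mentioned above can be checked with Python's
standard unicodedata module, as in the sketch below. Note the limits of
this illustration: unicodedata reflects whatever UCD version ships with the
Python build, and it does not expose the Default_Ignorable_Code_Point
property itself, so that would have to be looked up elsewhere (for example
in the UCD data files).

```python
import unicodedata

null_notehead = "\U0001D159"

def describe(ch: str) -> tuple:
    """Return (character name, General_Category) for a single character."""
    return unicodedata.name(ch), unicodedata.category(ch)

name, category = describe(null_notehead)
print(f"U+{ord(null_notehead):04X} {name}: General_Category={category}")
```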

As to why it wouldn't be special-cased to force it to end up 
Default_Ignorable=Yes, I don't think there was a whole lot of special 
thinking that went into this when the musical symbols were first added 
in Unicode 3.1 way back in 2001. Default_Ignorable was not even a formal 
property as of Unicode 3.1. That property was added (and rationalized) 
rather later.

As to why Default_Ignorable=No is probably the correct value for U+1D159 
anyway, think of it this way. The null notehead is essentially a musical 
notation specialized version of a non-breaking space -- it is 
essentially just a base for applying the various combining stems and 
flags for a display without showing a particular notehead, analogous to 
applying a generic combining mark to a NBSP to show that combining mark 
in isolation. It isn't clear that the null notehead should have no 
advance width, and in general, if you don't have a rendering system that 
displays such combinations correctly in context, it would arguably be 
better to show that there is some *thing* there, rather than to just 
omit any visible display at all. Such a situation is also roughly akin 
to the various synthetic virama characters in the standard, e.g., U+17D2 
KHMER SIGN COENG, which is essentially a subscript consonant stacker. 
But if you can't display Khmer conjuncts correctly, it would be better 
to display a visible glyph at that point than to just ignore it for 
display altogether. So U+17D2 is also not Default_Ignorable, even though 
it has no well-defined glyph of its own (hence the dotted box shape 
shown in the code charts). And in the case of U+17D2, when correctly 
rendered, it definitely would *not* have its own advance width, yet it 
is still not Default_Ignorable=Yes.

--Ken


From verdy_p at wanadoo.fr  Thu Sep 15 22:41:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 16 Sep 2016 05:41:23 +0200
Subject: "textels"
In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl>
 <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl>
Message-ID: <CAGa7JC35T69GJh99G5Xnj_v54vH3Z92Sbf9zV9k-_K8huOhRug@mail.gmail.com>

2016-09-15 21:56 GMT+02:00 Janusz S. Bień <jsbien at mimuw.edu.pl>:

> On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:
>
> [...]
>
> > Isn't "grapheme cluster" the definition you are looking for?
>
> I don't think so.
>
> However:
>
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>

Your definition of textels is also language dependent, as you are reading
it from a Polish point of view.
However, you are confusing "graphemes" with "grapheme clusters" here.

Your (Polish) textels are in fact the same as the (Polish) grapheme
clusters.

Unicode also defines "default grapheme clusters" that are "grapheme
clusters" not tailored for a particular language. A "default grapheme
cluster" is the minimum unbreakable unit that can be seen as a valid
"grapheme cluster" in most languages (or at least in most languages using
the same base script if the script is used in that language; in other
scripts, it just provides a minimum compatibility level to allow insertion
of foreign texts in a multilingual document).

The grapheme clusters can then be used to parse text and apply various
processes such as

  - normalization : grapheme clusters are not broken by it and can be
compared for canonical equivalences (but you can compare smaller units
using only the combining class property by breaking text on characters with
CC=0 and handling the special algorithmic case of modern Hangul syllables;
see the Unicode standard about normalization)
  - BiDi layout
  - line breaking
  - word breaking
  - most standard text transforms (such as case folding)
  - transliteration
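
As a rough illustration of the normalization item in the list above, the
following Python sketch uses only the combining class property from the
standard unicodedata module. The function names are made up for this
example, and the segmentation is deliberately naive: it breaks before every
character with combining class 0 and ignores the Hangul special case
mentioned above, so it is not an implementation of grapheme cluster
boundaries.

```python
import unicodedata

a = "a\u0301\u0323"   # a + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
b = "a\u0323\u0301"   # the same marks typed in the opposite order

def combining_classes(s: str) -> list:
    """Canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]

def starter_segments(s: str) -> list:
    """Naive split: start a new segment at each code point with CC=0."""
    segments = []
    for ch in s:
        if unicodedata.combining(ch) == 0 or not segments:
            segments.append(ch)
        else:
            segments[-1] += ch
    return segments

# Canonical reordering sorts the marks by combining class, so both
# spellings end up as the same NFD string.
nfd_a = unicodedata.normalize("NFD", a)
nfd_b = unicodedata.normalize("NFD", b)

print(combining_classes(a), combining_classes(b))
print(nfd_a == nfd_b)
print(starter_segments("n\u0301o"))
```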

Rendering text, however, often requires larger units, as successive grapheme
clusters (if not split by a line break or by BiDi reordering) will interact
visually to create more complex layouts (notably in Indic scripts), glued
together by some controls (notably joining controls); they are also
made more complex in some cases where combining classes alone cannot
properly represent these interactions.

Additionally, in a few cases, the visual order is used for encoding text
instead of the standard model using the logical order: this was done to
preserve round-trip compatibility between Unicode and widely used legacy
encodings (notably for the Thai script). However, this has a known caveat
(which already existed before Unicode) for some algorithms such as word
breaking (implementations need a lookup dictionary, but in
Thai this dictionary is not very large) and line breaking (if we don't
want to break words or break in the middle of syllables). The default
grapheme clusters, however, will correctly break the text to allow Thai text
(encoded in visual order) to be rendered correctly.

In summary, the concept of "grapheme clusters" must be read and understood
in the Unicode standard only as Unicode terminology used to describe the
other algorithms described in the standard. They are not bound to a
particular language unless this language is explicitly specified with
the term; in that case we are no longer handling the "default grapheme
clusters" rules but the additional rules that tailor the basic rules used
to define the default grapheme clusters.

The "extended grapheme clusters" are used in contexts requiring more complex
algorithms that need to group several grapheme clusters in an ordered
sequence. These algorithms require some text buffering, and parsing from a
random position in text may require looking backward over larger lengths to
determine the context. Parsing text sequentially also requires keeping some
additional context variables. Plain text searches based on "extended
grapheme clusters" are also much more challenging than searches on "default
grapheme clusters".

For these reasons, the "extended grapheme clusters" are not defined in
"default grapheme clusters" but will be needed for matching user
expectations in particular languages or scripts. You normally don't need
any "extended grapheme clusters" in Polish, except in multilingual
documents that are embedding some non-Latin scripts, or some technical
notations.


> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.
>
> Moreover I don't want to call <U+006E,U+0301> (LATIN SMALL LETTER N,
> COMBINING ACUTE ACCENT) an extended grapheme cluster for at least 2
> reasons:
>
> 1. there is nothing extended in it
>

This <U+006E,U+0301> combination is first a "grapheme cluster", before
being also an "extended grapheme cluster" in Unicode terminology.

The term "extended" comes from an extension added not for the case of
combining characters encoded after base characters (or combined with them in
a canonically equivalent string), but for other extensions, notably for
complex syllabic constructs.

Every "grapheme cluster" may also be an "extended grapheme cluster", but
the reverse is NOT true.

You have to read the standard about the various kinds of text breaking
processes.


> 2. U+0301 is not a grapheme according to Polish linguistics terminology
>

Polish linguistics uses its own Polish term, not "grapheme" as it is
defined (in English) in the standard, where it serves as the basis of
other definitions needed for parsing texts in various languages.

In Unicode U+0301 would be a grapheme, but if used in isolation it would
not form a complete grapheme cluster, only a defective grapheme cluster, as
it lacks the base with which it should be associated and which should be
encoded before it (that base cannot be a non-character or a control, even if
these are blockers against reordering for normalization processes and
canonical equivalences, and cannot be another combining character).

From eliz at gnu.org  Fri Sep 16 08:15:37 2016
From: eliz at gnu.org (Eli Zaretskii)
Date: Fri, 16 Sep 2016 16:15:37 +0300
Subject: "textels"
In-Reply-To: <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost>
 (message from William_J_G Overington on Fri, 16 Sep 2016 10:25:53
 +0100 (BST))
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost>
Message-ID: <83y42s3s06.fsf@gnu.org>

> Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST)
> From: William_J_G Overington <wjgo_10009 at btinternet.com>
> 
> jsbien at mimuw.edu.pl wrote:
> 
> > On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:
> 
> [...]
> 
> >> Isn't "grapheme cluster" the definition you are looking for?
> 
> > I don't think so.
> 
> Would an example of a textel that is definitely not a grapheme cluster be a character expressed as a base character followed by one or more tag characters?

Since no formal definition of a "textel" was presented, except via an
example, it's not clear to me whether what you propose can be a
textel.  (I also don't quite understand the semantics of a base
character followed by tag characters, to tell the truth.)

From jsbien at mimuw.edu.pl  Fri Sep 16 08:52:26 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Fri, 16 Sep 2016 15:52:26 +0200
Subject: "textels"
In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl> ("Janusz S. =?utf-8?Q?Bie=C5=84?=
 =?utf-8?Q?=22's?= message of "Thu, 15 Sep 2016 21:56:32 +0200")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
Message-ID: <864m5grlyd.fsf@mimuw.edu.pl>

On Thu, Sep 15 2016 at 21:56 CEST, jsbien at mimuw.edu.pl writes:

[...]

> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
>
> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.

In other words, textels are equivalence classes of some set of Unicode
character strings under an equivalence relation which at the moment is
open to discussion but is very close to the official Unicode
canonical equivalence (when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
convenient).

[...]


On Thu, Sep 15 2016 at 21:27 CEST, leoboiko at namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?

As for the Swift "character", perhaps someone fluent in Swift will answer
the question?

> (Well, technically UAX
> #29[1] defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> algorithmically).
>
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
>
> https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Thank you very much for the link. Let me quote the relevant fragment:

--8<---------------cut here---------------start------------->8---
 
Extended Grapheme Clusters

Every instance of Swift's Character type represents a single extended
grapheme cluster. An extended grapheme cluster is a sequence of one or
more Unicode scalars that (when combined) produce a single
human-readable character.

Here's an example. The letter é can be represented as the single Unicode
scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same
letter can also be represented as a pair of scalars: a standard letter e
(LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE
ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically
applied to the scalar that precedes it, turning an e into an é when it
is rendered by a Unicode-aware text-rendering system.

In both cases, the letter é is represented as a single Swift Character
value that represents an extended grapheme cluster. In the first case,
the cluster contains a single scalar; in the second case, it is a
cluster of two scalars:

[...]

*Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent.*

--8<---------------cut here---------------end--------------->8---

For me it means that Swift's characters are equivalence classes of the
set of extended grapheme clusters under the canonical equivalence relation.
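
The "equivalence class" reading can be sketched in Python by giving a
wrapper type equality and hashing that depend only on a normalized key.
This is an analogy, not Swift's implementation: the class name
CanonicalText is hypothetical, and the sketch does not segment text into
extended grapheme clusters (the Python standard library has no such
function), so it only models the comparison rule quoted above.

```python
import unicodedata

class CanonicalText:
    """Hypothetical wrapper: two values are equal when canonically equivalent."""

    def __init__(self, text: str) -> None:
        self.text = text
        # NFC yields one representative per canonical-equivalence class.
        self._key = unicodedata.normalize("NFC", text)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, CanonicalText):
            return NotImplemented
        return self._key == other._key

    def __hash__(self) -> int:
        return hash(self._key)

    def __repr__(self) -> str:
        return f"CanonicalText({self.text!r})"

composed = CanonicalText("\u00e9")      # single scalar
decomposed = CanonicalText("e\u0301")   # e + COMBINING ACUTE ACCENT

print(composed == decomposed)           # equal as wrapped values
print(composed.text == decomposed.text) # different underlying strings
print(len({composed, decomposed}))      # a set keeps one member per class
```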


> Such a notion is indeed needed, but it has been always there.
>
> [1] http://unicode.org/reports/tr29/

I don't see a notion of such equivalence classes there.

On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:

[...]

> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a "character" contains however many code points it needs
> ("e" with a stacked macron, acute accent, and dieresis is
> algorithmically one "character" in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
> UTF-16LE.

If you insist that Swift's "characters" are just grapheme clusters, then
you add a different, although related, meaning to the term "grapheme
cluster". I think the notion deserves a term of its own.

Best regards

Janusz


-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From eric.muller at efele.net  Fri Sep 16 10:03:54 2016
From: eric.muller at efele.net (Eric Muller)
Date: Fri, 16 Sep 2016 08:03:54 -0700
Subject: "textels"
In-Reply-To: <864m5grlyd.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl>
Message-ID: <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>

On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
> (when working on a corpus of historical Polish we
> noticed some cases where standard Unicode equivalence was not
> convenient).

I'm very interested to know more about those cases.

Thanks,
Eric.


From wjgo_10009 at btinternet.com  Fri Sep 16 04:25:53 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 16 Sep 2016 10:25:53 +0100 (BST)
Subject: "textels"
In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
Message-ID: <10038497.19619.1474017953799.JavaMail.defaultUser@defaultHost>

jsbien at mimuw.edu.pl wrote:

> On Thu, Sep 15 2016 at 21:27 CEST, eliz at gnu.org writes:

[...]

>> Isn't "grapheme cluster" the definition you are looking for?

> I don't think so.

Would an example of a textel that is definitely not a grapheme cluster be a character expressed as a base character followed by one or more tag characters?

Such a construct was first suggested for some flag characters.

William Overington

16 September 2016


From wjgo_10009 at btinternet.com  Fri Sep 16 09:07:41 2016
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 16 Sep 2016 15:07:41 +0100 (BST)
Subject: "textels"
Message-ID: <3731612.52242.1474034861874.JavaMail.defaultUser@defaultHost>

>(I also don't quite understand the semantics of a base character followed by tag characters, to tell the truth.)

Page 2 of the following document is where the idea was introduced.

http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf

The document is linked from the following page.

http://www.unicode.org/L2/L2015/Register-2015.html

William Overington

16 September 2016


From jsbien at mimuw.edu.pl  Fri Sep 16 10:30:48 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Fri, 16 Sep 2016 17:30:48 +0200
Subject: "textels"
In-Reply-To: <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl>
 <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>
Message-ID: <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl>

Quote/Cytat - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep 2016,
17:03:54):

> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>> (when working on a corpus of historical Polish we
>> noticed some cases where standard Unicode equivalence was not
>> convenient).
>
> I'm very interested to know more about those cases.

For our search engine we were unable to use compatibility equivalence  
"out of the box" for splitting the ligature because it also converted  
long s to short s while we wanted to preserve the distinction.
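
The effect can be sketched as follows. This is only an illustration, assuming the ligature in question is something like U+FB05 (LATIN SMALL LIGATURE LONG S T); the function split_ligatures is a hypothetical helper, not the code used in our search engine. It applies a single step of the compatibility decomposition to ligatures only, so a long s is left alone.

```python
import unicodedata


def split_ligatures(text: str) -> str:
    """Split compatibility ligatures one level deep, leaving other characters as-is."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)   # e.g. "<compat> 017F 0074"
        if decomp.startswith("<compat>") and "LIGATURE" in unicodedata.name(ch, ""):
            # Use only the direct mapping; do not decompose the parts any further.
            out.append("".join(chr(int(cp, 16)) for cp in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


sample = "\uFB05\u017F"                       # ligature long-s-t, then a bare long s
folded = unicodedata.normalize("NFKC", sample)  # long s is folded to plain s
kept = split_ligatures(sample)                  # long s survives
print(folded, [f"U+{ord(c):04X}" for c in kept])
```

The design choice is to stop after one decomposition step: full compatibility normalization keeps going and maps U+017F to "s", which is exactly the distinction we wanted to keep.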

Regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From eric.muller at efele.net  Fri Sep 16 10:47:27 2016
From: eric.muller at efele.net (Eric Muller)
Date: Fri, 16 Sep 2016 08:47:27 -0700
Subject: "textels"
In-Reply-To: <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl>
 <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>
 <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl>
Message-ID: <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net>

On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
> Quote/Cytat - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep 2016,
> 17:03:54):
>
>> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>>> (when working on a corpus of historical Polish we
>>> noticed some cases where standard Unicode equivalence was not
>>> convenient).
>>
>> I'm very interested to know more about those cases.
>
> For our search engine we were unable to use compatibility equivalence 
> "out of the box" for splitting the ligature because it also converted 
> long s to short s while we wanted to preserve the distinction.

I am interested in the problems with *canonical* equivalence. I thought 
that you were talking about those before.

Compatibility equivalence is a completely different beast. It is, IMHO, 
too coarse a tool and best forgotten. For any particular task, it's 
typically doing too much (e.g. long/short s folding in your case) and 
too little (not everything you need). There was an attempt at improving 
the situation, by providing a whole bunch of fine grained, targeted 
transformations (http://www.unicode.org/reports/tr30/), but that did not 
pan out.

Eric.




From jsbien at mimuw.edu.pl  Fri Sep 16 10:57:44 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Fri, 16 Sep 2016 17:57:44 +0200
Subject: "textels"
In-Reply-To: <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl> <864m5grlyd.fsf@mimuw.edu.pl>
 <7f3e554b-a060-152c-f0f0-ae9908d857b8@efele.net>
 <20160916173048.183359euk7i1fktk@mail.mimuw.edu.pl>
 <8934bf10-6fd2-2e87-0260-4706e3f22119@efele.net>
Message-ID: <20160916175744.11941cx23il9zp14@mail.mimuw.edu.pl>

Quote/Cytat - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep 2016,
17:47:27):

> On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
>> Quote/Cytat - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep
>> 2016, 17:03:54):
>>
>>> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>>>> (when working on a corpus of historical Polish we
>>>> noticed some cases where standard Unicode equivalence was not
>>>> convenient).
>>>
>>> I'm very interested to know more about those cases.
>>
>> For our search engine we were unable to use compatibility  
>> equivalence "out of the box" for splitting the ligature because it  
>> also converted long s to short s while we wanted to preserve the  
>> distinction.
>
> I am interested in the problems with *canonical* equivalence. I  
> thought that you were talking about those before.

I apologize for the confusion, that was my fault. I tend to answer too  
quickly and not precisely enough :-(

On the other hand I'm not sure canonical equivalence is always what I  
want and expect, but I don't have specific examples at hand.

Regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From christoph.paeper at crissov.de  Fri Sep 16 16:51:38 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 16 Sep 2016 23:51:38 +0200
Subject: "textels"
In-Reply-To: <86r38llyxb.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
Message-ID: <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>

Janusz S. Bień <jsbien at mimuw.edu.pl>:
> 
> 1. Graphemes, if I understand correctly, are language dependent, ...

That's true in linguistic terminology, or at least within the more popular schools of thought, but not in technical (i.e. Unicode) jargon.

From mats.gbproject at gmail.com  Sat Sep 17 04:19:59 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sat, 17 Sep 2016 11:19:59 +0200
Subject: Dataset for all ISO639 code sorted by country/territory?
Message-ID: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>

Hi

Is there any dataset that contains all languages in the world sorted by
country/territory?

I found this at Unicode, but it seems to contain only the most widely spoken
languages in each country and not the smaller ones:
http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html

Thanks in advance for help.

Best regards
Mats Blakstad

From otto.stolz at uni-konstanz.de  Sat Sep 17 06:27:02 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Sat, 17 Sep 2016 13:27:02 +0200
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>
References: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>
Message-ID: <57DD2886.2060408@uni-konstanz.de>

Hello,

am 2016-09-17 um 11:19 Uhr hat Mats Blakstad geschrieben:
> Is there any dataset that contains all languages in the world sorted by
> country/territory?

Have you tried <http://www.ethnologue.com/>, already?

Also, <http://www.sil.org/>
and <http://www-01.sil.org/iso639-3/codes.asp>
may provide partial answers.

Best wishes,
   Otto Stolz


From verdy_p at wanadoo.fr  Sat Sep 17 06:35:20 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 17 Sep 2016 13:35:20 +0200
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>
References: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>
Message-ID: <CAGa7JC2cE0-BkQJhGeN=C_qJ0z8Ef5Vh15tu8hY2SYCOcZS6Eg@mail.gmail.com>

Not all languages are sorted, only those for which data has been released
in CLDR.
Languages also frequently belong to several countries/territories at the
same time, with different official or recognized status (which is itself
independent of the number of actual speakers, a number that is very often
only roughly estimated).
Some countries publish official statistics about their national or
regional languages, but these statistics are frequently old, or under- or
overestimated for political reasons; some languages are counted together as
if they were one, or simply discarded when they are considered locally as
secondary languages, even where the official language is only superficially
understood yet is counted as the primary one.
Statistics also tend to omit native speakers living abroad in a diaspora,
and secondary learners of a language taught in foreign countries.


2016-09-17 11:19 GMT+02:00 Mats Blakstad <mats.gbproject at gmail.com>:

> Hi
>
> Is there any dataset that contains all languages in the world sorted by
> country/territory?
>
> I found this at Unicode, but seems like only containing the most spoken
> languages in each country and not the smaller once:
> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>
> Thanks in advance for help.
>
> Best regards
> Mats Blakstad
>

From mats.gbproject at gmail.com  Sat Sep 17 07:10:26 2016
From: mats.gbproject at gmail.com (Mats Blakstad)
Date: Sat, 17 Sep 2016 14:10:26 +0200
Subject: Dataset for all ISO639 code sorted by country/territory?
In-Reply-To: <CAGa7JC2cE0-BkQJhGeN=C_qJ0z8Ef5Vh15tu8hY2SYCOcZS6Eg@mail.gmail.com>
References: <CAP=1PAVw6q6fwBQxV1k4jJ5se4oDXRWYeuzq9o2N6pB7rvv54A@mail.gmail.com>
 <CAGa7JC2cE0-BkQJhGeN=C_qJ0z8Ef5Vh15tu8hY2SYCOcZS6Eg@mail.gmail.com>
Message-ID: <CAP=1PAXYsp3MP1yF=xabp8-7zhtFs3NdKiYkH2avvE5oi8fnrg@mail.gmail.com>

I managed to find a dataset on the Ethnologue website, though it doesn't
look like it is open source; I need to check with them exactly how I'm
allowed to use it:
http://www.ethnologue.com/codes/download-code-tables

Thanks for the explanation, Philippe. I know it is not an easy issue. I am
looking at different resources on the web, so any specific links or feedback
would be helpful.

On 17 September 2016 at 13:35, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> Not all languages are sorted, only those for which there are released data
> in CLDR.
> And languages frequently belong to several countries/territories at the
> same time, with different official or recognized status (itself independant
> of the number of actual speakers, which is very frequently roughly
> estimated).
> Some countries are giving official statistics about their national or
> regional languages, but frequently these stats are old, or underestimated
> or overestimated for political reasons, or some languages are mixed as if
> they were only one, or simply discarded if it is considered locally as a
> secondary language, even if the official language is superficially
> understood but taken as a primary one.
> Statistics are also forgetting native speakers living abroad in a
> diaspora, or secondary learners of a language taught in foreign countries.
>
>
> 2016-09-17 11:19 GMT+02:00 Mats Blakstad <mats.gbproject at gmail.com>:
>
>> Hi
>>
>> Is there any dataset that contains all languages in the world sorted by
>> country/territory?
>>
>> I found this at Unicode, but seems like only containing the most spoken
>> languages in each country and not the smaller once:
>> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>>
>> Thanks in advance for help.
>>
>> Best regards
>> Mats Blakstad
>>
>
>

From deepak.jois at gmail.com  Sat Sep 17 06:31:10 2016
From: deepak.jois at gmail.com (Deepak Jois)
Date: Sat, 17 Sep 2016 17:01:10 +0530
Subject: =?UTF-8?Q?Unicode_Bidi_Algorithm_=E2=80=93_Java_reference_implementa?=
 =?UTF-8?Q?tion?=
Message-ID: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>

Hi

It seems that the Java reference implementation for the Unicode Bidi
algorithm that I downloaded from the unicode.org site fails against
some test cases in the BidiCharacterTest.txt file, namely the ones that are
specifically meant to test for changes in Unicode 8.0.
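
For anyone who wants to check an implementation against that file, here is a minimal sketch of a parser for one test line. It assumes the five semicolon-separated fields described in the file's own header comments (code points; paragraph direction; resolved paragraph level; resolved levels, with "x" for characters removed by rule X9; visual order). The names BidiCase and parse_case are hypothetical, and the sample line is made up in the file's format rather than copied from it.

```python
from typing import List, NamedTuple, Optional


class BidiCase(NamedTuple):
    text: str
    paragraph_direction: int        # 0 = LTR, 1 = RTL, 2 = auto
    paragraph_level: int
    levels: List[Optional[int]]     # None marks a character removed by rule X9
    visual_order: List[int]


def parse_case(line: str) -> Optional[BidiCase]:
    """Parse one data line; return None for blank or comment-only lines."""
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    cps, direction, level, levels, order = line.split(";")
    return BidiCase(
        text="".join(chr(int(cp, 16)) for cp in cps.split()),
        paragraph_direction=int(direction),
        paragraph_level=int(level),
        levels=[None if lv == "x" else int(lv) for lv in levels.split()],
        visual_order=[int(i) for i in order.split()],
    )


sample = "0061 0062;0;0;0 0;0 1"    # made-up line, not taken from the real file
case = parse_case(sample)
print(case)
```

A test driver would feed case.text and case.paragraph_direction to the implementation under test and compare its levels and ordering with the expected fields.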

Has the reference implementation been updated, and does anyone have a
copy they can share? Is there a reference implementation in some other
language that I could look at, which has been updated?

Thank you
Deepak


From khaledhosny at eglug.org  Sat Sep 17 11:23:51 2016
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Sat, 17 Sep 2016 18:23:51 +0200
Subject: Unicode Bidi Algorithm =?utf-8?B?4oCT?= =?utf-8?Q?_Java?=
 reference implementation
In-Reply-To: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>
References: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>
Message-ID: <20160917162351.GB1339@macbook>

On Sat, Sep 17, 2016 at 05:01:10PM +0530, Deepak Jois wrote:
> Hi
> 
> It seems that the Java reference implementation for the Unicode Bidi
> algorithm that I downloaded from the unicode.org site fails against
> some test cases in the BidiCharacterTest.txt file, namely the ones that are
> specifically meant to test for changes in Unicode 8.0.
> 
> Has the reference implementation been updated, and does anyone have a
> copy they can share? Is there a reference implementation in some other
> language that I could look at, which has been updated?

I think there is a C implementation that is kept up to date, and there
is also a Python implementation that should pass the tests:
https://github.com/behdad/pybyedie

Regards,
Khaled

From deepak.jois at gmail.com  Sat Sep 17 12:26:55 2016
From: deepak.jois at gmail.com (Deepak Jois)
Date: Sat, 17 Sep 2016 22:56:55 +0530
Subject: =?UTF-8?Q?Re=3A_Unicode_Bidi_Algorithm_=E2=80=93_Java_reference_implem?=
 =?UTF-8?Q?entation?=
In-Reply-To: <20160917162351.GB1339@macbook>
References: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>
 <20160917162351.GB1339@macbook>
Message-ID: <CABR1-XY-A5Cz+asb5hmLE9KF2bbG0=BHPROL6mwu8F6gt36U3Q@mail.gmail.com>

On Sat, Sep 17, 2016 at 9:53 PM, Khaled Hosny <khaledhosny at eglug.org> wrote:
> I think there is a C implementation that is kept up to date,

Yes, I found that one after I posted. FWIW, here are the changes for
the latest version:

https://gist.github.com/deepakjois/5a3ae81a105abd3523ed0efe2e52f52e/revisions

> is also a Python implementation that should pass the tests

That implementation looks very different from the C and Java versions.
I can't tell at a glance whether it has been updated for the
changes in Unicode 8.0. But it definitely will not pass the tests in
BidiCharacterTest.txt because it lacks support for paired brackets.

I just finished writing a reference implementation in Lua[1] which is
a line by line port of the Java reference implementation and passes
nearly all tests in BidiCharacterTest.txt.

I now need to make the updates to support the changes in Unicode 8.0,
and I am finding it a bit hard to grok the changes in C at a glance.

Deepak

[1]: https://github.com/deepakjois/luabidi/blob/master/src/bidi.lua


From jsbien at mimuw.edu.pl  Sun Sep 18 05:26:26 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bien)
Date: Sun, 18 Sep 2016 12:26:26 +0200
Subject: "textels"
In-Reply-To: <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
Message-ID: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>

Quote/Cytat - Christoph Päper <christoph.paeper at crissov.de> (Fri, 16
Sep 2016, 23:51:38):

> Janusz S. Bień <jsbien at mimuw.edu.pl>:
>>
>> 1. Graphemes, if I understand correctly, are language dependent, ...
>
> That's true in linguistic terminology, or at least within the
> more popular schools of thought, but not in technical (i.e.
> Unicode) jargon.

 From the Unicode glossary:

Grapheme. (1) A minimally distinctive unit of writing in the context  
of a particular writing system.[...] (2) What a user thinks of as a  
character.

As for (2), cf.

User-Perceived Character. What everyone thinks of as a character in  
their script.

So we have "a user" versus "everyone...in their script" - is the  
difference intentional? Probably not. Anyway the definitions are  
language/locale dependent.

Regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From christoph.paeper at crissov.de  Sun Sep 18 14:40:21 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Sun, 18 Sep 2016 21:40:21 +0200
Subject: "textels"
In-Reply-To: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
Message-ID: <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>

Janusz S. Bien <jsbien at mimuw.edu.pl>:
> 
> From the Unicode glossary:
> 
>> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character.
> 
>> User-Perceived Character. What everyone thinks of as a character in their script.
> 
> [...] the definitions are language/locale dependent.

A writing system is (usually) language-dependent; a script is not, although some scripts have been used exclusively (or prominently) in a single writing system with a single language. So definition (1) of "grapheme" would be appropriate for linguistics, (2) maybe for typography and computer science, but it's extremely vague.

From asmusf at ix.netcom.com  Sun Sep 18 15:02:01 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sun, 18 Sep 2016 13:02:01 -0700
Subject: "textels"
In-Reply-To: <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
Message-ID: <c7bf75f6-56a6-96db-b028-458d842909ff@ix.netcom.com>

On 9/18/2016 3:26 AM, Janusz S. Bien wrote:
> Quote/Cytat - Christoph Päper <christoph.paeper at crissov.de> (Fri, 16
> Sep 2016, 23:51:38):
>
>> Janusz S. Bień <jsbien at mimuw.edu.pl>:
>>>
>>> 1. Graphemes, if I understand correctly, are language dependent, ...
>>
>> That's true in linguistic terminology, or at least within the
>> more popular schools of thought, but not in technical (i.e.
>> Unicode) jargon.
>
> From the Unicode glossary:
>
> Grapheme. (1) A minimally distinctive unit of writing in the context 
> of a particular writing system.[...] (2) What a user thinks of as a 
> character.

"writing system" is vague enough to cover variations that might be 
regional or language dependent.
>
> As for (2), cf.
>
> User-Perceived Character. What everyone thinks of as a character in 
> their script.
>
> So we have "a user" versus "everyone...in their script" - is the 
> difference intentional? Probably not. Anyway the definitions are 
> language/locale dependent.

The "everyone" here aims at a shared understanding.

This becomes tricky in the case of abugidas. There's certainly a shared
understanding that the "unit of writing" is the syllable, rather than an
individual mark, but the marks do have well-understood identities, not
least for teaching. That's perhaps the reason why there's the handwaving
about "minimally distinctive".

In some scripts like that, users can enter multiple sequences of
characters that resolve (for all practical purposes) into the same
syllable. (A big part of that in some scripts is that Unicode does not
always provide a means to normalize the order of subsidiary signs and
marks, typically combining marks.)

For some tasks it would be great to have only well-formed syllables; but 
to do that, you would need to add additional interpretation on top of 
the Unicode definitions of a grapheme cluster.

If you just wrap the raw combining sequences into textels, then some
tasks might not actually get simpler. Instead of a simple rule that
determines which alternate orderings of marks are equivalent (to account
for users not typing them in the preferred order) you would have to
exhaustively list all combinations and set up equivalence tables.
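
To make the contrast concrete, here is a small sketch using Python's standard unicodedata module. The characters are picked purely for illustration. Where marks have different non-zero canonical combining classes, normalization is the "simple rule": both typing orders come out the same. Where the marks all have combining class 0, as for the two Devanagari signs below, normalization leaves the typed order alone and the two sequences stay distinct.

```python
import unicodedata


def nfd(s: str) -> str:
    """Canonical decomposition, which includes canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


ACUTE, DOT_BELOW = "\u0301", "\u0323"   # combining classes 230 and 220
KA, ANUSVARA, VOWEL_SIGN_I = "\u0915", "\u0902", "\u093F"   # both signs: class 0

# Different combining classes: either typing order normalizes to one form.
latin_orders_equal = nfd("a" + ACUTE + DOT_BELOW) == nfd("a" + DOT_BELOW + ACUTE)

# Combining class 0: normalization never reorders, so the orders stay distinct.
indic_orders_equal = nfd(KA + ANUSVARA + VOWEL_SIGN_I) == nfd(KA + VOWEL_SIGN_I + ANUSVARA)

print(latin_orders_equal, indic_orders_equal)
```

Any equivalence between the two Devanagari orders would therefore have to come from an extra layer of interpretation, which is the additional work described above.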

A./

From kenwhistler at att.net  Sun Sep 18 19:16:50 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Sun, 18 Sep 2016 17:16:50 -0700
Subject: =?UTF-8?Q?Re:_Unicode_Bidi_Algorithm_=e2=80=93_Java_reference_imple?=
 =?UTF-8?Q?mentation?=
In-Reply-To: <CABR1-XY-A5Cz+asb5hmLE9KF2bbG0=BHPROL6mwu8F6gt36U3Q@mail.gmail.com>
References: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>
 <20160917162351.GB1339@macbook>
 <CABR1-XY-A5Cz+asb5hmLE9KF2bbG0=BHPROL6mwu8F6gt36U3Q@mail.gmail.com>
Message-ID: <61882234-899a-84bf-3fac-017d27af553a@att.net>


On 9/17/2016 10:26 AM, Deepak Jois wrote:
> I now need to make the updates to support the changes in Unicode 8.0,
> and I am finding it a bit hard to grok the changes in C at a glance.
>

The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change 
much about the gross behavior of the algorithm, but there were some 
fixes for edge cases in a couple rules. Also, the specification of 
behavior on stack overflow became exact, rather than implementation-defined.

The C bidi reference code is a bit complicated, because it supports 
*all* UBA versions from 6.2 through 8.0, which means it has to special 
case rule processing by versions when the specification itself changes.

If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c 
you'll find the heart of the differences there, along with explanations 
in comments for the changes. The new function br_SetBracketPairBC 
handles an edge case for combining marks following a bracket. The code 
using a new flag testONisNotRequired deals with an edge case for the 
current Bidi_Class of brackets being tested for pairing. Changes in 
br_PushBracketStack are involved in the need to keep the pre-8.0 
behavior as it was for earlier versions of bidiref, but allowing for 
explicit behavior for stack overflow for 8.0.

It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, 
so you can see the textual changes in the specification of the rules. 
Try diffing:

http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
http://www.unicode.org/reports/tr9/tr9-33.html (8.0)

The significant changes there are in BD11, BD14, BD15, BD16, and in 
rules X5a, X5b, X6a, and N0. (The rest of the changes in the updated 
document are cosmetic.)
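
If no diff tool is at hand, the comparison can be done with a few lines of Python. This is only a sketch: changed_lines is a made-up helper, and the file names in the commented usage are placeholders for wherever you saved the two versions.

```python
import difflib


def changed_lines(old_text: str, new_text: str, old_name: str = "old", new_name: str = "new"):
    """Return a unified diff of two texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


# Hypothetical usage with locally saved copies of the two versions:
# with open("brrule-7.0.c") as old, open("brrule-8.0.c") as new:
#     print("\n".join(changed_lines(old.read(), new.read(), "7.0", "8.0")))

demo = changed_lines("rule N0\nold text\n", "rule N0\nnew text\n")
print("\n".join(demo))
```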

--Ken


From jsbien at mimuw.edu.pl  Mon Sep 19 01:23:53 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Mon, 19 Sep 2016 08:23:53 +0200
Subject: User-perceived character (was: "textels")
In-Reply-To: <c7bf75f6-56a6-96db-b028-458d842909ff@ix.netcom.com> (Asmus
 Freytag's message of "Sun, 18 Sep 2016 13:02:01 -0700")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <c7bf75f6-56a6-96db-b028-458d842909ff@ix.netcom.com>
Message-ID: <86h99c5rwm.fsf_-_@mimuw.edu.pl>

On Sun, Sep 18 2016 at 22:02 CEST, asmusf at ix.netcom.com writes:
> On 9/18/2016 3:26 AM, Janusz S. Bien wrote:

[...]

>> From the Unicode glossary:
>>
>> Grapheme. (1) A minimally distinctive unit of writing in the context
>> of a particular writing system.[...] (2) What a user thinks of as a
>> character.
>
> "writing system" is vague enough to cover variations that might be
> regional or language dependent.

That is obvious for me.

>>
>> As for (2), cf.
>>
>> User-Perceived Character. What everyone thinks of as a character in
>> their script.
>>
>> So we have "a user" versus "everyone...in their script" - is the
>> difference intentional? Probably not. Anyway the definitions are
>> language/locale dependent.
>
> The "everyone" here aims at a shared understanding.

That's also quite obvious for me.

"A user" in grapheme (2) is strange, to say the least.

>
> This becomes tricky in the case of abugidas. There's certainly a
> shared understanding that the "unit of writing" is the syllable,
> rather than an individual mark, but the marks do have well-understood
> identities, not least for teaching. That's perhaps the reason why
> there's the handwaving about "minimally distinctive".
>
> In some scripts like that, users can enter multiple sequences of
> characters that resolve (for all practical purposes) into the same
> syllable. (A big part of that in some scripts is that Unicode does not
> always provide a means to normalize the order of subsidiary signs and
> marks, typically combining marks)
>
> For some tasks it would be great to have only well-formed syllables;
> but to do that, you would need to add additional interpretation on top
> of the Unicode definitions of a grapheme cluster.
>
> If you just wrap the raw combining sequences into textels, then some
> tasks might not actually get simpler. Instead of a simple rule that
> determines which alternate orderings of marks are equivalent (to
> account for users not typing them in the preferred order) you would
> have to exhaustively list all combinations and set up equivalent
> tables.

I would like to know how Swift is handling this. I still have a feeling
that the Swift characters are almost exactly my textels.

Best regards

Janusz

-- 
Prof. dr hab. Janusz S. Bień -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From verdy_p at wanadoo.fr  Mon Sep 19 01:29:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 19 Sep 2016 08:29:12 +0200
Subject: =?UTF-8?Q?Re=3A_Unicode_Bidi_Algorithm_=E2=80=93_Java_reference_implem?=
 =?UTF-8?Q?entation?=
In-Reply-To: <61882234-899a-84bf-3fac-017d27af553a@att.net>
References: <CABR1-XYoeV8=w2mWuG1EyXw4+xm=5eaUerhjcP30PTOA-A+h_A@mail.gmail.com>
 <20160917162351.GB1339@macbook>
 <CABR1-XY-A5Cz+asb5hmLE9KF2bbG0=BHPROL6mwu8F6gt36U3Q@mail.gmail.com>
 <61882234-899a-84bf-3fac-017d27af553a@att.net>
Message-ID: <CAGa7JC370+2UJk8cZOR=71mJJx7pA8vivSV=m3RsMynvGcg5YQ@mail.gmail.com>

I note that there's a confusion in the introduction of UAX#9:

"On web pages, the explicit directional formatting characters (of all types
– embedding, override, and isolate) should be replaced by using the dir
attribute and the elements BDI and BDO."

The suggested replacements do not match the order of the listed types.
- embedding (with LRE/PDF or RLE/PDF) just uses the dir="ltr/rtl" attribute
on any element (except BDI and BDO)
- override (with LRO/PDF or RLO/PDF) uses BDO with
the dir="ltr/rtl" attribute
- explicit isolate (with LRI/PDI or RLI/PDI) uses BDI with
the dir="ltr/rtl" attribute
- "automatic" isolate (with FSI/PDI) uses BDI without any dir attribute

The two implicit directional marks (LRM and RLM) can also be converted to
overrides, as an empty BDO element with dir="ltr/rtl". Only ALM has no
equivalent.
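
The mapping listed above can be sketched as a small converter. This is a simplification written for illustration, not a full implementation of the UBA's pairing rules: controls_to_html is a made-up name, SPAN stands in for "any element", a terminator that does not match the innermost open control is simply dropped, and LRM/RLM/ALM are passed through unchanged.

```python
import html

# Opening control -> (opening tag, closing tag, name of the terminator it expects)
OPENERS = {
    "\u202A": ('<span dir="ltr">', "</span>", "PDF"),   # LRE
    "\u202B": ('<span dir="rtl">', "</span>", "PDF"),   # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>", "PDF"),     # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>", "PDF"),     # RLO
    "\u2066": ('<bdi dir="ltr">', "</bdi>", "PDI"),     # LRI
    "\u2067": ('<bdi dir="rtl">', "</bdi>", "PDI"),     # RLI
    "\u2068": ("<bdi>", "</bdi>", "PDI"),               # FSI
}
CLOSERS = {"\u202C": "PDF", "\u2069": "PDI"}


def controls_to_html(text: str) -> str:
    """Replace explicit directional formatting characters with HTML markup."""
    out, stack = [], []
    for ch in text:
        if ch in OPENERS:
            open_tag, close_tag, terminator = OPENERS[ch]
            out.append(open_tag)
            stack.append((close_tag, terminator))
        elif ch in CLOSERS:
            # Close only when the terminator matches the innermost open control.
            if stack and stack[-1][1] == CLOSERS[ch]:
                out.append(stack.pop()[0])
        else:
            out.append(html.escape(ch))
    while stack:                      # close anything left open at the end
        out.append(stack.pop()[0])
    return "".join(out)


print(controls_to_html("name: \u2068\u05D0\u05D1\u2069!"))
```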

----

But in most cases, HTML documents should simply not use embedding or
override at all. Isolates with BDI are much preferred and are in fact
simpler to manage than what section 6.4 suggests. That suggestion, using RLM
or LRM before the separating punctuation, does not work reliably, as it
implies that you can predict the implicit reading direction of the whole
list, whose ordering normally depends on the context or on the document
containing the list. It is much simpler to isolate each list element and
then pack the list using the unmarked punctuation.

An example of this is found on international wikis that must display an
inter-language bar to navigate to other translated versions of the same
page. The same template is used on all pages, and the list of
languages cannot be predicted and may evolve over time, containing LTR or RTL
language names at unpredictable positions anywhere in the list. The names
are formatted with the same separator within a single inline span, in a
paragraph starting with a translatable introductory heading, and you cannot
predict which language name will occur after that separator. Using BDI
(without even needing any dir="ltr/rtl") or FSI/PDI to isolate each language
name will work much better than unconditionally using some static RLM or LRM
before the separating punctuation. (Note that there is no such punctuation at
the start of the list, so the ordering of the first element is not set
correctly unless there is also an RLM or LRM before that first element, which
may then render incorrectly.)

The best and most flexible solution is to use "automatic" isolates for each
list item (with FSI/PDI in plain-text documents, or BDI elements without
any dir attribute in HTML documents). The same is also true when inserting
quotations (including when giving the title of another document, or the
name of an author) or for formatting translatable text containing
"placeholder variables" whose content will be generated separately. BDI
elements without any dir attribute can efficiently replace SPAN elements,
and can still have their own optional formatting styles (colors, font
families, font size, line height, font styles and weight, visual
effects...), or title attributes (to give hints to readers about what the
isolate value will be used for), or identifier (useful to generate stable
anchors that work across all translations of the document).
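
As a minimal sketch of this approach (the function names and the separator are my own choices for the example), each list item is wrapped in its own automatic isolate, either with FSI/PDI for plain text or with a BDI element without dir for HTML, and the separators stay outside the isolates:

```python
import html

FSI = "\u2068"   # FIRST STRONG ISOLATE
PDI = "\u2069"   # POP DIRECTIONAL ISOLATE


def join_isolated_text(items, sep=" | "):
    """Plain text: wrap every item in FSI ... PDI, then join with the separator."""
    return sep.join(f"{FSI}{item}{PDI}" for item in items)


def join_isolated_html(items, sep=" | "):
    """HTML: wrap every item in a BDI element without any dir attribute."""
    return sep.join(f"<bdi>{html.escape(item)}</bdi>" for item in items)


languages = ["English", "\u05E2\u05D1\u05E8\u05D9\u05EA", "Fran\u00E7ais"]
print(join_isolated_html(languages))
```

Because every item carries its own isolate, the template does not need to know in advance which names are LTR and which are RTL, or in which order they appear.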

There are also CSS styles using unicode-bidi properties, but they should be
completely avoided in HTML (these styles are better inferred from BDI
elements).


2016-09-19 2:16 GMT+02:00 Ken Whistler <kenwhistler at att.net>:

>
> On 9/17/2016 10:26 AM, Deepak Jois wrote:
>
>> I now need to make the updates to support the changes in Unicode 8.0,
>> and I am finding it a bit hard to grok the changes in C at a glance.
>>
>>
> The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change
> much about the gross behavior of the algorithm, but there were some fixes
> for edge cases in a couple rules. Also, the specification of behavior on
> stack overflow became exact, rather than implementation-defined.
>
> The C bidi reference code is a bit complicated, because it supports *all*
> UBA versions from 6.2 through 8.0, which means it has to special case rule
> processing by versions when the specification itself changes.
>
> If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c
> you'll find the heart of the differences there, along with explanations in
> comments for the changes. The new function br_SetBracketPairBC handles an
> edge case for combining marks following a bracket. The code using a new
> flag testONisNotRequired deals with an edge case for the current Bidi_Class
> of brackets being tested for pairing. Changes in br_PushBracketStack are
> involved in the need to keep the pre-8.0 behavior as it was for earlier
> versions of bidiref, but allowing for explicit behavior for stack overflow
> for 8.0.
>
> It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, so
> you can see the textual changes in the specification of the rules. Try
> diffing:
>
> http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
> http://www.unicode.org/reports/tr9/tr9-33.html (8.0)
>
> The significant changes there are in BD11, BD14, BD15, BD16, and in rules
> X5a, X5b, X6a, and N0. (The rest of the changes in the updated document are
> cosmetic.)
>
> --Ken
>
>
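
To make the stack-overflow point concrete, here is a minimal Python sketch
of bracket-pair matching with a bounded stack. It is not the bidiref code:
the function name bracket_pairs, the ASCII-only bracket table and the
max_depth parameter are assumptions made for illustration. The limit of 63
and the "stop on overflow" behaviour reflect my reading of BD16 in UAX #9
for 8.0 and should be checked against the specification. Bidi_Class checks
and canonical equivalents of brackets are deliberately ignored.

```python
# Hypothetical, simplified sketch of BD16-style bracket pairing.
# Assumption: only ASCII brackets are considered.
OPEN_TO_CLOSE = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPEN_TO_CLOSE.values())


def bracket_pairs(run, max_depth=63):
    """Return sorted (open_pos, close_pos) pairs found in `run`."""
    stack = []  # entries are (expected closing bracket, position of opener)
    pairs = []
    for pos, ch in enumerate(run):
        if ch in OPEN_TO_CLOSE:
            if len(stack) == max_depth:
                # Stack overflow: stop pairing for the rest of the run,
                # keeping the pairs found so far (assumed 8.0 behaviour).
                break
            stack.append((OPEN_TO_CLOSE[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack downwards for a match.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    del stack[depth:]  # pop through the matched element
                    break
    return sorted(pairs)


print(bracket_pairs("a(b[c]d)"))             # [(1, 7), (3, 5)]
print(bracket_pairs("()((()", max_depth=2))  # [(0, 1)]
```

The second call uses an artificially small stack so that the overflow path
is easy to observe.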

From jsbien at mimuw.edu.pl  Mon Sep 19 01:40:05 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Mon, 19 Sep 2016 08:40:05 +0200
Subject: graphemes (was: "textels")
In-Reply-To: <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de> ("Christoph
 =?utf-8?Q?P=C3=A4per=22's?= message of "Sun, 18 Sep 2016 21:40:21 +0200")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
Message-ID: <86d1k05r5m.fsf_-_@mimuw.edu.pl>

On Sun, Sep 18 2016 at 21:40 CEST, christoph.paeper at crissov.de writes:
> Janusz S. Bien <jsbien at mimuw.edu.pl>:
>> 
>> From the Unicode glossary:
>> 
>>> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character.
>> 
>>> User-Perceived Character. What everyone thinks of as a character in their script.
>> 
>> […] the definitions are language/locale dependent.
>
> A writing system is (usually) language-dependent, a script is not,
> although some scripts have been used exclusively (or prominently) in a
> single writing system with a single language.

It depends, of course, on what exactly you mean by script, and which
meaning of the term is intended in the definition of User-Perceived
Character. But "a user" is definitely language/locale dependent :-)

> So definition (1) of “grapheme” would be appropriate for linguistics,
> (2) maybe for typography and computer science, but it’s extremely
> vague.

I think that 'grapheme' (2) in the present wording is simply
incorrect. I suspect it is not used in the standard at all.

Searching the Unicode site I found only one use of 'grapheme' alone:

http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

        Graphemes are sequences of one or more encoded characters that
        correspond to what users think of as characters.

I guess the intention of 'grapheme' (2) was to describe it without any
reference to computer encoding, which is definitely an extremely
difficult task.
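
To illustrate the quoted sentence ("sequences of one or more encoded
characters"), here is a rough Python sketch that groups each base character
with the combining marks that follow it. This is only an approximation and
not the UAX #29 grapheme cluster algorithm: the function name
simple_clusters is hypothetical, and marks with combining class 0, Hangul
jamo, ZWJ sequences and so on are not handled.

```python
import unicodedata


def simple_clusters(text):
    """Group each character with any following nonzero-class combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


decomposed = "Bien\u0301"            # "Bień" with a combining acute accent
print(len(decomposed))                # 5 code points
print(len(simple_clusters(decomposed)))  # 4 clusters
```

The point is only that the unit a reader counts can span several encoded
characters; which sequences count is exactly what the definitions disagree
about.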

Best regards

Janusz


-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From Mark.Dalley at swcsu.nhs.uk  Mon Sep 19 03:45:56 2016
From: Mark.Dalley at swcsu.nhs.uk (Dalley Mark (South West Commissioning Support))
Date: Mon, 19 Sep 2016 08:45:56 +0000
Subject: graphemes (was: "textels")
In-Reply-To: <86d1k05r5m.fsf_-_@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
Message-ID: <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>

I think the key phrase is "user-perceived". And you don't need to involve complex scripts either.

For instance, as an English-speaking person, I would perceive the "æ" in "encyclopædia" as being two characters (albeit shoved together somewhat). The argument for this is that the word can equally well be rendered as "encyclopaedia".

A Danish or Norwegian speaker, on the other hand, would perceive "æ" (as in "ære" or "æsj!") as being a single indivisible character.

Mark Dalley

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Janusz S. Bien
Sent: 19 September 2016 07:40
To: Christoph Päper
Cc: unicode Unicode Discussion
Subject: graphemes (was: "textels")

On Sun, Sep 18 2016 at 21:40 CEST, christoph.paeper at crissov.de writes:
> Janusz S. Bien <jsbien at mimuw.edu.pl>:
>> 
>> From the Unicode glossary:
>> 
>>> Grapheme. (1) A minimally distinctive unit of writing in the context of a particular writing system.[...] (2) What a user thinks of as a character.
>> 
>>> User-Perceived Character. What everyone thinks of as a character in their script.
>> 
>> […] the definitions are language/locale dependent.
>
> A writing system is (usually) language-dependent, a script is not, 
> although some scripts have been used exclusively (or prominently) in a 
> single writing system with a single language.

It depends, of course, on what exactly you mean by script, and which meaning of the term is intended in the definition of User-Perceived Character. But "a user" is definitely language/locale dependent :-)

> So definition (1) of “grapheme” would be appropriate for linguistics,
> (2) maybe for typography and computer science, but it’s extremely
> vague.

I think that 'grapheme' (2) in the present wording is simply incorrect. I suspect it is not used in the standard at all.

Searching the Unicode site I found only one use of 'grapheme' alone:

http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

        Graphemes are sequences of one or more encoded characters that
        correspond to what users think of as characters.

I guess the intention of 'grapheme' (2) was to describe it without any reference to computer encoding, which is definitely an extremely difficult task.

Best regards

Janusz


-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From christoph.paeper at crissov.de  Mon Sep 19 14:16:50 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Mon, 19 Sep 2016 21:16:50 +0200
Subject: graphemes (was: "textels")
In-Reply-To: <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
Message-ID: <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>

Dalley Mark (South West Commissioning Support) <Mark.Dalley at swcsu.nhs.uk>:
> 
> I think the key phrase is "user-perceived". And you don't need to involve complex scripts either.
> 
> For instance as an English-speaking person, I would perceive the "æ" in "encyclopædia" as being two characters (albeit shoved together somewhat). The argument for this is that the word can equally well be rendered as "encyclopaedia".

If

- encyclopedia
- encyclopædia
- encyclopaedia

are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.
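
For comparison, counting encoded characters does not give the three
spellings the same length. A small Python sketch (the variable names are
mine and purely illustrative) shows the code point counts and that "æ"
(U+00E6) is not decomposed into "ae" by Unicode normalization:

```python
import unicodedata

variants = ["encyclopedia", "encyclop\u00E6dia", "encyclopaedia"]

# Number of code points in each spelling.
counts = {word: len(word) for word in variants}
for word, n in counts.items():
    print(word, n)

# U+00E6 has no canonical or compatibility decomposition, so NFKD leaves it alone.
ae_is_atomic = unicodedata.normalize("NFKD", "\u00E6") == "\u00E6"
print(ae_is_atomic)  # True
```

So any definition under which all three variants have the same number of
graphemes has to be stated at a level above code points and normalization
forms.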

Although linguists often prefer minimal pair analysis, there are some rules of thumb for what is a grapheme:

- … whatever goes into a single box in a crossword puzzle.
- … whatever gets transposed if you reverse a word or generate an anagram.
- … whatever gets capitalized together in the beginning of a word.
   (Some argue that capitalization operates on characters, not graphemes, though.)
- … whatever can never be split up by hyphenation.

From jcb+unicode at inf.ed.ac.uk  Tue Sep 20 02:30:12 2016
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Tue, 20 Sep 2016 08:30:12 +0100 (BST)
Subject: graphemes (was: "textels")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
Message-ID: <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>

On 2016-09-19, Christoph Päper <christoph.paeper at crissov.de> wrote:
> If
>
> - encyclopedia
> - encyclopædia
> - encyclopaedia
>
> are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.

Such a bizarre definition, which would also entail "color/colour",
"fulfill/fulfil", "sulfur/sulphur" having the same number of
graphemes, would break the first three of your rules of thumb:

> - … whatever goes into a single box in a crossword puzzle.
> - … whatever gets transposed if you reverse a word or generate an anagram.
> - … whatever gets capitalized together in the beginning of a word.

and the fourth is pretty dodgy, as it usually contradicts the others

> - … whatever can never be split up by hyphenation.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From christoph.paeper at crissov.de  Tue Sep 20 03:57:57 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 20 Sep 2016 10:57:57 +0200
Subject: graphemes (was: "textels")
In-Reply-To: <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
Message-ID: <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>

Julian Bradfield <jcb+unicode at inf.ed.ac.uk>:
> On 2016-09-19, Christoph Päper <christoph.paeper at crissov.de> wrote:
>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.
> 
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes,

It’s not a bizarre definition at all, but one could also assume two or three different writing systems.

> would break the first three of your rules of thumb:

It would, at least partially.

> and the fourth is pretty dodgy, as it usually contradicts the others
> 
>> - … whatever can never be split up by hyphenation.

It’s not phrased well and it does contradict the other rules of thumb sometimes indeed, but together they often work reasonably well to separate clear cases from questionable ones which are likely to be treated differently by different scholars.

From kenwhistler at att.net  Tue Sep 20 09:37:30 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 20 Sep 2016 07:37:30 -0700
Subject: graphemes
In-Reply-To: <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
Message-ID: <a1d9547e-ed84-fe29-e9b1-d987966ba7fe@att.net>

On 9/20/2016 12:30 AM, Julian Bradfield wrote:

>> are all legal spellings of the same word in a writing system, a useful linguistic definition of grapheme should ensure that all three variants have the same number of graphemes.
> Such a bizarre definition, which would also entail "color/colour",
> "fulfill/fulfil", "sulfur/sulphur" having the same number of
> graphemes, would break the first three of your rules of thumb:
>

I agree with Julian here. Consider also such common alternations as
night/nite, light/lite which are widespread *within* American English 
spelling conventions and don't even raise questions of locale 
differences. Or you/u,  your/ur, which vary on another dimension. If 
every variation in spelling is taken to constitute a distinct writing 
system, simply to preserve the concept of a "grapheme", we would be led 
to conclude that American English has millions of writing systems, 
because of the combinatorics involved.

And the caveat that it is a "legal" spelling is a hinky dodge, 
particularly in the case of English. There isn't any recognized legal 
framework for English spelling. English, she is spelled how people 
decide to spell her -- or perhaps mostly how 2nd grade English teachers 
decide she is spelled.

Even where legal or academic frameworks exist to formally control the 
spelling rules of a language, one should be leery that such rules 
somehow instantiate the identity of graphemes, which are unlikely to be 
the principal matter of concern for those trying to establish the 
spelling rules in the first place.

--Ken


From doug at ewellic.org  Tue Sep 20 11:09:22 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 20 Sep 2016 09:09:22 -0700
Subject: "textels"
Message-ID: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com>

Janusz Bień wrote:

> For me it means that Swift's characters are equivalence classes of the
> set of extended grapheme clusters by canonical equivalence relation.

I still hope we can come to some conclusion on the correct Unicode name
for this concept. I don't think non-Unicode interpretations of terms
like "grapheme" are grounds for throwing out "grapheme cluster," but I
can see that the equivalence class itself is lacking a name.

Note that the Swift definition doesn't say that <00E9> and <0065 0301>
are identical entities, only that the language compares them as equal.
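
The distinction between "identical" and "compares as equal" can be sketched
outside Swift as well. The following Python fragment is only an analogy and
says nothing about how Swift implements its comparison; the helper name
canonically_equal is hypothetical. It compares the two sequences after
normalizing both to NFD:

```python
import unicodedata


def canonically_equal(a, b):
    """True if a and b are canonically equivalent (compared via NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00E9"       # <00E9>
decomposed = "\u0065\u0301"  # <0065 0301>

print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True: same equivalence class
print(len(precomposed), len(decomposed))           # 1 2
```

The two strings remain distinct sequences of code points; only the
comparison treats them as members of one equivalence class.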

--
Doug Ewell | Thornton, CO, US | ewellic.org


From doug at ewellic.org  Tue Sep 20 11:34:25 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 20 Sep 2016 09:34:25 -0700
Subject: Dataset for all ISO639 code sorted by
 =?UTF-8?Q?country/territory=3F?=
Message-ID: <20160920093425.665a7a7059d7ee80bb4d670165c8327d.219e1cf756.wbe@email03.godaddy.com>

Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted
> by country/territory?

As others have pointed out, be careful about how slippery this slope can
get. Everyone has his or her own opinion about how many speakers of
Language X in country Y need to be identified, estimated, or conjectured
in order to say that "language X is spoken in country Y."

> I manage to find a dataset on the website of Ethnologue, though it
> doesn't look like open source, need to check with them exactly how I'm
> allowed to use it:
> http://www.ethnologue.com/codes/download-code-tables

The readme file included in the downloadable zip file makes SIL's terms
very clear. Basically you need to credit SIL as the source of the data,
not change it, and not make the data directly available for others to
download. It's best not to get caught up in "open source" as if any
other terms would make the data totally unusable.

--
Doug Ewell | Thornton, CO, US | ewellic.org


From jsbien at mimuw.edu.pl  Tue Sep 20 23:44:08 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Wed, 21 Sep 2016 06:44:08 +0200
Subject: "textels"
In-Reply-To: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com>
 (Doug Ewell's message of "Tue, 20 Sep 2016 09:09:22 -0700")
References: <20160920090922.665a7a7059d7ee80bb4d670165c8327d.abac7df05c.wbe@email03.godaddy.com>
Message-ID: <86lgylvp47.fsf@mimuw.edu.pl>

On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes:
> Janusz Bień wrote:
>
>> For me it means that Swift's characters are equivalence classes of the
>> set of extended grapheme clusters by canonical equivalence relation.
>
> I still hope we can come to some conclusion on the correct Unicode name
> for this concept. I don't think non-Unicode interpretations of terms
> like "grapheme" are grounds for throwing out "grapheme cluster,"

I agree.

> but I can see that the equivalence class itself is lacking a name.

I'm glad.

>
> Note that the Swift definition doesn't say that <00E9> and <0065 0301>
> are identical entities, only that the language compares them as equal.

I'm fully aware of this.

Best regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From jsbien at mimuw.edu.pl  Wed Sep 21 00:09:41 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Wed, 21 Sep 2016 07:09:41 +0200
Subject: graphemes
In-Reply-To: <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de> ("Christoph
 =?utf-8?Q?P=C3=A4per=22's?= message of "Tue, 20 Sep 2016 10:57:57 +0200")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
 <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
Message-ID: <8660ppvnxm.fsf@mimuw.edu.pl>

On Tue, Sep 20 2016 at 10:57 CEST, christoph.paeper at crissov.de writes:
> Julian Bradfield <jcb+unicode at inf.ed.ac.uk>:
>> On 2016-09-19, Christoph Päper <christoph.paeper at crissov.de> wrote:
>>> If _encyclopedia, encyclopædia, encyclopaedia_ are all legal
>>> spellings of the same word in a writing system, a useful linguistic
>>> definition of grapheme should ensure that all three variants have
>>> the same number of graphemes.
>> 
>> Such a bizarre definition, which would also entail "color/colour",
>> "fulfill/fulfil", "sulfur/sulphur" having the same number of
>> graphemes,
>
> It’s not a bizarre definition at all, but one could also assume two or three different writing systems.
>
>> would break the first three of your rules of thumb:
>
> It would, at least partially.
>
>> and the fourth is pretty dodgy, as it usually contradicts the others
>> 
>>> - … whatever can never be split up by hyphenation.
>
> It’s not phrased well and it does contradict the other rules of thumb
> sometimes indeed, but together they often work reasonably well to
> separate clear cases from questionable ones which are likely to be
> treated differently by different scholars.

Let me recall the issues that started the thread:


On Sun, Sep 18 2016 at 12:26 CEST, jsbien at mimuw.edu.pl writes:
> Quote/Cytat - Christoph Päper <christoph.paeper at crissov.de> (Fri, 16
> Sep 2016, 23:51:38):
>
>> Janusz S. Bień <jsbien at mimuw.edu.pl>:
>>>
>>> 1. Graphemes, if I understand correctly, are language dependent, …
>>
>> That’s true in linguistic terminology – well, at least within the
>> more popular schools of thought –, but not in technical (i.e.
>> Unicode) jargon.

And what is "grapheme" in "technical (i.e. Unicode) jargon"?

>
> From the Unicode glossary:
>
> Grapheme. (1) A minimally distinctive unit of writing in the context
> of a particular writing system.[...] (2) What a user thinks of as a
> character.
>
> As for (2), cf.
>
> User-Perceived Character. What everyone thinks of as a character in
> their script.
>
> So we have "a user" versus "everyone...in their script" - is the
> difference intentional? Probably not. Anyway the definitions are
> language/locale dependent.

Does 'Grapheme' (2) make sense with "a (single?) user"? 

BTW, it is rather well known that the term "phoneme" was first proposed
by the Polish linguist Jan Niecisław Ignacy Baudouin de Courtenay (13
March 1845 – 3 November 1929), cf. e.g.
https://en.wikipedia.org/wiki/Jan_Baudouin_de_Courtenay.  It is much
less known that he also proposed the term "grapheme". Let me quote
Alexander Berg's "English Historical Linguistics vol. I" page 230 from
Google Books:

       Since the introduction of the term grapheme by Baudouin de
       Courtenay in 1901 (Ruszkiewicz 1976:24-37, 1981 [1978], 20-34),
       it has been defined in various ways:

       [...]

       As can be seen from these quotations, the available definitions
       can be divided into two groups, corresponding to two main senses,
       and reflecting "conflicting linguistics views of the status of
       writing" (Henderson 1985:142):

       1. a letter or cluster of letters referring to or corresponding with a
       single phoneme;

       2. the minimal distinctive unit of a writing system.

For me the first meaning (not mentioned at all in English Wikipedia) is
the primary, i.e. more useful, meaning, as it has some practical
applications, e.g. for describing Polish hyphenation rules.

Best regards

Janusz

-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From jameson.quinn at gmail.com  Thu Sep 22 00:47:40 2016
From: jameson.quinn at gmail.com (Jameson Quinn)
Date: Thu, 22 Sep 2016 01:47:40 -0400
Subject: Draft proposal for Mayan numerals
Message-ID: <CAO82iZxgYo8A5fAdTqq24cuyGfH2x5zcS1AEiEpxCoLGX_1cWg@mail.gmail.com>

Attached is my draft of a proposal for including Mayan numerals in unicode.
I intend to finish and submit this proposal before October 1. Comments are
welcome.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mayan numerals.nobills.odt
Type: application/vnd.oasis.opendocument.text
Size: 90034 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160922/7d8c4ae0/attachment.odt>

From lang.support at gmail.com  Mon Sep 26 01:23:01 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Mon, 26 Sep 2016 16:23:01 +1000
Subject: Myanmar Scripts and Languages FAQ
In-Reply-To: <CAGJ7U-V1nKgPPFJV4kqkGxb7KV+-6Gtq2ROLYGYkTnYT6=PLsA@mail.gmail.com>
References: <CAGJ7U-W-y=MNk2Lwt3F6kMe0EQ_9qjkktwK+WGXaw5+KEDoo8w@mail.gmail.com>
 <CAGJ7U-VE9gA74SmDg7m_exSWTk1AA4y7vAynep2mDyVMOjCnQA@mail.gmail.com>
 <CAGJ7U-UFpETrueckbZzo+sBPuAcr36JeNULcO-k-nH=BE9s0zw@mail.gmail.com>
 <CAGJ7U-Vg_HgDweyfa-EECZ7m8oS-3k7WjgncAQXKkXs0DHiC=g@mail.gmail.com>
 <CAGJ7U-VeDmcZ4N2FKY=V5Mb=xrEvRHFRC4QMxPG2PgmS7zgKhw@mail.gmail.com>
 <CAGJ7U-XZt8yqsi4sirB2mU4whn3bzqhf5Oh=TQyK1YSVXaTKrw@mail.gmail.com>
 <CAGJ7U-WsXzJUcONsmsOp=jY2xEVzfBn3Mm3mxO93msq-muOFNg@mail.gmail.com>
 <CAGJ7U-WO7q=E=_GsUJWuPubs+it5ZF54JbA753BkV5p5S4fZeg@mail.gmail.com>
 <CAGJ7U-V1nKgPPFJV4kqkGxb7KV+-6Gtq2ROLYGYkTnYT6=PLsA@mail.gmail.com>
Message-ID: <CAGJ7U-V6LaS008xZDDPhjDOOpjFxP-otGy1OuZTx64WM3i-N2g@mail.gmail.com>

Hi,

I just finished looking at the Myanmar Scripts and Languages FAQ.

A few comments.

Most of the questions and answers are specific to the Myanmar (Burmese)
language.

When discussing the ad hoc fonts, it would be useful to indicate that the
ones already mentioned are Burmese specific, and that each of the major
languages has its own ad hoc font(s). Mon,  Shan and Sgaw Karen & Western
Pwo Karen have their own specific fonts.

It is also worth warning that most detectors and converters are language
specific. If your data has content in a range of Myanmar script languages,
the results from such detectors and converters will be less than ideal.

Andrew

From christoph.paeper at crissov.de  Tue Sep 27 09:28:15 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Tue, 27 Sep 2016 16:28:15 +0200
Subject: graphemes
In-Reply-To: <8660ppvnxm.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
 <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
 <8660ppvnxm.fsf@mimuw.edu.pl>
Message-ID: <DB061407-E33D-4528-B60C-2DDBD50CE02B@crissov.de>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160927/e8382fab/attachment.html>

From jsbien at mimuw.edu.pl  Tue Sep 27 23:59:24 2016
From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=)
Date: Wed, 28 Sep 2016 06:59:24 +0200
Subject: graphemes
In-Reply-To: <DB061407-E33D-4528-B60C-2DDBD50CE02B@crissov.de> ("Christoph
 =?utf-8?Q?P=C3=A4per=22's?= message of "Tue, 27 Sep 2016 16:28:15 +0200")
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
 <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
 <8660ppvnxm.fsf@mimuw.edu.pl>
 <DB061407-E33D-4528-B60C-2DDBD50CE02B@crissov.de>
Message-ID: <861t04wrf7.fsf@mimuw.edu.pl>


I already wrote:

On Mon, Sep 19 2016 at  8:40 CEST, jsbien at mimuw.edu.pl writes:

[...]

> Searching the Unicode site I found only one use of 'grapheme' alone:
>
> http://www.unicode.org/L2/L2000/00274-N2236-grapheme-joiner.htm

Is anybody aware of any other occurrences?

On Tue, Sep 27 2016 at 16:28 CEST, christoph.paeper at crissov.de writes:
> Janusz S. Bie? <jsbien at mimuw.edu.pl>:
>
>     On Sun, Sep 18 2016 at 12:26 CEST, jsbien at mimuw.edu.pl writes:
>     
>     Quote/Cytat - Christoph P?per <christoph.paeper at crissov.de> (pi?,
>         16
>         wrz 2016, 23:51:38):
>         
>                 Janusz S. Bie? <jsbien at mimuw.edu.pl>:
>             
>             
>                 1. Graphemes, if I understand correctly, are language
>                 dependent, ?
>                 
>
>             That’s true in linguistic terminology – … –, but not in
>             technical (i.e.
>             Unicode) jargon.
>             
>
>     And what is "grapheme" in "technical (i.e. Unicode) jargon"?
>     
>
> It depends on the script (hence Unicode block), but not the writing
> system or language. The line is not always drawn consistently.

Please prove this claim by explicit quotations from the standard.

In my opinion there is no such thing as "grapheme" in "technical
(i.e. Unicode) jargon".

>
>     From the Unicode glossary:
>         
>         Grapheme. […] (2) What a user thinks of as a character.
>         
>         User-Perceived Character. What everyone thinks of as a
>         character in their script.
>         
>
>     Does 'Grapheme' (2) make sense with "a (single?) user"? 
>     
>
> No linguistic term makes sense with only a *single* user
> (“Privatsprache”).

That's obvious.

> It’s a very vague definition, but not quite
> incorrect for “a typical user”.

Exactly - "a typical user" is quite different from "a user". Do we agree
that the wording of "grapheme" (2) should be corrected?

>
>     BTW, it is rather well known that the term "phoneme" was proposed
>     first by a Polish linguist Jan Niecisław Ignacy Baudouin de
>     Courtenay (…). It is much less known that he proposed also the term
>     "grapheme".
>     
>
> Yes, he introduced both terms, but the definitions have changed quite
> a bit through history and among schools. Entire books have been
> published about that, e.g. (in German) Manfred Kohrt (1985):
> “Problemgeschichte des Graphembegriffs und des frühen Phonembegriffs”
> (ISBN 3-484-31061-8) – I wish I knew a more recent one.
>

The question is whether all these linguistic discussions are relevant to
Unicode.

>     Alexander Berg's "English Historical Linguistics vol. I" page 230
>     […]:
>     
>     […] the available definitions [of “grapheme”]
>     can be divided into two groups, corresponding to two main senses,
>     and reflecting "conflicting linguistics views of the status of
>     writing" (Henderson 1985:142):
>     
>     1. a letter or cluster of letters referring to or corresponding
>     with a
>     single phoneme;
>     
>     2. the minimal distinctive unit of a writing system.
>     
>     For me the first meaning (…) is the primary, i.e. more useful,
>     meaning, as it has some practical applications e.g. for describing
>     Polish hyphenation rules.
>     
>
> Type 1 has also been called “phono-graphemes” (with or without the
> hyphen). 

Seems a good term, I was not aware of it. Do you happen to remember who
introduced it?

>
> The conflicting views quoted from the 30 years old work by Henderson
> still exist.

There is no doubt about it.


> Many scholars – yourself included, it seems – infer a
> structural primacy of spoken language over written language from its
> historic primacy.

I do not, but it is completely irrelevant to the problem of the Unicode
use of the "grapheme" term.

Best regards

Janusz


-- 
                           ,   
Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/


From christoph.paeper at crissov.de  Wed Sep 28 03:24:34 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Wed, 28 Sep 2016 10:24:34 +0200
Subject: graphemes
In-Reply-To: <861t04wrf7.fsf@mimuw.edu.pl>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl> <83r38l55gt.fsf@gnu.org>
 <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
 <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
 <8660ppvnxm.fsf@mimuw.edu.pl>
 <DB061407-E33D-4528-B60C-2DDBD50CE02B@crissov.de>
 <861t04wrf7.fsf@mimuw.edu.pl>
Message-ID: <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de>

Janusz S. Bień <jsbien at mimuw.edu.pl>:
> On Tue, Sep 27 2016 at 16:28 CEST, christoph.paeper at crissov.de writes:
>>> And what is "grapheme" in "technical (i.e. Unicode) jargon"?
>> 
>> It depends on the script (hence Unicode block), but not the writing
>> system or language. The line is not always drawn consistently.
> 
> Please prove this claim by explicit quotations from the standard.

I’ll try another day.

> In my opinion there is no such thing as "grapheme" in "technical
> (i.e. Unicode) jargon".

Even if it’s not used explicitly, it’s still there implicitly in compounds like “grapheme joiner” or “grapheme cluster”.

> Do we agree that the wording of "grapheme" (2) should be corrected?

We do.

> The question is whether all these linguistic discussions are relevant to
> Unicode.

Probably not worth it at this stage with all the legacy baggage, e.g. regarding “ideographs”, but a sound linguistic foundation would have been nice, even if it’s primarily a technical standard. Alas, since there is so much disagreement among scholars, e.g. regarding “alphasyllabaries”, stuff would probably never have gotten done. Engineers are usually better at this than scientists (or politicians).

>> Type 1 has also been called “phono-graphemes” (…). 
> 
> Seems a good term, I was not aware of it. Do you happen to remember who
> introduced it?

My oldest quote is from Heller 1980, but I think it was introduced earlier (maybe by Gelb). McLaughlin 1963 proposes “graphoneme”. The terms are not very common, probably because everyone just uses their definition of “grapheme”.

JFTR, Daniels/Bright 1999 state with resignation:

> *grapheme*
> term intended to designate a unit of a writing system, parallel to phoneme and morpheme, 
> but in practice used as a synonym for letter, diacritic, character (2), or sign (2)


From verdy_p at wanadoo.fr  Wed Sep 28 05:41:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Sep 2016 12:41:07 +0200
Subject: graphemes
In-Reply-To: <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de>
References: <BL2PR09MB1011B794E2BF809425F75B01C8F00@BL2PR09MB1011.namprd09.prod.outlook.com>
 <CAGa7JC0rwq4nrcBBo_XodjLscXHvwfbKW4GJ8heYudyMNHz1sQ@mail.gmail.com>
 <7CDE800C-C476-4E37-9874-5F4EB293A9C3@gmail.com>
 <86poo5f03u.fsf_-_@mimuw.edu.pl>
 <83r38l55gt.fsf@gnu.org> <86r38llyxb.fsf@mimuw.edu.pl>
 <7928A265-3BCF-4EA6-AA29-93A29CB5CE10@crissov.de>
 <20160918122626.203270r5aukqdm8i@mail.mimuw.edu.pl>
 <23801085-0F4F-4745-A0C6-AF07359D62C6@crissov.de>
 <86d1k05r5m.fsf_-_@mimuw.edu.pl>
 <A863D0F6E4C78C4C83899C39BFA28DD40127EC11AD@XSW-01-MBX04.XSWHealth.nhs.uk>
 <4F476E6E-DCCB-4E4A-B0FF-1352FDF5679C@crissov.de>
 <slrnnu1pc4.fns.jcb@home.stevens-bradfield.com>
 <657667FF-A230-4ABC-8F97-16EC1BC77D5E@crissov.de>
 <8660ppvnxm.fsf@mimuw.edu.pl>
 <DB061407-E33D-4528-B60C-2DDBD50CE02B@crissov.de>
 <861t04wrf7.fsf@mimuw.edu.pl>
 <314E5730-D6C1-4210-821D-1D7D4C2028BD@crissov.de>
Message-ID: <CAGa7JC3uLCJsZ0qEzNBNo0GebPcoVLNjFKbXM7Mfwgtf-QkLBg@mail.gmail.com>

2016-09-28 10:24 GMT+02:00 Christoph Päper <christoph.paeper at crissov.de>:

>
> My oldest quote is from Heller 1980, but I think it was introduced earlier
> (maybe by Gelb). McLaughlin 1963 proposes “graphoneme”. The terms are not
> very common, probably because everyone just uses their definition of
> “grapheme”.
>
> > *grapheme*
> > term intended to designate a unit of a writing system, parallel to
> > phoneme and morpheme,
> > but in practice used as a synonym for letter, diacritic, character (2),
> > or sign (2)
>

IMHO, the term grapheme only applies (traditionally) to the written
**form**, i.e. the **graphic** item which can be clearly separated from
others (even if there's some joining). So a grapheme may as well represent
several logical letters (as they are spelled orally). Some ligatures are
mandatory in the written form of a script, and the grapheme represents the
set of graphical variations that will be read the same in a language (in
fact what Unicode may also designate as "confusable characters").

So the grapheme for A does not really differentiate the Latin, Greek and
Cyrillic versions, even if, when analyzing them in a linguistic context,
these letters are read differently ("a" vs. "alpha", which is in fact not
really a distinction of the script but of the linguistic tradition of
alphabets as spelled in the vocal language), and the graphemes do not
have any case pairings, which are part of the semantics of the script as used
for the orthography of a given language. But in the vocal language the case
distinctions are almost never relevant. The written form adds some
distinctions while still carrying the initial semantics of the language. This
makes scripts (or more exactly writing systems) more complex to map within
a unified universal encoding.

Graphemes are then weaker definitions of what Unicode encodes as abstract
characters (to map on them additional properties that are not relevant at
the grapheme level but useful to parse the semantic of a complete text).

The abstract characters in Unicode do not distinguish some letter forms
even if traditionally the scripts and their associated writing systems for
a language make clear distinctions: a "Fraktur Latin" letter A is a
distinct "grapheme" from the modern cursive letter A even if they map to
the same Unicode abstract character (as a result of unification), but the
grapheme for the modern cursive letter A is the same between Latin,
Cyrillic and Greek scripts.

There are however significant differences when handling diacritics (e.g.
the diaeresis in German works very differently as an umlaut in the Fraktur
script than in the current modern script and really acts as a plain
distinct letter: the grapheme differences are exposed in this case even if
the Unicode-encoded letters unify them; and even logically when spelling
them vocally there's a clear difference between the diaeresis as used in
French or English and the umlaut used in German and several other
Central-European languages).

So I think that the term "grapheme" cannot be formally defined in Unicode,
it does not match anything with what's encoded. What is encoded is the
possibility to represent "grapheme clusters" (the set of graphical forms
which are minimally distinguished but not minimally separated in a specific
language) and map them with a sequence of Unicode-encoded "abstract
characters" (whose individual identity does not match exactly the
traditional graphemes, and are also detached from the perceived
distinctions of writing systems in a specific language).

Unicode cannot then formally define what a "grapheme" is. It can only give a
definition of "grapheme clusters", but it is mostly based on its own
definitions of properties (which are also not sufficient to carry all
distinctions for any given language in its writing systems). "Grapheme
clusters" in Unicode are also not required to have a significant graphic
form; they purely exist at the semantic level directly from their encoding and
can be used to generate other renderings (e.g. they can be rendered vocally,
or used to derive some other semantics, such as values of numbers, word
breaking...) or to infer some grammatical/orthographic rules to compose or
generate other texts.

In summary, there's NO "grapheme" (in isolation) in Unicode and I think it
should not be defined: it would break expectations on languages, and the
universal repertoire does not encode specific languages, nor even any
specific writing system (the scripts in Unicode are NOT writing systems,
which are always dependent on the language using them, and also dependent
on the epoch and geographic area of use, for their working
rules/conventions).

So the "grapheme" *may* be used (contextually) as a letter, a diacritic, a
sign, or even a ligature (the ligature is not just contextual when it is
mandated by the writing system and adds some semantic distinctions,
depending on whether it is used or not; it's not just a question of "user
preferences" or "font styles"), or any combination of these, up to the
complete combination of what Unicode calls a "grapheme cluster" (the only
thing really encodable with one or more abstract characters).

From a.lukyanov at yspu.org  Wed Sep 28 02:59:05 2016
From: a.lukyanov at yspu.org (a.lukyanov)
Date: Wed, 28 Sep 2016 10:59:05 +0300
Subject: IJ with accent
Message-ID: <57EB7849.3070908@yspu.org>

Dutch language writing uses the ligature ĳ (U+0132, U+0133). When 
accented, it should take an accent on each component, like this:


If one uses two separate characters (i+j), one can put an accent on each 
character (íj́).

However, if the monolithic ligature ĳ is used, how can one accent it 
correctly? The Unicode Standard does not answer this.

Probably one should use the sequence U+0133 U+0301, with the accent 
doubling automatically, but this is not implemented (ĳ́).


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 220px-Bijna.png
Type: image/png
Size: 3608 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160928/17f21692/attachment.png>

From verdy_p at wanadoo.fr  Wed Sep 28 11:16:27 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Sep 2016 18:16:27 +0200
Subject: IJ with accent
In-Reply-To: <57EB7849.3070908@yspu.org>
References: <57EB7849.3070908@yspu.org>
Message-ID: <CAGa7JC2V92LZ25mVje345jOsuRKwyVC2w6iQW3t+HzSYw8RExg@mail.gmail.com>

There's a double acute accent which you could use on the ij ligature. But
it causes search problems when the ij ligature is separable, giving <i> then
<j,double acute> (the double acute accent is not decomposable).

My opinion is to put an accent on each letter and join them with a joiner,
either as <i,acute,ZWJ,j,acute>, or <i with acute,ZWJ,j with acute> (which
works with canonical equivalences and collations, and should work in renderings
to instruct their ligature and the absence of syllable break between both
letters), just like <i,ZWJ,j> should render like <ij> to produce the same
unbreakable ligature.
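
A minimal sketch, assuming Python's standard `unicodedata` module, of the canonical-equivalence claim for the two spellings above. It checks normalization only; whether a renderer actually ligates across ZWJ is a separate question that this does not test. Note that Unicode has no precomposed j with acute, so only the i can be precomposed in the second spelling.

```python
import unicodedata

ZWJ = "\u200d"    # ZERO WIDTH JOINER
ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# <i, acute, ZWJ, j, acute>
decomposed = "i" + ACUTE + ZWJ + "j" + ACUTE
# <i with acute, ZWJ, j, acute>; the j keeps a combining accent
precomposed = "\u00ed" + ZWJ + "j" + ACUTE

nfc_a = unicodedata.normalize("NFC", decomposed)
nfc_b = unicodedata.normalize("NFC", precomposed)
nfd_a = unicodedata.normalize("NFD", decomposed)
nfd_b = unicodedata.normalize("NFD", precomposed)

# Two strings are canonically equivalent when their normalized forms match.
canonically_equivalent = (nfc_a == nfc_b) and (nfd_a == nfd_b)
print(canonically_equivalent)
print([f"U+{ord(c):04X}" for c in nfc_a])
```

The ZWJ survives both normalization forms, so any search or collation that should ignore it has to do so explicitly.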

2016-09-28 9:59 GMT+02:00 a.lukyanov <a.lukyanov at yspu.org>:

> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When
> accented, it should take an accent on each component, like this:
>
>
>
> If one uses two separate characters (i+j), one can put an accent on each
> character (íj́).
>
> However, if the monolithic ligature ĳ is used, how can one accent it
> correctly? The Unicode Standard does not answer this.
>
> Probably one should use the sequence U+0133 U+0301, with the accent
> doubling automatically, but this is not implemented (ĳ́).
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 220px-Bijna.png
Type: image/png
Size: 3608 bytes
Desc: not available
URL: <http://unicode.org/pipermail/unicode/attachments/20160928/6fd18964/attachment.png>

From markus.icu at gmail.com  Wed Sep 28 13:36:19 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 28 Sep 2016 11:36:19 -0700
Subject: IJ with accent
In-Reply-To: <CAGa7JC2V92LZ25mVje345jOsuRKwyVC2w6iQW3t+HzSYw8RExg@mail.gmail.com>
References: <57EB7849.3070908@yspu.org>
 <CAGa7JC2V92LZ25mVje345jOsuRKwyVC2w6iQW3t+HzSYw8RExg@mail.gmail.com>
Message-ID: <CAN49p6rgsfQ+ekJBaXFWNyFSYzrf2BAO-kepgR2tmcmzdH--sQ@mail.gmail.com>

On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> My opinion is to put an accent on each letter and join them with a joiner
>

I don't see a reason for the joiner.

markus

From verdy_p at wanadoo.fr  Wed Sep 28 13:55:33 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Sep 2016 20:55:33 +0200
Subject: IJ with accent
In-Reply-To: <CAN49p6rgsfQ+ekJBaXFWNyFSYzrf2BAO-kepgR2tmcmzdH--sQ@mail.gmail.com>
References: <57EB7849.3070908@yspu.org>
 <CAGa7JC2V92LZ25mVje345jOsuRKwyVC2w6iQW3t+HzSYw8RExg@mail.gmail.com>
 <CAN49p6rgsfQ+ekJBaXFWNyFSYzrf2BAO-kepgR2tmcmzdH--sQ@mail.gmail.com>
Message-ID: <CAGa7JC2NgYcC_r9mAdRutbdLtH3ZWuhi8JfgThkFRrtjCZjqTg@mail.gmail.com>

Technically I see one, as bíj́na should never break between í and j́, and
they should remain ligated (or their kerning kept), even if interletter
spacing is enabled (that's why the letter is frequently rendered also as
"?". When converting to CAPITALS, they form a ligature looking more like ?
(with the left arm broken). Adding the accents, this looks like "y" plus a
double acute, or like "U" with double acute and the broken left bar, no
additional spacing should be inserted).

Without the joiner, there's nothing to prohibit the normal negative kerning
to be removed and spacing to be inserted.

When using monospaced fonts, both characters should also occupy the same
cell (just like a "y" or "U"), not two (normal rendering without the joiner)

2016-09-28 20:36 GMT+02:00 Markus Scherer <markus.icu at gmail.com>:

> On Wed, Sep 28, 2016 at 9:16 AM, Philippe Verdy <verdy_p at wanadoo.fr>
> wrote:
>
>> My opinion is to put an accent on each letter and join them with a joiner
>>
>
> I don't see a reason for the joiner.
>
> markus
>

From doug at ewellic.org  Wed Sep 28 14:30:04 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 28 Sep 2016 12:30:04 -0700
Subject: IJ with accent
Message-ID: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com>

> Technically I see one, as bíj́na should never break between í and j́,

These wor-
ds should not bre-
ak at the places wh-
ere I have broken t-
hem

but they don't need embedded control characters to enforce that. 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From everson at evertype.com  Wed Sep 28 14:33:23 2016
From: everson at evertype.com (Michael Everson)
Date: Wed, 28 Sep 2016 12:33:23 -0700
Subject: IJ with accent
In-Reply-To: <57EB7849.3070908@yspu.org>
References: <57EB7849.3070908@yspu.org>
Message-ID: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com>

The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature. 
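
As a rough illustration (a sketch using Python's standard `unicodedata` module, not a statement about any particular font or shaping engine), the encoded sequence stays a ligature followed by one combining mark under NFC, so the doubling of the accent is left entirely to the font. Under compatibility decomposition the ligature falls apart into i and j, and the single acute ends up on the j alone.

```python
import unicodedata

# LATIN SMALL LIGATURE IJ followed by COMBINING ACUTE ACCENT
seq = "\u0133\u0301"

def code_points(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

nfc = unicodedata.normalize("NFC", seq)    # canonical composition
nfkd = unicodedata.normalize("NFKD", seq)  # compatibility decomposition

print(code_points(nfc))   # U+0133 U+0301
print(code_points(nfkd))  # U+0069 U+006A U+0301
```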

> On 28 Sep 2016, at 00:59, a.lukyanov <a.lukyanov at yspu.org> wrote:
> 
> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component, like this:
> 
> <220px-Bijna.png>
> 
> If one uses two separate characters (i+j), one can put an accent on each character (íj́).
> 
> However, if the monolithic ligature ĳ is used, how can one accent it correctly? The Unicode Standard does not answer this.
> 
> Probably one should use the sequence U+0133 U+0301, with the accent doubling automatically, but this is not implemented (ĳ́).
> 
> 
> 


From ruland at luckymail.com  Wed Sep 28 14:54:14 2016
From: ruland at luckymail.com (Charlie Ruland)
Date: Wed, 28 Sep 2016 21:54:14 +0200
Subject: IJ with accent
In-Reply-To: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com>
References: <57EB7849.3070908@yspu.org>
 <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com>
Message-ID: <4d60bd08-da74-7d54-ced2-52777616f543@luckymail.com>

Brill fonts <http://www.brill.com/about/brill-fonts> (designed by John 
Hudson and © by Koninklijke Brill NV) draw Ĳ́ and ĳ́ with two acute accents.


> The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature.
>

From richard.wordingham at ntlworld.com  Wed Sep 28 15:48:14 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 28 Sep 2016 21:48:14 +0100
Subject: IJ with accent
In-Reply-To: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com>
References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com>
Message-ID: <20160928214814.67bf3e87@JRWUBU2>

On Wed, 28 Sep 2016 12:30:04 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> > Technically I see one, as bíj́na should never break between í and
> > j́,
> 
> These wor-
> ds should not bre-
> ak at the places wh-
> ere I have broken t-
> hem
> 
> but they don't need embedded control characters to enforce that. 

Indeed, there aren't any control characters to control hyphenation.
Indeed, CGJ between default grapheme clusters is often a very good
place to hyphenate.

Richard.


From verdy_p at wanadoo.fr  Wed Sep 28 16:22:34 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 28 Sep 2016 23:22:34 +0200
Subject: IJ with accent
In-Reply-To: <20160928214814.67bf3e87@JRWUBU2>
References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com>
 <20160928214814.67bf3e87@JRWUBU2>
Message-ID: <CAGa7JC1nKT+XYSixpmSbziG_NPEB-sb_V=_GYra4XgydbdkH5w@mail.gmail.com>

2016-09-28 22:48 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Wed, 28 Sep 2016 12:30:04 -0700
> "Doug Ewell" <doug at ewellic.org> wrote:
>
> > > Technically I see one, as bíj́na should never break between í and
> > > j́,
> >
> > These wor-
> > ds should not bre-
> > ak at the places wh-
> > ere I have broken t-
> > hem
> >
> > but they don't need embedded control characters to enforce that.
>
> Indeed, there aren't any control characters to control hyphenation.
> Indeed, CGJ between default grapheme clusters is often a very good
> place to hyphenate.
>

Who said anything about CGJ?

But zero-width joiners should prevent such undesired breaking; the legacy
ZWNBSP however does not suggest any ligature but instead will prevent it,
by only gluing two grapheme clusters side by side (with just kerning
enabled), but without altering these glyphs (like in the capital IJ
ligature whose I is shortened and placed on top of the left arm of the J
when using ligaturing joiners).

In South-East Asian scripts there are such cases to create complex clusters
that also carry semantic distinctions and layout restrictions. The "default
grapheme clusters" may not include these complex clusters, but the latter
are needed. The rules about "default grapheme clusters" are only good for
simpler cases where no ligaturing is involved and you don't really care
about specific languages (even fonts contain specific data for specific
languages, independently of the script represented).

From alex.plantema at xs4all.nl  Wed Sep 28 17:12:52 2016
From: alex.plantema at xs4all.nl (Alex Plantema)
Date: Thu, 29 Sep 2016 00:12:52 +0200
Subject: IJ with accent
References: <57EB7849.3070908@yspu.org>
Message-ID: <A6BE65489BBF417F80B8EEDFFF0E4F36@p4>

Op woensdag 28 september 2016 09:59 schreef a.lukyanov:

> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component, like this:
>
> If one uses two separate characters (i+j), one can put an accent on each character (íj́).
> However, if the monolithic ligature ĳ is used, how can one accent it correctly? The Unicode Standard does not answer this.
> Probably one should use the sequence U+0133 U+0301, with the accent doubling automatically, but this is not implemented (ĳ́).

I've never seen an ij with an accent. You can safely assume it's never needed.

Alex.


From everson at evertype.com  Wed Sep 28 17:20:54 2016
From: everson at evertype.com (Michael Everson)
Date: Wed, 28 Sep 2016 15:20:54 -0700
Subject: IJ with accent
In-Reply-To: <A6BE65489BBF417F80B8EEDFFF0E4F36@p4>
References: <57EB7849.3070908@yspu.org> <A6BE65489BBF417F80B8EEDFFF0E4F36@p4>
Message-ID: <441A66B1-4D94-431C-8223-9F16097B1A5F@evertype.com>

On 28 Sep 2016, at 15:12, Alex Plantema <alex.plantema at xs4all.nl> wrote:

> I've never seen an ij with an accent. You can safely assume it's never needed.

I’ve had people request that I add support for it to Everson Mono, so I safely assume that it’s sometimes needed. ;-)

Michael

From richard.wordingham at ntlworld.com  Wed Sep 28 17:39:54 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 28 Sep 2016 23:39:54 +0100
Subject: IJ with accent
In-Reply-To: <CAGa7JC1nKT+XYSixpmSbziG_NPEB-sb_V=_GYra4XgydbdkH5w@mail.gmail.com>
References: <20160928123004.665a7a7059d7ee80bb4d670165c8327d.09c24ce131.wbe@email03.godaddy.com>
 <20160928214814.67bf3e87@JRWUBU2>
 <CAGa7JC1nKT+XYSixpmSbziG_NPEB-sb_V=_GYra4XgydbdkH5w@mail.gmail.com>
Message-ID: <20160928233954.2ba6b3de@JRWUBU2>

On Wed, 28 Sep 2016 23:22:34 +0200
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2016-09-28 22:48 GMT+02:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
> 
> > On Wed, 28 Sep 2016 12:30:04 -0700
> > "Doug Ewell" <doug at ewellic.org> wrote:
> >
> > > > Technically I see one, as bíj́na should never break between í
> > > > and j́,
> > >
> > > These wor-
> > > ds should not bre-
> > > ak at the places wh-
> > > ere I have broken t-
> > > hem
> > >
> > > but they don't need embedded control characters to enforce that.
> >
> > Indeed, there aren't any control characters to control hyphenation.
> > Indeed, CGJ between default grapheme clusters is often a very good
> > place to hyphenate.
> >
> 
> Who told about CGJ ?
> 
> But zero-width joiners should prevent such undesired breaking ; the
> legacy ZWNBSP however does not suggest any ligature but instead will
> prevent it, by only gluing two grapheme clusters side by side (with
> just kerning enabled), but without altering these glyphs (like in the
> capital IJ ligature whose I is shortened and placed on top of the
> left arm of the J when using ligaturing joiners).

If you could be bothered to read the Unicode standard annexes and the
character database (UCD), you would note that ZWJ (let alone ZWNJ) has
no effect on line-breaking, except with emoji and ideographs.  In
addition to the UCD, a statement to this effect can be found in TUS
23.2 'Layout Controls'. Indeed, the only character that is described as
having an effect on a hyphenator, and that is only described as a
convention (TR14 Line-Breaking, Section 5.4), is U+00AD SOFT HYPHEN.

So far as Unicode is concerned, there is no other plain text control
over hyphenators.

> In South-Est Asian scripts there are such cases to create complex
> clusters that also carry semantic distinctions and layout
> restrictions.

The only semantic distinction available is the forcing of word
boundaries.

Richard.


From kent.karlsson14 at telia.com  Wed Sep 28 17:59:23 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Thu, 29 Sep 2016 00:59:23 +0200
Subject: IJ with accent
In-Reply-To: <A6BE65489BBF417F80B8EEDFFF0E4F36@p4>
Message-ID: <D41217EB.38E12%kent.karlsson14@telia.com>


Den 2016-09-29 00:12, skrev "Alex Plantema" <alex.plantema at xs4all.nl>:

> Op woensdag 28 september 2016 09:59 schreef a.lukyanov:
> 
>> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented,
>> it should take an accent on each component, like this:
>> 
>> If one uses two separate characters (i+j), one can put an accent on each
>> character (íj́).
>> However, if the monolithic ligature ĳ is used, how can one accent it correctly?
>> The Unicode Standard does not answer this.
>> Probably one should use the sequence U+0133 U+0301, with the accent doubling
>> automatically, but this is not implemented (ĳ́).
> 
> I've never seen an ij with an accent. You can safely assume it's never needed.

See 
https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken

/K

> Alex.
> 


From kent.karlsson14 at telia.com  Wed Sep 28 18:12:43 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Thu, 29 Sep 2016 01:12:43 +0200
Subject: IJ with accent
In-Reply-To: <20160928214814.67bf3e87@JRWUBU2>
Message-ID: <D4121B0B.38E18%kent.karlsson14@telia.com>


Den 2016-09-28 22:48, skrev "Richard Wordingham"
<richard.wordingham at ntlworld.com>:

> On Wed, 28 Sep 2016 12:30:04 -0700
> "Doug Ewell" <doug at ewellic.org> wrote:
> 
>>> Technically I see one, as bíj́na should never break between í and
>>> j́,
>> 
>> These wor-
>> ds should not bre-
>> ak at the places wh-
>> ere I have broken t-
>> hem
>> 
>> but they don't need embedded control characters to enforce that.
> 
> Indeed, there aren't any control characters to control hyphenation.

Well, there is SOFT HYPHEN, as you yourself noted later.

There is also
0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;

"NBH is used to indicate a point where a line break shall not occur when
text is formatted."

But that is in the C1 area, most of which nearly no-one implements...
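
For readers unfamiliar with the UnicodeData.txt line format quoted above, here is a small sketch that splits the line into its semicolon-separated fields and cross-checks two of them against Python's standard `unicodedata` module. The variable names are illustrative shorthand for the field positions used here.

```python
import unicodedata

# The UnicodeData.txt entry quoted in the message above.
line = "0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;"
fields = line.split(";")

code_point = int(fields[0], 16)   # field 0: code point in hexadecimal
general_category = fields[2]      # field 2: General_Category
bidi_class = fields[4]            # field 4: Bidi_Class
old_name = fields[10]             # field 10: Unicode 1.0 name

ch = chr(code_point)
print(len(fields), general_category, bidi_class, old_name)

# Control characters have no formal character name, so name() needs a default.
print(repr(unicodedata.name(ch, "")))
print(unicodedata.category(ch), unicodedata.bidirectional(ch))
```

"NO BREAK HERE" therefore appears only in the legacy-name field, which fits the remark that few implementations act on it.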

/K

> Indeed, CGJ between default grapheme clusters is often a very good
> place to hyphenate.
> 
> Richard.
> 


From junichi.chiba.bps at gmail.com  Wed Sep 28 22:13:02 2016
From: junichi.chiba.bps at gmail.com (Junichi Chiba)
Date: Thu, 29 Sep 2016 03:13:02 +0000
Subject: Dates in Japanese Era Names in Unicode Standard
Message-ID: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>

Dear all,

Nice to e-meet you.

I'm looking at the latest Unicode Standard [1] listing the dates for
Japanese Era Names in Table 22-8.
What I noticed is the begin and end dates for each era.
They seem to differ by one day from the dates that are recognized
publicly in Japan.
For example, the current Heisei era actually started on January 8th, 1989,
after Showa ended on January 7th, 1989.

However, the Unicode Standard says in Table 22-8:
U+337B square era name heisei 1989-01-07 to present day
U+337C square era name syouwa 1926-12-24 to 1989-01-06

Looking at Wikipedia in Japanese [2] and English [3], you can see exact
dates for Syouwa end and Heisei start.
Is the difference between this description and the official dates
intentional?
Is the date counted according to GMT, instead of local date/time for some
reason?
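
The GMT question can at least be checked arithmetically. The sketch below assumes the era boundary falls at local midnight in Japan, modelled as a fixed UTC+9 offset; that boundary is an assumption for illustration, not something the Standard or this message states.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+9 offset used as a stand-in for Japan Standard Time.
JST = timezone(timedelta(hours=9), "JST")

# Assumed boundary: Showa ends at the end of 7 January 1989 local time,
# Heisei begins at 00:00 on 8 January 1989 local time.
showa_last_instant = datetime(1989, 1, 7, 23, 59, 59, tzinfo=JST)
heisei_first_instant = datetime(1989, 1, 8, 0, 0, 0, tzinfo=JST)

showa_end_utc = showa_last_instant.astimezone(timezone.utc)
heisei_start_utc = heisei_first_instant.astimezone(timezone.utc)

print("Heisei start, UTC date:", heisei_start_utc.date())
print("Showa end, UTC date:   ", showa_end_utc.date())
```

Under that assumption a UTC reading would account for a Heisei start of 1989-01-07, but the last instant of Showa also falls on 1989-01-07 in UTC, not 1989-01-06, so a time-zone shift alone does not seem to explain both table entries.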

REFERENCE

[1] http://www.unicode.org/versions/Unicode9.0.0/UnicodeStandard-9.0.pdf

[2] https://ja.wikipedia.org/wiki/%E5%B9%B3%E6%88%90
>
1989????64??1?7????????????????????????????????????????????1989????64??1?7????????????????????????1?8???????????

[3] https://en.wikipedia.org/wiki/Heisei_period
> Thus, 1989 corresponds to Shōwa 64 until 7 January and Heisei 1 ... since
> 8 January.
> On 7 January 1989, at 07:55 JST, the Grand Steward of Japan's Imperial
> Household Agency, Shōichi Fujimori, announced Emperor Hirohito's death,...
> The Heisei era went into effect immediately upon the day after Emperor
> Akihito's succession to the throne on 7 January 1989.

From christoph.paeper at crissov.de  Thu Sep 29 01:00:59 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Thu, 29 Sep 2016 08:00:59 +0200
Subject: IJ with accent
In-Reply-To: <57EB7849.3070908@yspu.org>
References: <57EB7849.3070908@yspu.org>
Message-ID: <F36C1D54-341C-46CB-985F-2E8C4E5F05C4@crissov.de>

a.lukyanov <a.lukyanov at yspu.org>:
> 
> Dutch language writing uses the ligature ĳ (U+0132, U+0133). When accented, it should take an accent on each component,
> 
> However, if the monolithic ligature ĳ is used, how can one accent it correctly?

JFTR:

- ĳ U+0133
- ĳ́ U+0133+0301
- ĳ̋ U+0133+030B
- y U+0079
- ý U+0079+0301
- ý U+00FD
- y̋ U+0079+030B
- ÿ U+00FF
- ÿ́ U+00FF+0301
- ÿ̋ U+00FF+030B

<https://www.microsoft.com/typography/otspec/features_ko.htm#locl>
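
The sequences listed above can be checked against the Unicode Character
Database. The following is only a minimal sketch using Python's standard
unicodedata module; the helper name codepoints and the variable names are
illustrative assumptions, and results depend on the Unicode version bundled
with the interpreter. It shows that y + U+0301 composes to U+00FD under NFC,
whereas the ligature U+0133 has no precomposed accented form and only a
compatibility decomposition to "i" + "j".

```python
import unicodedata


def codepoints(text):
    """Return the code points of text in U+XXXX notation (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# y + COMBINING ACUTE ACCENT has a precomposed form, so NFC composes it.
y_acute = unicodedata.normalize("NFC", "\u0079\u0301")

# The ij ligature has no precomposed accented forms, so NFC keeps the
# combining mark as a separate code point after U+0133.
ij_acute = unicodedata.normalize("NFC", "\u0133\u0301")
ij_double_acute = unicodedata.normalize("NFC", "\u0133\u030B")

# U+0133 has only a compatibility decomposition, to "i" + "j".
ij_compat = unicodedata.normalize("NFKD", "\u0133")

# Two separate letters, each with an acute: i + acute composes to U+00ED,
# while j + acute stays as two code points.
two_letters = unicodedata.normalize("NFC", "i\u0301j\u0301")

for label, value in [
    ("y + acute (NFC)", y_acute),
    ("ij ligature + acute (NFC)", ij_acute),
    ("ij ligature + double acute (NFC)", ij_double_acute),
    ("ij ligature (NFKD)", ij_compat),
    ("i + acute, j + acute (NFC)", two_letters),
]:
    print(f"{label}: {codepoints(value)}")
```

Whether the single combining mark after U+0133 is drawn as one accent or
two is a matter for the font, not for normalization.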

From a.lukyanov at yspu.org  Thu Sep 29 02:45:07 2016
From: a.lukyanov at yspu.org (a.lukyanov)
Date: Thu, 29 Sep 2016 10:45:07 +0300
Subject: IJ with accent
In-Reply-To: <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com>
References: <57EB7849.3070908@yspu.org>
 <80540082-B179-4512-A635-DA86AAFDE4B3@evertype.com>
Message-ID: <57ECC683.9090302@yspu.org>

28.09.2016 22:33, Michael Everson wrote:
> The right way to do this is to follow the ligature (capital or small) with U+0301 and then have your font draw two acute accents on the ligature.
>

That seems good; still, the Unicode Standard says nothing about it. And 
doubling a diacritic is not exactly self-evident.

It would be nice to have an explicit description of this issue somewhere 
in the "Europe-I" section.

From verdy_p at wanadoo.fr  Thu Sep 29 05:06:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Sep 2016 12:06:22 +0200
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>
References: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>
Message-ID: <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>

Is it possible that these eras start at midday instead of noon ? This could
explain the date difference, if you do not set the time in your query (your
query will assume a default time at 00:00 midnight)

There's a similar issue with most calendars before the modern Gregorian,
and even within historic documents still using date shifting at midday (and
then naming the morning with the previous day). This practice survived for
as long as physical 24-hour clocks were rare. Still today, many
English-speaking countries use AM/PM periods, and 12-hour clocks are used
for almost all non-electronic displays (even if some watches also include a
small circle displaying a 24-hour clock, the 12-hour display is the most
common and the easiest to read). It is a partial survival of the old Roman
calendar, which counted time negatively relative to the date defined clearly
at midday, because midday is more easily observable with good precision
than midnight.

The recent introduction of daylight saving time (and the generalization of
official times across large time zones) changed the perception of clocks, as
they were no longer synchronized with observation of the Sun. Negative
counting in dates and times has now almost disappeared (except in popular
language for counting the last minutes relative to the hour, a correct form
of precision rounding).
Dates are better understood to cover the whole working day (or rest day),
except for religious purposes (e.g. within Judaism, whose reference is the
variable time of sunset in the evening, or in Islam, also with a variable
reference time at sunrise as observed in a reference location determined by
local or national communities).

Many people still count the second half of the night after midnight as part
of the previous day (and so will say "Saturday evening"/"Saturday night"
even if it's already the first hours of Sunday).

If you test dates and don't want to specify hours, it is highly recommended
to set the default time at midday. For the Japanese eras, it's not clear at
which time they really start, except for the last two eras since WW2, but
setting the time at midday should give the correct result. However, there's no
ambiguity during the day of the era switch if the era is correctly specified
(and not just the year number in the era).



From raymond at almanach.co.uk  Thu Sep 29 05:23:17 2016
From: raymond at almanach.co.uk (Raymond Mercier)
Date: Thu, 29 Sep 2016 11:23:17 +0100
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To: <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>
References: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>
 <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>
Message-ID: <C840B63383AF46A7B498A6E1A2EE1DA8@UserPC>

Philippe,
>>Is it possible that these eras start at midday instead of noon ?
I assume you mean midnight 

RM

www.raymondm.co.uk

From duerst at it.aoyama.ac.jp  Thu Sep 29 05:45:54 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 29 Sep 2016 19:45:54 +0900
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To: <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>
References: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>
 <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>
Message-ID: <6c865cc7-8227-d72a-7794-e9fe9f3bc583@it.aoyama.ac.jp>

Just a few not very closely related comments:

On 2016/09/29 19:06, Philippe Verdy wrote:
> Is it possible that these eras start at midday instead of noon ? This could
> explain the date difference, if you do not set the time in your query (your
> query will assume a default time at 00:00 midnight)

It's extremely difficult to imagine this for Japan in this day and age.

I was in Japan when the era changed from Showa to Heisei. I remember the 
announcement very well, but I don't remember anything about the exact 
time of the cutover.


> Many people still count the second half of the night after midnight as part
> of the previous day (and so will say "Saturday evening"/"Saturday night"
> even if it's already the first hours of Sunday).

In Japan, that happens e.g. in displays of restaurants and bars, which 
may announce their opening hours as 17:30-27:00 (i.e. open until three 
in the morning the next day). But that's only a convention for 
convenience, everybody knows that it's already the next day on the calendar.


> If you test dates and don't want to specify hours, it is highly recommended
> to set the default time at midday. For the Japanese eras, it's not clear at
> which time they really start, except for the last two eras since WW2, but
> setting the time at midday should give the correct result. However, there's no
> ambiguity during the day of the era switch if the era is correctly specified
> (and not just the year number in the era).

Yes indeed. These days, people just refer to 1989 (and any dates in it) 
as Heisei 1 (????). This is all the easier because otherwise, an 
exception would be necessary for only 7 days.

On the other hand, I saw places that said Showa 64 as late as July (that 
was when I climbed Mt. Fuji; a placard put up the year before said 
"closed until July Showa 64"). I also got some money in February or so 
that year and had to sign a receipt that said Showa 64 because it was 
printed earlier.

The Japanese Wikipedia article, at the bottom of the 改元 
(https://ja.wikipedia.org/wiki/平成#.E6.94.B9.E5.85.83) section, says that 
in contrast to the two earlier changes in era, the change started on the 
next day, in order to give engineers time for the change. That next day 
was a Sunday, which meant that in effect, they had even more time, 
because most systems had to work with the new era only from Monday. But 
I guess it must have been a busy weekend for those involved, anyway.

To know all the details, the best thing to do would be to check the 
official government documents, which should be available online. But I 
wouldn't be surprised if they were not specifying things to the second.

Regards,    Martin.


-- 
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From doug at ewellic.org  Thu Sep 29 11:02:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 29 Sep 2016 09:02:29 -0700
Subject: IJ with accent
Message-ID: <20160929090229.665a7a7059d7ee80bb4d670165c8327d.527942b7de.wbe@email03.godaddy.com>

Kent Karlsson wrote:

>> I've never seen an ij with an accent. You can safely assume it's
>> never needed.
>
> See
> https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken

I note with amusement that this Wikipedia page, presumably written and
edited by Dutch speakers who we often hear insist on the precomposed
letters, contains more than 30 instances of IJ or ij (the separate Basic
Latin letters) and zero instances of Ĳ or ĳ.

íj́, as others have observed, is trivially simple.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org


From everson at evertype.com  Thu Sep 29 11:17:02 2016
From: everson at evertype.com (Michael Everson)
Date: Thu, 29 Sep 2016 09:17:02 -0700
Subject: IJ with accent
In-Reply-To: <F36C1D54-341C-46CB-985F-2E8C4E5F05C4@crissov.de>
References: <57EB7849.3070908@yspu.org>
 <F36C1D54-341C-46CB-985F-2E8C4E5F05C4@crissov.de>
Message-ID: <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>

y is not an acceptable variant of ĳ though. “Byoux” is not correct; “bijoux” or “bĳoux” is…

> JFTR:
> 
> - ĳ U+0133
> - ĳ́ U+0133+0301
> - ĳ̋ U+0133+030B
> - y U+0079
> - ý U+0079+0301
> - ý U+00FD
> - y̋ U+0079+030B
> - ÿ U+00FF
> - ÿ́ U+00FF+0301
> - ÿ̋ U+00FF+030B
> 
> <https://www.microsoft.com/typography/otspec/features_ko.htm#locl>


From alex.plantema at xs4all.nl  Thu Sep 29 11:25:24 2016
From: alex.plantema at xs4all.nl (Alex Plantema)
Date: Thu, 29 Sep 2016 18:25:24 +0200
Subject: IJ with accent
References: <20160929090229.665a7a7059d7ee80bb4d670165c8327d.527942b7de.wbe@email03.godaddy.com>
Message-ID: <AA8BBBEE40A841018923720A7F9350EF@p4>

Op donderdag 29 september 2016 18:02 schreef Doug Ewell:

> Kent Karlsson wrote:
>
>>> I've never seen an ij with an accent. You can safely assume it's
>>> never needed.
>>
>> See
>> https://nl.wikipedia.org/wiki/Accenttekens_in_de_Nederlandse_spelling#Klemtoonteken
>
> I note with amusement that this Wikipedia page, presumably written and
> edited by Dutch speakers who we often hear insist on the precomposed
> letters, contains more than 30 instances of IJ or ij (the separate
> Basic Latin letters) and zero instances of Ĳ or ĳ.
>
> íj́, as others have observed, is trivially simple.

The precomposed version isn't recommended anymore.
The ij evolved from ii, because ii is indistinguishable from ? in handwriting.

Alex.


From verdy_p at wanadoo.fr  Thu Sep 29 12:16:33 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 29 Sep 2016 19:16:33 +0200
Subject: IJ with accent
In-Reply-To: <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>
References: <57EB7849.3070908@yspu.org>
 <F36C1D54-341C-46CB-985F-2E8C4E5F05C4@crissov.de>
 <88B8CCD7-6151-4DD3-8D5B-DCB63D184DB0@evertype.com>
Message-ID: <CAGa7JC0BLKEL90byAbXaWWxVeAS+OK6Cn7YDsBp7rAKg=N70=w@mail.gmail.com>

Actually your example is not contrived: it cites words in French, which
makes no use at all of this Dutch digraph. French, however, recognizes "ÿ"
as a valid letter in its alphabet and distinguishes it from "y" and "ij".

But with the old Dutch way of writing ij, it would become ? (keeping the
dots), not "y", so your incorrect example "bijou(x)" would appear as
"B?ou(x)", not "Byou(x)"... if only it were Dutch and if there were no
syllable break between i and j, as in this actual French word "bi-jou(x)".

In capitals the dots would disappear and "BIJOU(X)" would become "B?OU(X)"
(with the ligature... if only it were Dutch), but the normal French "?"
(which occurs in rare words) treats the dots as a diaeresis, where
there's a clear syllable break before it, as "?" only occurs after another
vowel, so that "?" becomes a plain vowel /i/ with a leading glottal stop,
and not the half-consonant /j/:

   "L'Ha?es-les-Roses" is clearly pronounced /la??i?l???oz/ (as if it were
written "L'Hahi(es)-les-Roses"),
   but without this diaeresis it would be read incorrectly as
/laj?l???oz/ (as if it were written "L'A?l-les-Roses")
   (the "-es" termination is mute here).

The need for a diaeresis is very rare with "y" in French, where "y" is
normally /j/ after a vowel (but not before a final mute "e"), or /i/ after
a consonant, and the digraphs "ay" and "oy" work like "ai" /?/ and
"oi" /wa/ when final, or before a consonant, or before other final mute
letters. Why there's a "y" and not an "i" here is historical: it was initially
pronounced /la?ji?l???oz/ and could then have been rewritten as
"L'Hayies-les-Roses", but possibly incorrectly read as /l??ji?l???oz/
(using the normal pronunciation of the "ay" digraph, like "ai"). The diaeresis
solved the reading problem: the "y" was kept but without any following "i",
to make sure it is not turned into a half-consonant /j/ and remains a
plain /i/ vowel, and the diaeresis implies the glottal-stop separation of
syllables.

All this is not relevant for "bijou(x)" or "BIJOU(X)", and not relevant for
Dutch, which treats the digraph "ij" most often as a long form of the vowel
/i/ alone (and not a pair with the vowel /i/ and a consonant /?/ or /d?/
or /j/ when there's a syllable break between them). In French, long vowels
are no longer distinguished phonetically and never orthographically; other
languages use diacritics such as a macron (for Japanese romanization) or an
acute accent over stressed/long vowels. I suppose that the need to add an
acute accent to the Dutch digraph "ij" is not just to mark the length, but
also the stress (accents are placed on both letters of the digraph, but it
could as well have been a single macron, a very unusual diacritic in Dutch).



From duerst at it.aoyama.ac.jp  Fri Sep 30 00:43:45 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Fri, 30 Sep 2016 14:43:45 +0900
Subject: Dates in Japanese Era Names in Unicode Standard
In-Reply-To: <CAGjN4aW6wgSQQ63aL1B430mLjSk7G1BQusTuoZ0G0jgtrSAXcQ@mail.gmail.com>
References: <CAGjN4aU6_-YM-EZidkbvXGkNZq98NJqgiQJ9kQoCbyt-CXH9Kg@mail.gmail.com>
 <CAGa7JC2DqS1+gf6K_NA-zeA1dWDas9tdTjA=Gep=7sdZVGjQiA@mail.gmail.com>
 <6c865cc7-8227-d72a-7794-e9fe9f3bc583@it.aoyama.ac.jp>
 <CAGjN4aW6wgSQQ63aL1B430mLjSk7G1BQusTuoZ0G0jgtrSAXcQ@mail.gmail.com>
Message-ID: <59642171-c152-0863-8165-ac48ace1d9a1@it.aoyama.ac.jp>

Hello Junichi,

Your analysis sounds very plausible. I suggest you send an official 
error report using http://www.unicode.org/reporting.html.

Regards,   Martin.

On 2016/09/30 13:16, ?? ?? wrote:
>> Is it possible that these eras start at midday instead of noon ?
>> This could explain the date difference, if you do not set the time in
> your query
>> (your query will assume a default time at 00:00 midnight)
>
> The new era starts at 00:00 midnight local time.
> Considering the time zone difference, I assume that the cause was a
> simple chain of mistakes while drafting the Unicode document.
>
> My story:
>
> First, the author of Table 22-8 asks somebody to send a list of the
> dates.
> For the table to work, "day" accuracy should be enough, rather than
> "time" accuracy.
> The "day" value is thus recorded in YYYYMMDD format.
> It is then listed in a file format such as a spreadsheet, which keeps day
> values with "time" accuracy and a time zone marker.
> As there is no intention to keep it with "time" accuracy, let's suppose that
> a default marker such as UTC+0 is embedded automatically.
>
> The spreadsheet is then sent to the author and opened in a more "Western"
> time zone than the one it was recorded in.
> Upon opening the file, the dates are converted to the local time zone.
> Specifying a more "Western" time zone results in earlier date values.
> Thus the earlier values are picked up by the author for Table 22-8.
>
> In fact, all of the day values in Table 22-8 are shifted one day earlier.
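
The hypothesis in the story above can be illustrated with a short Python
sketch. It is only a demonstration of the suspected mechanism, not a
reconstruction of how Table 22-8 was actually produced: the function name
date_seen_in_zone and the 8-hour westward offset are assumptions chosen for
illustration. A date-only value is stored as midnight UTC and then read
back as a calendar date in a zone west of UTC.

```python
from datetime import date, datetime, timedelta, timezone


def date_seen_in_zone(day, utc_offset_hours):
    """Store `day` as midnight UTC, then return the calendar date shown
    in a zone with the given fixed UTC offset (hypothetical helper)."""
    stored = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    local = stored.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.date()


# Era start dates as cited in this thread.
official_starts = {
    "heisei": date(1989, 1, 8),
    "syouwa": date(1926, 12, 25),
    "taisyou": date(1912, 7, 30),
}

# Assumption: the file is opened in a zone 8 hours west of UTC.
shifted = {
    era: date_seen_in_zone(day, -8) for era, day in official_starts.items()
}

for era, day in official_starts.items():
    print(f"{era}: {day.isoformat()} -> {shifted[era].isoformat()}")
```

Under these assumptions each start date comes out one day earlier, which
matches the current values in Table 22-8; any offset west of UTC would give
the same result.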
>
> Current values:
> U+337B square era name heisei 1989-01-07 to present day
> U+337C square era name syouwa 1926-12-24 to 1989-01-06
> U+337D square era name taisyou 1912-07-29 to 1926-12-23
> U+337E square era name meizi 1867 to 1912-07-28
>
> Suggested correction:
> U+337B square era name heisei 1989-01-08 to present day
> U+337C square era name syouwa 1926-12-25 to 1989-01-07
> U+337D square era name taisyou 1912-07-30 to 1926-12-24
> U+337E square era name meizi 1868 to 1912-07-29
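
As a rough sketch of how the suggested corrected dates would be used in
practice, the following Python maps a Gregorian date to an era name and a
year within that era. The names ERA_STARTS and era_for are hypothetical.
The Meiji boundary of 1868-01-25 is an assumption taken from the
lunar-to-Gregorian conversion given later in this message, and counting
Meiji years by Gregorian year is a simplification, since the lunar calendar
was in use until Meiji 5.

```python
from datetime import date

# (first day of era, romanized name), newest first, using the suggested
# corrected dates. The Meiji start day is an assumption (see lead-in).
ERA_STARTS = [
    (date(1989, 1, 8), "heisei"),
    (date(1926, 12, 25), "syouwa"),
    (date(1912, 7, 30), "taisyou"),
    (date(1868, 1, 25), "meizi"),
]


def era_for(day):
    """Return (era name, year within era) for a Gregorian date."""
    for start, name in ERA_STARTS:
        if day >= start:
            # The first calendar year of an era counts as year 1.
            return name, day.year - start.year + 1
    raise ValueError(f"{day.isoformat()} is before the eras covered here")


for probe in (date(1989, 1, 7), date(1989, 1, 8), date(1926, 12, 25)):
    name, year = era_for(probe)
    print(f"{probe.isoformat()}: {name} {year}")
```

With these boundaries, 1989-01-07 still falls in syouwa 64 and 1989-01-08
is the first day of heisei 1.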
>
>
> Here are some citations.
>
> I will cite from the most reliable source, law database provided by the
> government (in Japanese).
> This is the actual law about when Heisei shall start:
> http://law.e-gov.go.jp/cgi-bin/idxselect.cgi?IDX_OPT=1&H_NAME=%8C%B3%8D%86%82%F0%89%FC%82%DF%82%E9%90%AD%97%DF&H_NAME_YOMI=%82%A0&H_NO_GENGO=H&H_NO_YEAR=&H_NO_TYPE=2&H_NO_NO=&H_FILE_NAME=S64SE001&H_RYAKU=1&H_CTG=1&H_YOMI_GUN=1&H_CTG_GUN=1
>
>> ???????????????
>> ...
>> ??????????
>> ??
>> ????????????????????
>
> Translation:
>> Showa 64 January 7 Ordinance 1
>> ...
>> Era name shall be Heisei.
>> Appendix
>> This ordinance shall take effect on the day after promulgation.
>
> The promulgation date was January 7.
> As Martin mentioned, Heisei started on the day after the announcement.
> Thus Showa lasted until midnight at the very end of January 7, and Heisei
> started at the very beginning of January 8.
>
>> On the other hand, I saw places that said Showa 64 as late as July (that
>> was when I climbed Mt. Fuji; a placard put up the year before said
>> "closed until July Showa 64").
>
> I remember the same thing from when I was a child.
> For about half a year, many things such as application forms and street
> signs still displayed Showa dates. I saw passports and licenses showing
> expiration dates such as Showa 70 or 80. Coins are minted and stocked before
> release, so there are Showa 64 coins in circulation.
>
> People often carry a conversion table like:
> 1986 : Showa 61
> 1987 : Showa 62
> 1988 : Showa 63
> 1989 : Showa 64 : Heisei 1
> 1990 : Showa 65 : Heisei 2
> 1991 : Showa 66 : Heisei 3
>
> I also cite the start of Showa. This citation is from Wikisource, another
> reliable source for public documents.
> https://ja.wikisource.org/wiki/%E6%98%AD%E5%92%8C%E3%83%88%E6%94%B9%E5%85%83
>> ??????????????????????????????????????????????????????????
>> ????
>> ????????????
> Translation:
>> In the name of the Emperor, who has inherited the sovereignty to administer
>> state affairs, We let Taisho 15 December 25 and onward be the beginning of
>> Showa.
>> Signed by the Emperor
>> Taisho 15 December 25
> As Martin mentioned, for eras before Heisei the announcement took effect
> on the same day, so the new era overwrites the old one on that day.
>
>
> Here is start of Taisho:
> https://ja.wikisource.org/wiki/%E6%98%8E%E6%B2%BB%E5%9B%9B%E5%8D%81%E4%BA%94%E5%B9%B4%E4%B8%83%E6%9C%88%E4%B8%89%E5%8D%81%E6%97%A5%E4%BB%A5%E5%BE%8C%E3%83%B2%E6%94%B9%E3%83%A1%E3%83%86%E5%A4%A7%E6%AD%A3%E5%85%83%E5%B9%B4%E3%83%88%E7%88%B2%E3%82%B9
>> ????????????????????????????
>> ??????????????????????????????????????
>> ????
>> ???????????
>
> Translation:
>> In the name of the Emperor, under the inherited spirit of sovereignty to
>> administer state affairs with virtue, We let, regarding the ordinance
>> enacted by the previous Emperor, Meiji 45 July 30 and onward be the
>> beginning of Taisho.
>> Signed by the Emperor
>> Meiji 45 July 30
>
> With this law, Meiji 45 July 30 is overwritten by Taisho 1 July 30.
>
>
> Lastly, here is start of Meiji.
> https://ja.wikisource.org/wiki/%E4%BB%8A%E5%BE%8C%E5%B9%B4%E8%99%9F%E3%83%8F%E5%BE%A1%E4%B8%80%E4%BB%A3%E4%B8%80%E8%99%9F%E3%83%8B%E5%AE%9A%E3%83%A1%E6%85%B6%E6%87%89%E5%9B%9B%E5%B9%B4%E3%83%B2%E6%94%B9%E3%83%86%E6%98%8E%E6%B2%BB%E5%85%83%E5%B9%B4%E3%83%88%E7%88%B2%E3%82%B9%E5%8F%8A%E8%A9%94%E6%9B%B8
>> ??
>> ...?????????????????????????????
>> ????????
>
> Translation:
>> Imperial Edict
>> ... Keio 4 shall be renamed Meiji 1, and from now on the tradition of
>> frequent renaming of eras shall be limited to one era per Emperor.
>
> Since Meiji, the era has been renewed less frequently. It is more engineer
> friendly!
>
> In Table 22-8, the Meiji start day is omitted.
> The omission itself is reasonable. It avoids controversy over writing the
> day according to the lunar calendar, which was used until midnight on
> Meiji 5 December 2. (The next day is Meiji 6 January 1.)
>
> The problem here is the year shown as 1867.
> The ordinance was released on Meiji 1 September 8 (lunar), which was 1868
> October 23 (Gregorian).
> Meiji 1 January 1 (lunar), which is also Keio 4 January 1 (lunar), is 1868
> January 25 (Gregorian).
> My best guess is that the author of Table 22-8 picked up the year value
> from a spreadsheet showing "1867-12-31" in local time, originally intended to
> show merely "1868-01".
>

-- 
Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan

From glorieul at coanda-deviation.info  Fri Sep 30 04:57:15 2016
From: glorieul at coanda-deviation.info (Gael Lorieul)
Date: Fri, 30 Sep 2016 11:57:15 +0200
Subject: Why incomplete subscript/superscript alphabet ?
Message-ID: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>

Hello all,

I wonder why only a subset of the alphabet is available as subscript
and/or superscript ?

This is well illustrated on the table in the following Wikipedia page:

https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Latin_and_Greek_tables

Is there a reason for this ?

I would love to have these characters available because I often use
Unicode to write equations as comments of a source code. For instance:

     class Term_diff_rotDivStressTensor_splitted
     /**
      * Computes:
      *
      *     ?       ???        ?1              ?
      *     ?.?? + ??????u + ????.(?u + ?u?)????
      *     ?       ???        ??              ?
     */
     {
         [...] (class definition)
     }


or a more problematic example:

     /*
      *                    ?t???
      *     q(t?) ? q(t?) +?   rhs(q,t) dt  +   (t??? - t?????)
      *                    ?t?????
     */

Here "end" and "start" would have been better as subscripts, but I could
not do so because the letter "d" is not available as a subscript...

As you can see, having only some letters available as subscripts (and
superscripts) is sometimes a pain...


Gaël Lorieul

PhD student in Computational Fluid Dynamics
at Université catholique de Louvain

From jkorpela at cs.tut.fi  Fri Sep 30 10:07:29 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 18:07:29 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID: <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>

30.9.2016, 12:57, Gael Lorieul wrote:

> I wonder why only a subset of the alphabet is available as subscript
> and/or superscript ?

This is explained in section 22.4 of the standard:
http://www.unicode.org/versions/Unicode9.0.0/ch22.pdf#page=25

To put it briefly, in my interpretation, subscript and superscript 
characters have been encoded in Unicode only if they have specialized, 
defined meaning in some notations (e.g. superscript letters in phonetic 
notations) or if they exist in some legacy character encoding.

Apart from specialized cases, the recommended approach is to use higher
protocols (such as formatting or markup). So instead of trying to find
superscript letters for “end”, you should consider using rich text or a
markup language so that the word written with normal letters “end” is
formatted or marked up as a superscript.
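
As a rough way to see which letters are affected, the following Python
sketch scans the Unicode Character Database bundled with the interpreter
and lists the basic Latin lowercase letters that have an encoded subscript
or superscript form. It is only an illustration: the helper name
scripted_letters is made up here, and the exact result depends on the
Unicode version of the local Python build.

```python
import sys
import unicodedata


def scripted_letters(tag):
    """Return the letters a-z that have an encoded variant whose
    compatibility decomposition is '<tag> XXXX' (tag: 'sub' or 'super')."""
    found = set()
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        # A matching entry looks like ['<sub>', '0065'].
        if len(parts) == 2 and parts[0] == "<" + tag + ">":
            base = chr(int(parts[1], 16))
            if "a" <= base <= "z":
                found.add(base)
    return found


alphabet = set("abcdefghijklmnopqrstuvwxyz")
subscripts = scripted_letters("sub")
superscripts = scripted_letters("super")

print("subscript letters:   ", "".join(sorted(subscripts)))
print("superscript letters: ", "".join(sorted(superscripts)))
print("no subscript form:   ", "".join(sorted(alphabet - subscripts)))
print("no superscript form: ", "".join(sorted(alphabet - superscripts)))
```

With a current Python 3, "d" is listed among the superscript letters but
not among the subscript ones, which is the gap the original question ran
into.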

Yucca


From jknappen at web.de  Fri Sep 30 10:08:52 2016
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Fri, 30 Sep 2016 17:08:52 +0200
Subject: Aw: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID: <trinity-4b12202f-6c78-4d10-aac8-25307d97483f-1475248132453@3capp-webde-bs49>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160930/48a820be/attachment.html>

From verdy_p at wanadoo.fr  Fri Sep 30 10:19:34 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 17:19:34 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
Message-ID: <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>

Your problem here is that "start" and "end" are not symbols/variables but
actual English words. Why would this usage be restricted only to English?
The same formula would really need to be translated into various languages
and scripts, which would then require mapping all letters in Latin, Greek,
Cyrillic, and even Arabic, Japanese, Chinese, Hindi...

Plain-text comments in source code generally do not need a very friendly
layout; they can remain more symbolic, and you should not even need to
split these formulas across multiple lines using broken characters (such
as the pieces of parentheses and square brackets, whose presence in Unicode
is justified only for mapping legacy characters used to render actual text
on old monospace-only terminals).

Your source code is intended for programmers and would be better served
by a technical notation.

If you want to include a conventional formula, include a URL pointing to an
image or to an anchor in some document (an HTML, PDF, or Doc(x) file), or a
reference to a page in a book.
So I suggest you use a notational convention such as TeX here if you
want to be exact (this notation may differ from the actual
implementation in the documented code).

The superscripts/subscripts in Unicode have been encoded mostly because
they are needed for the orthography of some languages, either as distinct
letters or, most often, as modifiers; they are not intended to be used to
compose separate words like "start" or "end" here.

Note also that many tools that generate documentation from source code allow
you to insert HTML in comments, so you could just as well use <sub></sub>,
and then we don't need these additions (which would open the door to
reencoding almost all letters in all scripts as subscripts/superscripts,
with many new problems for their diacritics).

Just consider how you would translate your formula into French: "start"
would become "début" (note the combining acute accent...). Here again, a
TeX or HTML notation solves the problem: use <sub>début</sub> in the
formula, or a <math>...</math> HTML element to embed a complete MathML
formula.

Your source code documentation is not necessarily in English. English is
used frequently in corporate code and in many open-source projects, but not
always. There is even open-source code managed by teams speaking another
language, for projects mostly targeting another language or an
organization that wants or requires documentation in another language
(notably for the public APIs; internal/private APIs are often excluded from
doc generation tools, so programmers are free to use any language that is
convenient to them, but they won't spend much time tuning these comments
so that they are perfectly readable, with all the exact linguistic and
scriptural features, and good-looking for many readers). Discussing these
projects in English would exclude valuable contributions from the target
users of the application, and could introduce incorrect terms or very fuzzy
translations into English when there are other requirements (notably for
terms with legal meaning).

OK, the terms "end" and "start" are understood by all programmers, but not
necessarily by all users of a public API (who may use it through other code
generation helpers, templates, HTML/application input forms and so on).



From jkorpela at cs.tut.fi  Fri Sep 30 10:54:27 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 18:54:27 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
Message-ID: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>

30.9.2016, 18:19, Philippe Verdy wrote:

> Note also that many tools generating documentation from source code
> allow you to insert HTML comments, so you could as well use <sub></sub>,

Yes, but there’s a serious typographic pitfall with this, as well as
with using e.g. subscript or superscript formatting in a word processor. 
The problem is that the rendering is almost always simplistic: letters 
(or other characters) of the current font are used in reduced size and 
in lowered or raised position. The result is that the glyphs have 
reduced stroke width too, and the position change very often causes line 
spacing to be uneven.

The typographically correct implementation of such formatting or markup 
would use subscript or superscript glyphs from the font, designed by the 
font creator to match the style of the font. This is more difficult than 
the simplistic approach, and of course it is possible only when using a 
font that contains such glyphs.

Using HTML, for example, the way to achieve that at present would be to 
use markup like <span class="sub">...</span> (to avoid the problems 
caused by the default formatting of <sub> and <sup>) and to use a CSS 
style sheet that sets font-family suitably and uses OpenType font 
feature settings to select subscript or superscript glyphs. In practice, 
you would need to use @font-face to embed a suitable OpenType font. So
it’s doable, but not trivial like just slapping <sub> and </sub> around
some text.

A practical conclusion is that if you need only e.g. 2 and 3 as
superscripts (a rather common situation in general texts, where you
just need m² or m³), it is much simpler to use the relevant Unicode
superscript characters (instead of e.g. m<sup>2</sup>). This means using
typographer-designed superscript glyphs in a simple and reliable way.
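
To illustrate that conclusion, here is a small Python sketch that turns
digit-only <sup>...</sup> markup into the encoded superscript digits and
leaves any other <sup> content alone, since no complete superscript
alphabet exists. The function name sup_markup_to_unicode is hypothetical,
and the pattern assumes plain <sup> tags without attributes.

```python
import re

# SUPERSCRIPT ZERO..NINE: U+2070, U+00B9, U+00B2, U+00B3, U+2074..U+2079.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

# Only ASCII digits qualify; anything else inside <sup> is left as markup.
SUP_DIGITS = re.compile(r"<sup>([0-9]+)</sup>")


def sup_markup_to_unicode(text):
    """Replace <sup>digits</sup> with Unicode superscript digit characters."""
    return SUP_DIGITS.sub(
        lambda m: m.group(1).translate(SUPERSCRIPT_DIGITS), text
    )


print(sup_markup_to_unicode("an area of 10 m<sup>2</sup>"))
print(sup_markup_to_unicode("t<sup>end</sup> stays as markup"))
```

The second call returns its input unchanged: a word such as "end" has to
remain markup.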

Yucca


From leoboiko at gmail.com  Fri Sep 30 11:11:19 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Fri, 30 Sep 2016 13:11:19 -0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID: <CAJ6uix7zrqP-6=D+D5ECOYSdqhFtDgmf2x4YvENRKUj=K5JWoQ@mail.gmail.com>

The Unicode codepoints are not intended as a place to store typographically
variant glyphs (much like the Unicode "italic" characters aren't designed
as a way of encoding italic faces). The correct thing here is that the
markup and the font-rendering systems *should* automatically work together
to choose the proper face, as they already do with italics or optical sizes,
and as they should do with true small-caps etc.

I agree that our current systems are typographically atrocious and an
abomination before the God of good taste, and I don't blame anyone for
resorting to Unicode tricks to work around that. But that's a crummy
stopgap at best, and legitimizing it would be counterproductive in the long
run, not to mention ethnocentric (unless you want Unicode sub- and
superscript codepoints for every single existing character ever, including
the full Han set).

Rather, let's bug the authors of font rendering systems, user interface
libraries, text editors, web browsers etc. for halfway decent typography.


From jkorpela at cs.tut.fi  Fri Sep 30 11:31:58 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 19:31:58 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <CAJ6uix7zrqP-6=D+D5ECOYSdqhFtDgmf2x4YvENRKUj=K5JWoQ@mail.gmail.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
 <CAJ6uix7zrqP-6=D+D5ECOYSdqhFtDgmf2x4YvENRKUj=K5JWoQ@mail.gmail.com>
Message-ID: <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>

30.9.2016, 19:11, Leonardo Boiko wrote:

> The Unicode codepoints are not intended as a place to store
> typographically variant glyphs (much like the Unicode "italic"
> characters aren't designed as a way of encoding italic faces).

There is no disagreement on this. What I was pointing at was that when 
using rich text or markup, it is complicated or impossible to have 
typographically correct glyphs used (even when they exist), whereas the 
use of Unicode codepoints for subscript or superscript characters may do 
that in a much simpler way.

> The
> correct thing here is that the markup and the font-rendering systems
> *should* automatically work together to choose the proper face, as they
> already do with italics or optical sizes, and as they should do with
> true small-caps etc.

While waiting for this, we may need interim solutions (for a few
decades, for example). By the way, font-rendering systems don’t even do
italics the right way in all cases. They may silently use “fake italics”
(algorithmically slanted letters). (I’m not suggesting the use of
Unicode codepoints to deal with this.)

> I agree that our current systems are typographically atrocious and an
> abomination before the God of good taste, and I don't blame anyone for
> resorting to Unicode tricks to work around that.

I don’t think it’s a trick to use characters like SUPERSCRIPT TWO and
SUPERSCRIPT THREE. The practical problem is that at the point where you
need other superscripts that cannot be (reliably) produced using similar
codepoints, you will need to consider replacing SUPERSCRIPT TWO and
SUPERSCRIPT THREE by DIGIT TWO and DIGIT THREE with suitable markup or
formatting, to avoid stylistic mismatch. This isn’t as serious as it
sounds. When that day comes, you can probably do a suitable global
replace operation on your texts.
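
Such a global replace could look roughly like the following Python sketch.
It assumes that HTML-style <sup> markup is the target notation and handles
only the ten superscript digits; the function name unicode_to_sup_markup is
made up for this example.

```python
import re

# Map SUPERSCRIPT ZERO..NINE back to the ordinary digits.
SUPERSCRIPT_TO_DIGIT = {
    "\u2070": "0", "\u00b9": "1", "\u00b2": "2", "\u00b3": "3",
    "\u2074": "4", "\u2075": "5", "\u2076": "6", "\u2077": "7",
    "\u2078": "8", "\u2079": "9",
}

# One or more consecutive superscript digits form a single exponent.
SUPERSCRIPT_RUN = re.compile("[" + "".join(SUPERSCRIPT_TO_DIGIT) + "]+")


def unicode_to_sup_markup(text):
    """Rewrite runs of superscript digits as <sup>digits</sup> markup."""
    def replace_run(match):
        digits = "".join(SUPERSCRIPT_TO_DIGIT[ch] for ch in match.group(0))
        return "<sup>" + digits + "</sup>"

    return SUPERSCRIPT_RUN.sub(replace_run, text)


print(unicode_to_sup_markup("m\u00b2 or m\u00b3"))
```

Grouping consecutive superscript digits into one run keeps a multi-digit
exponent inside a single <sup> element instead of one element per digit.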

Yucca


From verdy_p at wanadoo.fr  Fri Sep 30 11:36:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 18:36:22 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
Message-ID: <CAGa7JC19pBzfZXA5hXvD-_G2kuADLQ0V+gfjFf7SKi9jgZpQsA@mail.gmail.com>

2016-09-30 17:54 GMT+02:00 Jukka K. Korpela <jkorpela at cs.tut.fi>:

> Using HTML, for example, the way to achieve that at present would be to
> use markup like <span class="sub">...</span> (to avoid the problems caused
> by the default formatting of <sub> and <sup>) and to use a CSS style sheet
> that sets font-family suitably and uses OpenType font feature settings to
> select subscript or superscript glyphs. In practice, you would need to use
> @font-face to embed a suitable OpenType font. So it’s doable, but not
> trivial like just slapping <sub> and </sub> around some text.
>

Not needed. The <sub> and <sup> elements in HTML can be styled directly as
well (also with CSS), with clear implied semantics, without needing to
create a custom class on a non-semantic <span> element.
Here the intent in the formula was clearly to designate a subscript
notation (as opposed to a superscript, whose meaning in formulas after the
symbol of a variable is generally an exponent). Superscripts after
other symbols (such as a summation operator) generally designate something
else (an upper bound). After some operators such as "C", they denote the
cardinality of a set from which all possible unordered combinations
(distinct subsets) are counted. In chemical formulas, superscripts and
subscripts are used before or after an element to indicate some physical
state (total charge, charge of the nucleus, total weight, 3D configuration
for compound elements and crystalline forms, orientation, number of
occurrences for subgroups in complex compounds...).

In formulas, superscripts and subscripts are parsed according to the
context after which they occur (which remaps these subscripts or
superscripts by assigning them a specific role), but alone they are just
sub/superscripts with no other semantics added (while still keeping all the
semantics of their content).

For complex compounds, these subscripts/superscripts are not enough and
specific layouts and symbols are needed, but you cannot use simple linear
plain text to represent them without defining a specific notation
convention and defining annotation terms inserted in the custom formula.
Plain-text encoding will not solve the problem of representation at the
character level: you'll need a higher-level protocol. There are infinitely
many ways to define these protocols, but they are out of scope for Unicode,
which will not encode them (in the same way that it does not encode
orthographic or script conventions for specific languages: the conventions
for technical notations create their own language).

From verdy_p at wanadoo.fr  Fri Sep 30 11:53:43 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 18:53:43 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
 <CAJ6uix7zrqP-6=D+D5ECOYSdqhFtDgmf2x4YvENRKUj=K5JWoQ@mail.gmail.com>
 <19524b6c-15d8-37e8-78a3-dee1d774c4a0@cs.tut.fi>
Message-ID: <CAGa7JC08cDcAb4_hW0F0xOCc-DPZoGXoLmpW=6h5rX1u+xhE3Q@mail.gmail.com>

2016-09-30 18:31 GMT+02:00 Jukka K. Korpela <jkorpela at cs.tut.fi>:

> 30.9.2016, 19:11, Leonardo Boiko wrote:
>
> The Unicode codepoints are not intended as a place to store
>> typographically variant glyphs (much like the Unicode "italic"
>> characters aren't designed as a way of encoding italic faces).
>>
>
> There is no disagreement on this. What I was pointing at was that when
> using rich text or markup, it is complicated or impossible to have
> typographically correct glyphs used (even when they exist), whereas the use
> of Unicode codepoints for subscript or superscript characters may do that
> in a much simpler way.


If things are simple with the few existing characters encoded in Unicode,
they should also be simple with common markup or notation systems. If not,
blame the authors of these systems for not implementing them correctly.
HTML, TeX or MathML have no problem representing these simple
superscript/subscript notations.

Use them! Including when commenting source code (you'll need these systems
anyway when parsing the source code to generate readable documentation for
your projects). Such doc-generating tools are now extremely common and used
in a lot of common programming languages. It's high time to invest in them
(most of them are integrated with code quality analysis tools and
project management tools; they generate progress reports, help tune the
APIs, help generate or check test code coverage, help track bugs,
coordinate work teams, and communicate with final users or recipients of
the software).

Programmers should all know and use some of these tools (which can also
work across multiple programming languages, as modern projects are
frequently using multiple ones, needed for the integration, deployment or
interoperability of systems). Unicode will certainly not favor a specific
system, except for specific standards widely used internationally (e.g. the
few additions requested for TeX that were needed for technical reasons, such
as specific distinctions of symbols).

From jkorpela at cs.tut.fi  Fri Sep 30 11:54:39 2016
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 30 Sep 2016 19:54:39 +0300
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <CAGa7JC19pBzfZXA5hXvD-_G2kuADLQ0V+gfjFf7SKi9jgZpQsA@mail.gmail.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
 <CAGa7JC19pBzfZXA5hXvD-_G2kuADLQ0V+gfjFf7SKi9jgZpQsA@mail.gmail.com>
Message-ID: <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>

30.9.2016, 19:36, Philippe Verdy wrote:

> 2016-09-30 17:54 GMT+02:00 Jukka K. Korpela <jkorpela at cs.tut.fi
> <mailto:jkorpela at cs.tut.fi>>:
>
>     Using HTML, for example, the way to achieve that at present would be
>     to use markup like <span class="sub">...</span> (to avoid the
>     problems caused by the default formatting of <sub> and <sup>) and to
>     use a CSS style sheet that sets font-family suitably and uses
>     OpenType font feature settings to select subscript or superscript
>     glyphs. In practice, you would need to use @font-face to embed a
>     suitable OpenType font. So it’s doable, but not trivial like just
>     slapping <sub> and </sub> around some text.
>
>
> Not needed. the <sup> and <sup> elements in HTML can be styled directly
> as well (also with CSS)

I didn’t want to go into details, but probably I now need to mention
that some browsers, rather unpleasantly, interpret relative font sizes
for <sup> and <sub> as relating to their default font size in that
browser, against CSS specs. This is frustrating enough to ignore the
“semantics” and use <span> instead. The semantics was never clear,
actually; the descriptions and examples contain both essential
superscripting (e.g. mathematical exponents) and stylistic
superscripting (e.g. rendering “1st” with the letters as superscripts).

> For complex compounds, these subscript/superscripts are not enough and
> specific layouts and symbols are needed

Certainly. Thinking of a mathematical expression with a superscript that 
has a superscript should be enough to demonstrate this.

My point, however, has been that there are many situations, in general 
texts and even in some specialized texts, where Unicode code points for 
superscripts and subscripts are very useful. It is therefore natural to 
ask why they are such incomplete sets; but I think this question has 
been answered in this discussion.

Yucca


From verdy_p at wanadoo.fr  Fri Sep 30 12:13:24 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 30 Sep 2016 19:13:24 +0200
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <CAGa7JC2JwO4OKa6tSTc1Nwy4TvXP4+_-stb7HFR4=sCCH9CqKg@mail.gmail.com>
 <65dc0e3c-011d-dba4-6126-5a7ff9596fd2@cs.tut.fi>
 <CAGa7JC19pBzfZXA5hXvD-_G2kuADLQ0V+gfjFf7SKi9jgZpQsA@mail.gmail.com>
 <84551e20-7608-c32c-472a-ce0631827802@cs.tut.fi>
Message-ID: <CAGa7JC2KRmztK97Gja-w7km3G3SeXmo+1MW0PSc0OnB-1dMc=A@mail.gmail.com>

2016-09-30 18:54 GMT+02:00 Jukka K. Korpela <jkorpela at cs.tut.fi>:

> 30.9.2016, 19:36, Philippe Verdy wrote:
>
> 2016-09-30 17:54 GMT+02:00 Jukka K. Korpela <jkorpela at cs.tut.fi
>> <mailto:jkorpela at cs.tut.fi>>:
>>
>>     Using HTML, for example, the way to achieve that at present would be
>>     to use markup like <span class="sub">...</span> (to avoid the
>>     problems caused by the default formatting of <sub> and <sup>) and to
>>     use a CSS style sheet that sets font-family suitably and uses
>>     OpenType font feature settings to select subscript or superscript
>>     glyphs. In practice, you would need to use @font-face to embed a
>>     suitable OpenType font. So it’s doable, but not trivial like just
>>     slapping <sub> and </sub> around some text.
>>
>>
>> Not needed. the <sup> and <sup> elements in HTML can be styled directly
>> as well (also with CSS)
>>
>
> I didn’t want to go into details, but probably I now need to mention that
> some browsers, rather unpleasantly, interpret relative font sizes for <sup>
> and <sub> as relating to their default font size in that browser, against
> CSS specs.


Bug the authors of these browsers. But most probably those browsers are
antique and no longer supported. So bug users of these old tools and ask
them to switch. I've not seen any decent modern browser that fails to
respect the CSS styles you set for the relative size and positioning of
superscripts/subscripts. It is very easy to do with a basic CSS stylesheet
for your document (or website/application).

There are not a lot of modern browsers. The antique browsers that are no
longer supported are those you find in embedded systems (in their limited
firmware), but they are not the best systems for reading technical
documentation, and are generally not used by programmers; they all have a
decent PC with a decent browser, or TeX tools, or decent word processors.

These bugs will be solved sooner or later IF there are people requesting
their resolution and a demonstrated usage of these tools (or fonts,
rendering libraries...), or if they pay developers for these needed
corrections. Unicode encodes characters for the long term, not because
some current tools may have rendering bugs (that are already resolved
in similar tools). Most of these tools already have several alternatives;
some will disappear, and new ones will be created offering better support.

The only things that cannot be corrected are historic documents used as
sources, for which there's a need to find an appropriate representation:
if this can be done at the character level, maybe they will be encoded,
provided that there's evidence that they require distinction.

But in your case there's no distinction: the "start" and "end" words in the
formulas are regular English words that would be better encoded using normal
Latin letters (the extra layout needed for the formula is not encodable);
using some encoded superscripts/subscripts to write them is really a hack.
Don't expect them to have a coherent layout or a style matching what you
expect in your formulas; they were mostly encoded for compatibility with
old encoding standards (because old archived documents won't be
reencoded/recreated for use with newer tools).

From KalvesmakiJ at doaks.org  Fri Sep 30 13:00:08 2016
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Fri, 30 Sep 2016 18:00:08 +0000
Subject: Why incomplete subscript/superscript alphabet ?
Message-ID: <D8491D0A-69F2-4052-B397-C506CF5766DC@doaks.org>

Newly proposed OpenType Variable Fonts may go a long way to rectifying those typographic pitfalls. The technology is some ways off, but is promising, as explained in a recent blog post by John Hudson:
https://medium.com/@tiro/https-medium-com-tiro-introducing-opentype-variable-fonts-12ba6cd2369#.ucum1whtl

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435


On 9/30/16, 11:54 AM, "Unicode on behalf of Jukka K. Korpela" <unicode-bounces at unicode.org on behalf of jkorpela at cs.tut.fi> wrote:

    there’s a serious typographic pitfall with this


From everson at evertype.com  Fri Sep 30 13:26:17 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 30 Sep 2016 11:26:17 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
Message-ID: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>

On 30 Sep 2016, at 08:07, Jukka K. Korpela <jkorpela at cs.tut.fi> wrote:

> Apart from specialized cases, the recommended approach is to use higher protocols (such as formatting or markup). So instead of trying to find superscript letters for “end”, you should consider using rich text or a markup language so that the word written with normal letters “end” is formatted or marked up as a superscript.

Even I don’t because I want stuff to be preserved in plain text.

Michael

From steve at swales.us  Fri Sep 30 13:36:32 2016
From: steve at swales.us (Steve Swales)
Date: Fri, 30 Sep 2016 11:36:32 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
 <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
Message-ID: <E1008004-00E7-46BF-BA66-D1946124B757@swales.us>

I’m with Michael on this. The obvious use case is text messaging, which has no higher protocols to leverage.

-steve

> On Sep 30, 2016, at 11:26 AM, Michael Everson <everson at evertype.com> wrote:
> 
> On 30 Sep 2016, at 08:07, Jukka K. Korpela <jkorpela at cs.tut.fi> wrote:
> 
>> Apart from specialized cases, the recommended approach is to use higher protocols (such as formatting or markup). So instead of trying to find superscript letters for “end”, you should consider using rich text or a markup language so that the word written with normal letters “end” is formatted or marked up as a superscript.
> 
> Even I don’t because I want stuff to be preserved in plain text.
> 
> Michael


From asmusf at ix.netcom.com  Fri Sep 30 15:18:56 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Fri, 30 Sep 2016 13:18:56 -0700
Subject: Why incomplete subscript/superscript alphabet ?
In-Reply-To: <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
References: <4bec7eba-d3bb-d6e3-5869-1929e17bc8a4@coanda-deviation.info>
 <563c28fc-7772-59f6-01ae-ab99bcf64a39@cs.tut.fi>
 <99AC47C7-6BAC-4D76-A669-2D7743B00B69@evertype.com>
Message-ID: <328312cd-094c-5f9b-62fd-7803e51173f8@ix.netcom.com>

An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20160930/0a848a03/attachment.html>