From unicode at unicode.org Sat Aug 4 11:51:54 2018
From: unicode at unicode.org (=?UTF-8?Q?Christoph_P=C3=A4per?= via Unicode)
Date: Sat, 4 Aug 2018 18:51:54 +0200 (CEST)
Subject: Diacritic marks in parentheses
In-Reply-To: <20180727072247.GA1728455@phare.normalesup.org>
References: <104663606.43684.1532601618204@ox.hosteurope.de> <1411637396.5643.1532603563943.JavaMail.www@wwinf1m18> <0b3f7895-1428-0363-4ce5-da82cce00032@ix.netcom.com> <20180727072247.GA1728455@phare.normalesup.org>
Message-ID: <2086677586.6791.1533401514672@ox.hosteurope.de>

Arthur Reutenauer:
> On Thu, Jul 26, 2018 at 03:41:47PM -0700, Mark Davis ?? via Unicode wrote:
>> Ein??? A???rzt???? hat eine??? Studenti???n gesehen.
> ?eine??? Student?????? gesehen?.

I certainly would not advocate going to such extremes. My issue was with putting the parentheses at the level where they belong, which would instead yield something more like

    Ein(e) Ärzt(in) hat eine(n) Student(e/i)n gesehen.

This is not how it would actually be used, though. Those short forms are mostly used outside proper prose, e.g. in diagrams, tables or forms.

Belated thanks to Marcel Schneider for pointing me to the Unicode 7.0 character I had somehow failed to find, U+1ABB COMBINING PARENTHESES ABOVE (not used in the samples above).

From unicode at unicode.org Thu Aug 9 00:37:21 2018
From: unicode at unicode.org (Shriramana Sharma via Unicode)
Date: Thu, 9 Aug 2018 11:07:21 +0530
Subject: Usage of emoji in coding contexts!
Message-ID:

First time I'm seeing this (maybe others have seen this already):

https://github.com/wei/pull

Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc.

Now *that*'s a nice creative usage of emoji IMO?

I see they haven't always used the actual emoji characters but sometimes :coloned-tags: (or what do you call it), but I presume the GitHub system will convert them to the actual characters before displaying?

--
Shriramana Sharma ????????????
???????????? ????????????????????????

From unicode at unicode.org Thu Aug 9 02:09:57 2018
From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode)
Date: Thu, 9 Aug 2018 09:09:57 +0200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

Very amusing. But interesting how it catches your eye when scanning a list.

Mark

On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
> First time I'm seeing this (maybe others have seen this already):
>
> https://github.com/wei/pull
>
> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>
> Now *that*'s a nice creative usage of emoji IMO?
>
> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>
> --
> Shriramana Sharma ???????????? ???????????? ????????????????????????

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Thu Aug 9 06:48:59 2018
From: unicode at unicode.org (Philippe Verdy via Unicode)
Date: Thu, 9 Aug 2018 13:48:59 +0200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

It's just complicated to select a coherent emoji for that (in the edit comment). My opinion is that such icons may be selected from a list as part of the GitHub "tagging" system; these icons may then appear automatically (but as there are multiple candidate tags, each one configured with its own color, there may as well be multiple emojis). The problem with this approach is that such leading emoji are difficult to edit once the GitHub edit is committed. Some of the emojis selected look very strange, or may not be the best ones (e.g. the pizza slice chosen). Some edits could not have a suitable emoji selected (e.g.
merge commits could use an icon like a Y-shaped arrow with two tails but one leading arrow: such an icon is already used by GitHub, but not in that description field). I bet this icon/emoji should be a separate field. And it could also allow setting background/foreground color for the text using a convenient palette (tested also in the presence of colored links: not all background/text colors are suitable, as seen in the color options for "Tags").

This is not just for GitHub: you have an equivalent of GitHub tags, with classification "Labels" in Gmail for example. Emojis are starting to be used in email subject lines too (but most often only by spammers trying to defeat antispam filters: most often, emojis in email subjects are strong indicators of spam or very harassing commercial ads! As they have no actual legal meaning, advertisers tend to use these emojis just to avoid publishing a statement that would be legally binding to them: these emojis are almost always defective and give false information; they are also too prominent, as if the email senders were more important than everything else the recipients are really interested in; they are almost always unnecessarily distracting, and not as important as what senders think).

2018-08-09 9:09 GMT+02:00 Mark Davis ?? via Unicode :
> Very amusing. But interesting how it catches your eye when scanning a list.
>
> Mark
>
> On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
>> First time I'm seeing this (maybe others have seen this already):
>>
>> https://github.com/wei/pull
>>
>> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>>
>> Now *that*'s a nice creative usage of emoji IMO?
>>
>> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>>
>> --
>> Shriramana Sharma ???????????? ???????????? ????????????????????????

From unicode at unicode.org Thu Aug 9 02:28:17 2018
From: unicode at unicode.org (George Pollard via Unicode)
Date: Thu, 9 Aug 2018 19:28:17 +1200
Subject: Usage of emoji in coding contexts!
In-Reply-To:
References:
Message-ID:

I've seen this codified a little in other repositories, e.g. ?? means 'only a formatting/stylistic change'.

On Thu, 9 Aug 2018 at 19:18 Mark Davis ?? via Unicode wrote:
> Very amusing. But interesting how it catches your eye when scanning a list.
>
> Mark
>
> On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <unicode at unicode.org> wrote:
>> First time I'm seeing this (maybe others have seen this already):
>>
>> https://github.com/wei/pull
>>
>> Emoji being used in commit messages for classifying the nature of the commit – bug fixes, feature additions etc
>>
>> Now *that*'s a nice creative usage of emoji IMO?
>>
>> I see they haven't used them always as the actual emoji characters but sometimes as :coloned-tags: (or what do you call it) but I presume the GitHub system will convert it to the actual characters before displaying?
>>
>> --
>> Shriramana Sharma ???????????? ???????????? ????????????????????????

From unicode at unicode.org Fri Aug 10 15:33:59 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Fri, 10 Aug 2018 22:33:59 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

Hi there!

In light of the recently featured 179 proposed Emoji Draft Candidates for Emoji 12.0, I'd like to ask whether the selection factors for future emojis shouldn't be more restrictive, or rather just enforced more strongly.

Extreme Specificity

For instance, one thing that struck me as odd in previous releases was the tendency to extreme specificity.
I always thought of Emoji as symbols and not as concrete images. In a lot of ways Emoji already do that. Every Emoji in the "Smileys" category represents an emotion that can be used to enrich the meaning of text messages, and that's perfect!

Then we have a lot of Objects such as:

- Hamburger: Represents fast food.
- Apple: Represents (healthy) food.
- Bomb: Represents threat.
- Wheelchair (12.0): Represents physical disabilities.

And those are all great objects because they also function as symbols! But there are far more, very specific or redundant objects (just from the 12.0 proposal alone):

- Guide Dog: Represents a specific physical disability.
- Service Dog: Represents a specific physical disability.
- Motorized wheelchair: Represents a specific physical disability.
- Mechanical arm: Represents a specific physical disability.
- Mechanical leg: Represents a specific physical disability.
- Ear with a hearing aid: Represents a specific physical disability.

For one, I think that it should not be the job of Emojis to express as many words as possible. And, although it seems a bit counterintuitive, having all those symbols for the sake of more inclusivity is extremely exclusionary! What about the hundreds of other disabilities that are not listed here?

In case you get the impression this is a problem with disabilities or the 12.0 proposal, let me show you what I like to call the Emoji "Family-Problem". First of all: I don't think that a man and a woman should represent all couples, and I don't think a family should be represented by a man, a woman, and their children. But I also don't believe that we should try to include every possible variant that comes to mind, because as stated before, this will lead to more exclusion through specificity. Currently, we have (among others): 1 father with 2 sons, 2 fathers with 2 sons, 2 fathers with 1 son, 2 fathers with 1 son and 1 daughter, 2 fathers with 1 daughter, and so on.
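[As background to the combinations enumerated above: these family emoji are not separate atomic characters but ZWJ sequences, i.e. ordinary person emoji joined with U+200D ZERO WIDTH JOINER per UTS #51. A minimal Python sketch; the `family` helper is purely illustrative, not part of any standard API, and whether a given combination renders as one composed glyph depends entirely on font support:]

```python
# Family emoji are ZWJ sequences: single person emoji joined with
# U+200D ZERO WIDTH JOINER (see UTS #51, "Emoji ZWJ Sequences").
ZWJ = "\u200D"  # ZERO WIDTH JOINER

MAN = "\U0001F468"    # U+1F468 MAN
WOMAN = "\U0001F469"  # U+1F469 WOMAN
GIRL = "\U0001F467"   # U+1F467 GIRL
BOY = "\U0001F466"    # U+1F466 BOY

def family(*members: str) -> str:
    """Join person emoji with ZWJ, requesting one composed glyph."""
    return ZWJ.join(members)

two_fathers_one_son = family(MAN, MAN, BOY)  # one glyph where the font supports it
single_mom_one_boy = family(WOMAN, BOY)      # unsupported sequences fall back to
                                             # the individual emoji side by side

# Underneath, each sequence is just a short string of code points:
assert len(two_fathers_one_son) == 5  # 3 people + 2 joiners
assert two_fathers_one_son.count(ZWJ) == 2
```

This fallback behavior is why arbitrary combinations degrade gracefully: a renderer that lacks a precomposed glyph for a sequence simply shows its component emoji in a row.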
But, for instance, single moms or dads with one child are missing, who by the way are in some places a very neglected part of society. And this is exactly my point: there are so many representations that every missing one is basically an insult.

Sidenote: I think the solution for the "Family-Problem" should be M-M, F-M and F-F combinations to represent couples, then the same again plus a girl and a boy to represent families. Next, add a man, woman, boy and girl emoji separately, and people can represent their families without any restriction whatsoever if they want. Plus they can express their skin colors, and even pets can be added! Because, as with every written language, symbols (or words) can be linked together to create a new meaning!

Cultural Iconography

Another thing that is worrisome is the proposed addition of a traditional Indian piece of clothing in 12.0. This is extremely specific to one culture, and I'm not sure if we want to open the gate for: "Which culture is included in Unicode and which is not?". Maybe we want that! Maybe we don't. But I think there should at least be a discussion about additions that carry such consequences. I know that there are tons of Chinese symbols in there already, but even the selection factors on the Unicode website state that the presence of a lot of material from former versions should not be a basis of justification for future additions. For instance, the Tokyo Tower emoji does not justify the Eiffel Tower emoji. [link]

Emotions

And my last point, maybe even the most important one: there are currently 63 candidates for Emoji 12.0 and only one (ONE!) is an actual smiley. And I think this category is the most important (and also the most used by far), because people use symbols of emotions to add meaning to their text messages that cannot be easily expressed with words.
I loved the addition of "Face With Raised Eyebrow", "Exploding Head" and "Face With Monocle" in Emoji 10.0 because they add value to texting!

Conclusion

So what do I mean when I say "future Emoji selection should be more restrictive"?

1) There should be a large push on actual smileys.
2) The "Selection Factors for Exclusion" should be taken a lot more seriously, especially for overly specific submissions. They are pretty comprehensive but apparently just poorly enforced.
3) Very specific submissions should encourage the addition of broader symbols that would still include the initial submission.
4) Additions that through their mere existence would exclude symbols that are not (currently) present should be discussed (e.g. cultural iconography).
5) Additions that are made for the sake of inclusion, which of course is generally a good thing, should especially be checked against the four statements above, because the mindless addition of inclusive emojis can lead to exclusion.

Final thoughts

I really love emoji, and I think it's wonderful that everyone at Unicode strives to make it more inclusive and progressive. But to me, it feels like we have symbolism for the sake of symbolism: on the one hand Emoji as a symbol of inclusion and progress in the world, on the other the idea that Emoji still have an actual symbolic meaning. Because to hear people say "It's so nice to finally see the introduction of 'Person in Steamy Room'" and then observe how they don't use it can't be a good direction for future Emoji releases.

Julian ??

From unicode at unicode.org Fri Aug 10 18:14:00 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 01:14:00 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

> Extreme Specificity
> For instance, one thing that struck me as odd in previous releases was the tendency to extreme specificity.
> I always thought of Emoji as symbols and not as concrete images.

Unicode is chock-full of useless, redundant emoji that nobody ever types, all of which were originally justified because of "high usage expectations". We're now stuck with such timeless emoji classics as Water Polo, Raised Back of Hand, Place of Worship, Petri Dish, and Mother Christmas while actual, substantial problems in the emoji standard have remained unaddressed for several years, because the ESC is absolutely bloody clueless about literally everything they do.

The trouble is not that the rules for emoji submissions are fundamentally flawed; the trouble is that said rules are completely ignored whenever the ESC feels like it. A squirrel is too similar to a chipmunk, but a softball must be disunified from a baseball. A donkey is too similar to a horse, but we really needed that lab coat emoji because the regular coat that was added just one year prior just doesn't cut it. There is no system, and I highly suspect that there never was one in the first place.

From unicode at unicode.org Fri Aug 10 20:25:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 10 Aug 2018 17:25:46 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Charlotte Buff wrote,

> A squirrel is too similar to a chipmunk, but a
> softball must be disunified from a baseball.

Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth. I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.
From unicode at unicode.org Fri Aug 10 22:12:31 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Fri, 10 Aug 2018 19:12:31 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

AUSTRALOPITHECINE, it was the all-caps that threw me.

From unicode at unicode.org Sat Aug 11 06:58:02 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 13:58:02 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

[James Kass wrote:]
> Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth. I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.

There is no semantic difference between a softball and a baseball. They are literally the same object, just in slightly different sizes. There isn't a semantic difference between a squirrel and a chipmunk either (mainly because they don't represent anything beyond their own identities, just like the majority of modern emoji inventions), but at the very least they are *different things*. Not to mention that the softball was added, by the ESC's very own admission, for the sole and only purpose of "improving gender representation", and anyone who has heard of my name in the context of Unicode before can tell you what a massive hypocrisy that is. As I said, there is no system.
The ESC only approves emoji submissions if they personally like them, or to make themselves look vaguely more progressive and open-minded than they really are, but not *too* open-minded, you see, because then we would have to put actual, proper thought into the issues we're dealing with. Mark Davis hates me already for rightfully calling out his many shortcomings, so I might as well say it like it is and alienate the rest of the ESC as well. I have no doubt that many ESC members are competent enough for their job; the point is that, collectively, the ESC is not.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sat Aug 11 07:21:13 2018
From: unicode at unicode.org (Julian Bradfield via Unicode)
Date: Sat, 11 Aug 2018 13:21:13 +0100 (BST)
Subject: Thoughts on Emoji Selection Process
References:
Message-ID:

On 2018-08-11, Charlotte Buff via Unicode wrote:
> There is no semantic difference between a softball and a baseball. They are literally the same object, just in slightly different sizes. There isn't a semantic difference between a squirrel and a chipmunk either (mainly because they don't represent anything beyond their own identities just like the majority of modern emoji inventions), but at the very least they are *different things*.

I think you don't understand the meaning of "semantic", "literally", or "the same". Which is a pity, because I'm all in sympathy with your general attitude to emoji and Unicode. I'm not just being pedantic - I can't even work out what you're attempting to say in this paragraph.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
From unicode at unicode.org Sat Aug 11 08:56:37 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Sat, 11 Aug 2018 15:56:37 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

James Kass wrote,

> Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth.

Yeah, they should deal with those questions, but right now I imagine they would just add all of those. Just that the Australopithecine would have all gender, hair-style and ball-holding modifiers. And this is polluting a system where things that once were added can't be removed.

It's all so contradictory: on the one hand they encourage the use of sticker packs as a long-term solution, on the other they say they want to add 60 new emoji per year. Also, as I just recently discovered, Hair Components were added in 11.0, which will just lead to an absurd amount of complexity. And what is achieved with that? Which gap does this fill? Who will use such specific Emojis frequently (B. Expected usage level)? And if this is not F. Overly specific or close to an L. Exact Image then I don't know what is! I'm not really sure we can cut them any slack here...

Julian ???

On Sat, Aug 11, 2018 at 3:32 AM James Kass via Unicode wrote:
> Charlotte Buff wrote,
>
>> A squirrel is too similar to a chipmunk, but a
>> softball must be disunified from a baseball.
>
> Let's not be too harsh on the ESC. The set of in-line pictures which some might use to adorn text is open-ended. The ESC has to deal with the really tough questions every day. Like, is there a semantic difference between PICTURE OF SQUIRREL vs. PICTURE OF CHIPMUNK or BASEBALL vs. SOFTBALL or AUSTALOPITHECINE RIDING THREE-LEGGED HORSE vs. AUSTALOPITHECINE WITH MOHAWK RIDING THREE-LEGGED HORSE, and so forth.
> I wouldn't be able to make such difficult decisions without flipping a coin, so I'd doff my hat to the ESC if I wore one.

From unicode at unicode.org Sat Aug 11 12:36:34 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sat, 11 Aug 2018 19:36:34 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

> I think you don't understand the meaning of "semantic", "literally", or "the same". Which is a pity, because I'm all in sympathy with your general attitude to emoji and Unicode.
> I'm not just being pedantic - I can't even work out what you're attempting to say in this paragraph.

A softball is just a slightly bigger baseball. There is no other difference between them. We now have two emoji that mean exactly the same thing: a small ball made from cork or rubber, wrapped in leather with a stitched seam, that is hit with a bat and caught with a glove. And if they had some metaphorical meaning (which I don't think they do), they would also both represent the same concepts, because they are simply minute variations of the same object.

Chipmunks and squirrels are clearly different species, but pretty much all characteristics they have that would be relevant to the average emoji user are identical. They are small, furry rodents living in forests that eat and bury nuts. Any meaning you could assign to a pictograph of a chipmunk in a textual conversation is also shared with squirrels and vice versa.

From unicode at unicode.org Sat Aug 11 16:58:36 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 11 Aug 2018 13:58:36 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Charlotte Buff wrote,

> Mark Davis hates me already for rightfully calling
> out his many shortcomings, so I might as well say it
> like it is and alienate the rest of the ESC as well.

Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.

Conventional character encoding concerns may well be of secondary importance with respect to emoji. The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable. But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.

Just sayin'. ?

From unicode at unicode.org Sat Aug 11 20:45:03 2018
From: unicode at unicode.org (Julian Wels via Unicode)
Date: Sun, 12 Aug 2018 03:45:03 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

I followed up on the name Charlotte Buff in association with Unicode and found many documents already describing what I said in my original mail: multiple times in the document registry, together with tons of other helpful suggestions on how to make Emoji better. However, none of these suggestions have apparently ever been taken seriously enough to cause changes. So I can understand Charlotte Buff's anger for the most part.

James Kass wrote:
> The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable. But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.
I'm not approaching Emoji in the same manner as other character sets in Unicode, but they are still part of an industry-wide encoding standard that should not be misused for marketing gags and should still be handled like a standard, with certain norms and criteria.

Also, there was so much useless stuff added to the Emoji set that it just cannot be explained by "sales and marketing" alone. Charlotte Buff, for example, made an excellent case against the addition of colored squares and circles in 12.0. There also was a suggestion on how to do gender right in emoji, which I think would have been an easy and smart solution without any compromise regarding marketing. I really wonder if no one in the Emoji Subcommittee has these exact thoughts, because this is not just about correct representation; it's about maintaining an encoding standard in a more or less future-proof way! So maybe Emoji encoding should be approached more traditionally given where we are right now.

And I ask you all honestly: is there no solution in sight, other than being ignored when submitting to the document registry?

Julian ??

On Sun, Aug 12, 2018 at 12:06 AM James Kass via Unicode wrote:
> Charlotte Buff wrote,
>
>> Mark Davis hates me already for rightfully calling
>> out his many shortcomings, so I might as well say it
>> like it is and alienate the rest of the ESC as well.
>
> Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.
>
> Conventional character encoding concerns may well be of secondary importance with respect to emoji. The driving force may have more to do with sales and marketing. In this regard, emoji are "special". Hence, if we approach emoji encoding issues in the traditional manner, ESC decisions might appear baffling or unreasonable.
> But if we broaden our horizons and allow that sales and marketing concerns are a factor, we might gain a little clarity and a better understanding.
>
> Just sayin'. ?

From unicode at unicode.org Sun Aug 12 02:27:46 2018
From: unicode at unicode.org (James Kass via Unicode)
Date: Sat, 11 Aug 2018 23:27:46 -0800
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

Julian Wels wrote,

> Also, there was so much useless stuff added to the
> Emoji-Set, that just cannot be explained by "sales
> and marketing" alone. Charlotte Buff, for example,
> made an excellent case against the addition of
> colored squares and circles in 12.0.

Sales & Marketing can explain *anything*. That's what they *do*. Some marketing hotshot comes up with a bunch of cool ideas and they try some of them out. If any of them catch on like wildfire, swell! But if one of them fails, it's not because it was a bum idea in the first place, it's due to market trends.

Taking a couple of Charlotte Buff's generic concerns from earlier in this thread while keeping Sales & Marketing in mind,

>> The ESC only approves emoji submissions if they
>> personally like them, ...

Naturally. How marketable is something one doesn't like?

>> ... or to make themselves look vaguely more
>> progressive and open-minded than they really are ...

Of course. In the advertising world, image is everything.

> ... So maybe Emoji encoding should be approached more
> traditionally given where we are right now.

Unicode's traditional approach has been to encode what is or what was rather than what might be. But the emoji are an evolving set, so they don't fall within that tradition. Any requests for clarification of the evolving set of encoding practices at any stage of the evolution seem like reasonable requests. It's unfortunate if such requests go unanswered.
> And I ask you all honestly: Is there no solution
> in sight, other than being ignored when submitting
> to the document registry?

Committees are somewhat political in nature. There are two proven ways to curry favor with a politician. One is to become a lobbyist, which means finding out what the subject wants and providing it. Everybody likes cash, but we don't call it a "bribe", we call it a "consulting fee". The other way is to become a toady. Since neither of those vocations seems suitable for any of us participating in this thread, perhaps it's time to mend some fences and/or build some bridges.

From unicode at unicode.org Sun Aug 12 06:30:29 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sun, 12 Aug 2018 13:30:29 +0200
Subject: Thoughts on Emoji Selection Process
Message-ID:

[James Kass wrote:]
> Naturally. How marketable is something one doesn't like?

That is the issue. You are supposed to think that the emoji submission process is bureaucratic in nature, when in reality it all hinges on the personal preferences of a handful of unaccountable, largely unknown people. Everything you hear about emoji proposals, from the UTC's own instructions and guidelines on the Unicode homepage to lazy clickbait articles written by shady (and sadly increasingly also not-so-shady) news outlets, is meant to make you believe that "everyone's voice is equal" and that all you need to do to get something you care about implemented in emoji is to write a document addressing a few enumerated issues and mail it to the Consortium, but that is not how it works. The UTC will gladly ignore heaps and heaps of evidence and statistics and whatnot if they feel indifferent towards a proposed emoji, while simultaneously fast-tracking their own pet ideas into the standard without any sort of documentation *cough* Ice Cube *cough*. If you're gonna be evil, at least have the guts to be open about it.
Nobody is forcing you to pretend that there are official procedures still in place.

From unicode at unicode.org Sun Aug 12 06:51:43 2018
From: unicode at unicode.org (Charlotte Buff via Unicode)
Date: Sun, 12 Aug 2018 13:51:43 +0200
Subject: Thoughts on Emoji Selection Process
In-Reply-To:
References:
Message-ID:

[James Kass wrote:]
> Nobody's perfect. We all have our strengths and weaknesses; it's part of the human condition. Although alienating people can bring considerable short-term satisfaction, in the long run building bridges trumps building walls.

I would be inclined to agree with you, if it weren't for the fact that I have been dealing with the ESC for two years now. I used to be nice and diplomatic, back when I was still convinced that these people were genuinely interested in developing a decent product. Back when I still thought that they were actually trying to do good, but just didn't quite know how. Do you want to know what "building bridges" achieved? Bloody nothing. They ignored literally every single word I had written and marched onward regardless.

I am sick of sugarcoating their flaws. They mess up again and again and again, and they refuse to mend or even acknowledge their mistakes. If they can't deal with criticism straight to their faces then they shouldn't be in these positions. People like Andrew West, Michael Everson, Christoph Päper, Eduardo Marín Silva, and even myself sacrifice their time to develop and document detailed solutions to many problems the ESC has created, but they simply don't care. They are too busy churning out these stupid pictographs year after year because that's what gives them publicity. Who cares that 80% of the emoji standard is horribly broken? What could the Emoji Subcommittee possibly do about that?

2018-08-11 23:58 GMT+02:00 James Kass :
> Charlotte Buff wrote,
>
>> Mark Davis hates me already for rightfully calling
out his many shortcomings, so I might as well say it > ? like it is and alienate the rest of the ESC as well. > > Nobody's perfect. We all have our strengths and weaknesses; it's part > of the human condition. Although alienating people can bring > considerable short-term satisfaction, in the long run building bridges > trumps building walls. > > Conventional character encoding concerns may well be of secondary > importance with respect to emoji. The driving force may have more to > do with sales and marketing. In this regard, emoji are "special". > Hence, if we approach emoji encoding issues in the traditional manner, > ESC decisions might appear baffling or unreasonable. But if we > broaden our horizons and allow that sales and marketing concerns are a > factor, we might gain a little clarity and a better understanding. > > Just sayin'. ? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 12 17:03:27 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 12 Aug 2018 14:03:27 -0800 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: Charlotte Buff wrote, > Do you want to know what "building bridges" > achieved? Bloody nothing. They ignored literally > every single word I had written and marched > onward regardless. So much for rhetoric, eh? Sorry if I've underestimated the scope of the dilemma. It's best to understand both sides of an issue. When one faction posts criticisms and questions, and the other side fails to respond, it leaves everyone with a one-sided viewpoint. As speculation, the ESC members probably have other responsibilities besides grinding out pictographs. (Day jobs, real world, etc.) Emoji popularity combined with click-bait news articles urging every yob in town to submit documents suggests that the ESC may simply be overwhelmed with such documents, some of which were probably written in crayon. 
As the emoji character set evolves, so do procedures. It's possible that people are simply scrambling around trying to do too much at once. I am the most unlikely apologist for the ESC imaginable; I'm just trying to be fair. Alienating the very people who are the only ones competent to respond to your questions and concerns won't get questions answered or concerns addressed. Hoping someone with the answers will respond to this thread, but not holding my breath while waiting. In this particular case, a lack of response might be more informative than an actual one. From unicode at unicode.org Mon Aug 13 06:39:50 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 13 Aug 2018 03:39:50 -0800 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: Charlotte Buff wrote, > ... I have been dealing with the ESC for two > years now. Two years passes in the blink of an eye. Elsewhere you mention several names including Andrew West and Michael Everson. Both of them have been working with, against, or around various committees and members for about two decades now. Infinite patience is essential; if one doesn't have it, it has to be feigned. > I used to be nice ... That may have been a tactical error. This is the 21st century and one has to be rude just to get noticed. Besides, once people find out you are a nice person, they have a tendency to step all over you. > ... Back when I still thought that they were > actually trying to do good, but just didn't > quite know how. Most people don't perceive themselves as villains. > I am sick of sugarcoating their flaws. They probably didn't like it anyway. Sugarcoating flaws calls attention to them and attracts flies. It's been said that a friend is someone who likes us in spite of our many faults. The friendly thing to do would be to overlook flaws, focus on strengths, and find some kind of common ground. (If any.) 
> If they can't deal with criticism straight to their > faces then they shouldn't be in these positions. Agreed, as long as it's constructive criticism, tolerably polite, offering viable alternatives/solutions, and provides for them to "save face" (because where image is everything, looking good is considered important). Some people deal with criticism by shunning the critic, preventing any recurrence. Some Sales & Marketing people can be *so* hypersensitive. > Who cares that 80% of the emoji standard is > horribly broken? What could the Emoji Subcommittee > possibly do about that? Well, they could break the other 20%. Heh heh. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 13 02:44:02 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 13 Aug 2018 08:44:02 +0100 (BST) Subject: (offline humour) Re: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: <8616567.3155.1534146242186.JavaMail.defaultUser@defaultHost> James Kass wrote: > ... the ESC may simply be overwhelmed with such documents, some of which were probably written in crayon. Maybe a document written in crayon is so that the author can wax lyrical: is it fair to chalk the author off? :-) https://en.oxforddictionaries.com/definition/crayon William Overington Monday 13 August 2018 From unicode at unicode.org Tue Aug 14 06:16:35 2018 From: unicode at unicode.org (Andre Schappo via Unicode) Date: Tue, 14 Aug 2018 11:16:35 +0000 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: On 10 Aug 2018, at 21:33, Julian Wels via Unicode > wrote: Cultural Iconography Another thing that is worrisome is the proposed addition of a traditional Indian piece of clothing in 12.0. This is extremely specific to one culture, and I'm not sure if we want to open the gate for: "Which culture is included in Unicode and which is not?". Maybe we want that! Maybe we don't. 
But I think there should at least be a discussion about additions that carry such consequences. I know that there are tons of Chinese symbols in there already, but even the selection-factors on the Unicode-website state that the fact that there is a lot of stuff in there from former versions should not be a basis of justification for future additions. For instance, the Tokyo Tower-Emoji does not justify the Eiffel Tower-Emoji. [link] Unicode is an essential building block for software internationalisation. I consider including cultural icon emoji in Unicode to be an essential part of internationalisation. The more cultures that are included the better. Actually I think a specific aim of ESC could be, in the long term, to encompass all cultures. ESC could encourage cultural icon emoji submissions. André Schappo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 14 09:55:08 2018 From: unicode at unicode.org (Julian Wels via Unicode) Date: Tue, 14 Aug 2018 16:55:08 +0200 Subject: Thoughts on Emoji Selection Process In-Reply-To: References: Message-ID: I mean I'd love to have this discussion, and maybe you could even turn me around to your side of this argument if the current way of Emoji development wouldn't be such a hot mess. My initial mail essentially said: "Stop adding random stuff, until you find a way to streamline your process." (in an abstract sense). So I'm just generally against uncontrolled development. If the ESC were to say: "We committed ourselves to add around 10 Emojis each year, representing another culture." then this would be amazing. But it appears that right now they'd say: "We committed ourselves to embrace different cultures with Emoji.", then add 50 western Emoji, 36 Indian, and 12 African and then never speak of it again. So for me, it's all about controlled development that leaves us with a clean and well-organized set of Emojis. 
Right now it shows that we can't have that due to a lack of communication and ambition on the part of the ESC. Julian On Tue, Aug 14, 2018 at 1:26 PM Andre Schappo via Unicode < unicode at unicode.org> wrote: > > > On 10 Aug 2018, at 21:33, Julian Wels via Unicode > wrote: > > Cultural Iconography > Another thing that is worrisome is the proposed addition of a > traditional Indian piece of clothing in 12.0. This is extremely specific to > one culture, and I'm not sure if we want to open the gate for: "Which > culture is included in Unicode and which is not?". Maybe we want that! > Maybe we don't. But I think there should at least be a discussion about > additions that carry such consequences. > I know that there are tons of Chinese symbols in there already, but even > the selection-factors on the Unicode-website state that the fact that there > is a lot of stuff in there from former versions should not be a basis of > justification for future additions. For instance, the Tokyo Tower-Emoji > does not justify the Eiffel Tower-Emoji. [link] > > > > Unicode is an essential building block for software internationalisation. > I consider including cultural icon emoji in Unicode to be an essential part > of internationalisation. The more cultures that are included the better. > Actually I think a specific aim of ESC could be, in the long term, to > encompass all cultures. ESC could encourage cultural icon emoji submissions. > > André Schappo > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Aug 15 03:32:41 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 15 Aug 2018 00:32:41 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) Message-ID: Suppose there's someone who has been working with the ESC for a while and whose frustration level has passed the boiling point. Let's call this person "X". 
X has become so angry that X is distilling recent experiences into an exposé article for submission to the media. The media outlet, if responsible journalists, would fact-check the article. Would the fact-checking find proof, or would it be determined that it is simply a, uh, dissing contest between two or more personalities? (If the latter, one of the tabloids might buy the article. They just *love* dissing contests.) The original thread includes some sweeping allegations concerning competence and integrity, but offers no specific examples. Even though many people do it daily, it's best not to make judgments without evidence. A list member kindly sent me links to a pair of documents. L2/17-147 http://www.unicode.org/L2/L2017/17147-emoji-subcommittee.pdf L2/17-192 http://www.unicode.org/L2/L2017/17192-response-cmts.pdf The first one, L2/17-147 (by West, Buff, and Päper), is a request for more ESC transparency. It raises a couple of legitimate concerns: (1) requests complete public documentation of all incoming submissions, and (2) requests a public roster of ESC members. The requests seem reasonable. The second one, L2/17-192 (by Davis and Edberg), rejects the first one. A superficial analysis might persuade someone that the ESC does things the way they like to do things and are going to continue to do things, so neener-neener. But if we examine the reasoning behind L2/17-192, it does make some sense. For (1), there are too many submissions and the vast majority of them are D.O.A. Why spend resources documenting non-starters? L2/17-192 goes on to explain the way viable submissions become public. For (2), which the ESC rejected first, the underlying reasoning is clearly stated. The roster is in a perpetual state of flux, there is no fixed membership, there is no membership list. Putting aside any obvious advantages anonymity offers over accountability, any committee with a constantly shifting membership is unstable by definition. 
Why would any committee want to make its instability a matter of public record? Putting aside any snide humor, it does appear that the ESC responds to requests/suggestions and is willing to work with submitters. (Based on one example, at least.) On one hand, there's a group who is interested in exploiting the emoji ranges to advance corporate commercial concerns. On another hand, there are emoji enthusiasts who want the sterling reputation of excellence Unicode has earned to continue far into the future. There's got to be some common ground here. Why not shake those hands, find that common ground, and explore it together? And have some fun while doing it. Aren't the emoji supposed to be fun? From unicode at unicode.org Fri Aug 17 04:35:54 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 17 Aug 2018 10:35:54 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> Message-ID: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> May I mention please a situation that may be of interest as indicative of some of the issues with the present system. In the discussion after the end of the lecture "Unicode Emoji: How do we standardize that je ne sais quoi?" at the Internationalization & Unicode Conference 39 in October 2015, a gentleman in the audience raised the possibility of emoji for 'I' and for 'You'. https://www.youtube.com/watch?v=9ldSVbXbjl4 I was not there but I have viewed the video several times. I decided that trying to design emoji for 'I' and for 'You' seemed interesting so I decided to have a go at designing some. 
However pictures of people with arrows seemed to be ambiguous in meaning and also they seemed to need to be too detailed for rendering in mobile telephone messages and in many situations in web pages and emails generally. So eventually I decided that abstract designs would be a good solution to the problem. I designed abstract emoji in two colours, yet such that a monochrome fallback display would still be recognisable. There is a web page in my webspace that has more information about this, including the designs, including links back to several posts in the archives of this mailing list, some by me, some by other people. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm At one stage I sent a copy of the PDF (Portable Document Format) document to 'docsubmit' and was informed that it was not in the correct format for a submission. A problem that I find with 'docsubmit' is that there is no indication of who is running it; replies are just from 'docsubmit' with no name of a human being. I had sent the document with the idea that if it were included in the Document Register then discussion could take place that could lead to progress. So I considered trying to prepare a document compliant with the submission rules, for emoji for 'I' and for 'You' as mentioned in the discussion period in the conference session, together with perhaps a few other personal pronouns as well. I have not produced such a document yet for a number of reasons. It takes time to produce such a document. I am happy to spend that time producing a document but I am somewhat deterred by the possibility that the document might just be discarded and never get anywhere with no explanation. It is not clear whether abstract emoji would be accepted. If I remember correctly at one stage the ESC (Emoji Subcommittee) opined that abstract emoji were possible, though possibly the UTC (Unicode Technical Committee) was of the opposite opinion, and there seems to be no clarity on the matter at present. 
All proposals for new emoji seem to now require those blue and red charts from Google Trends. I have never understood why these are needed and what they are supposed to prove. That being for any emoji proposal. When it comes to a proposal for emoji for 'I' and for 'You', I cannot decide what Google Trends chart would be of any relevance to support an emoji proposal. I opine that emoji for 'I' and for 'You' are a good idea. I am happy to try to submit a proposal if there is a reasonable prospect of success, but the requirement for Google Trends pictures seems to be a block to me being able to do that at present as I just do not understand it all. If anyone out there would like to help get emoji for 'I' and for 'You' and maybe a few other personal pronouns encoded, whether using my designs or other designs, whether abstract designs or not abstract designs, whether with me being involved or otherwise, then that would be welcome. William Overington Thursday 16 August 2018 From unicode at unicode.org Sat Aug 18 03:07:17 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 00:07:17 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, > All proposals for new emoji seem to now require > those blue and red charts from Google Trends. > > I have never understood why these are needed and > what they are supposed to prove. If an emoji being proposed represents a concept which is popular, its potential popularity *as an emoji* can perhaps be estimated by seeing how many people are making web searches for the concept. 
The red and blue chart should compare the proposed emoji to a "reference emoji", but I don't really understand why, either. No doubt there's some kind of reasoning behind this, but to me it's like comparing apples to oranges. > When it comes to a proposal for emoji for 'I' > and for 'You', I cannot decide what Google Trends > chart would be of any relevance to support an emoji > proposal. https://trends.google.com/trends/explore?q=personal%20pronoun%20emoji&geo=US Well, *those* keywords don't look very promising. > I decided that trying to design emoji for 'I' and > for 'You' seemed interesting so I decided to have > a go at designing some. > > However pictures of people with arrows seemed to > be ambiguous in meaning ... > > So eventually I decided that abstract designs > would be a good solution to the problem. Hand gestures such as an overview of a finger pointing away (for "YOU") and a thumbs-up with the thumb pointing inward at about a hundred degree angle (for "I") might work. But since body language and hand gestures differ between cultures, those gestures might only be recognizable to westerners. For example, I've seen Japanese people refer to themselves with a hand gesture which is their own pointing finger touching their own nose. And the thumbs up gesture which means "everything is jake" or "everything is hunky-dory" in my locale means something vastly different south of the international border between the U.S. and Mexico. Even an emoji pair representing Narcissus gazing fondly upon reflection (with the upper figure emphasized for "I" and the lower for "YOU") might be western-centric. Or too subtle or cerebral. But an abstract design for those pronouns would remain abstract unless people *like* it and use it. > I am happy to spend that time producing a document > but I am somewhat deterred by the possibility that > the document might just be discarded and never get > anywhere with no explanation. 
Yet your initial submission was rejected with an explanation. > It is not clear whether abstract emoji would be > accepted. If I remember correctly at one stage > ESC (Emoji Subcommittee) opined that abstract > emoji were possible, ... ... but unlikely? Perhaps if a corporate sponsor designed a set of abstract emoji and got some agreement from some of the other corporate sponsors, *those* abstract emoji would be pushed towards the character encoding stage. But such abstractions proposed by a John Doe or Job Lowe strike me as unlikely candidates. Quoting from: http://www.unicode.org/emoji/proposals.html "... Simple words ("NEW") or abstract symbols (???) would not qualify as emoji." Also, in the section "Selection Factors for Exclusion", the part headed "L. Exact Images" would seem to rule out any such abstractions. From unicode at unicode.org Sat Aug 18 09:13:09 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Sat, 18 Aug 2018 15:13:09 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> James Kass wrote: > Quoting from: > http://www.unicode.org/emoji/proposals.html > "... Simple words ("NEW") or abstract symbols (???) would not qualify as emoji." Well, that is quite clear. In order for abstract emoji to become encoded, that rule would need to be either removed, or made waivable in some instances at the discretion of the Unicode Technical Committee. > Also, in the section "Selection Factors for Exclusion", the part headed "L. Exact Images" would seem to rule out any such abstractions. Hmmm, maybe not. What is stated in the document is as follows. 
> Emoji are by their nature subject to variation in order to have consistent graphic designs for a full set. Precise images (such as from a specific visual meme) are not appropriate as emoji; images such as GIFs or PNGs should be used in such cases, instead of emoji characters. The designs that I have produced for abstract emoji of personal pronouns could be drawn, whilst each retaining enough of their shape information to still convey the intended meaning, in, say, the style of the Comic Sans font. So the designs that I produced are not necessarily subject to that ruling; yet I do need to add that the designs that I produced are somewhat constrained against as much variation as is possible for many emoji. Yet the designs that I produced have about as much flexibility as to glyph design as do letters of the English alphabet. > Once an emoji is released, it is typically used for a wide variety of items that have similar visual appearance. Well, if some people use the same code point for a variety of things then that is a matter for them! One can only do so much in trying to convey meaning without distortion of meaning. Referring to the designs in the following document, http://www.users.globalnet.co.uk/~ngo/Some_designs_for_emoji_of_personal_pronouns.pdf some readers may be interested to know how I arrived at the general structure of those designs. It all goes back to when I was new to learning French. The present tense of the verb être ("to be") was set out something like in the text diagram below, though the underscore characters are added here by me so as to try to produce a fairly reasonable display in a text diagram that may become displayed in a variety of fonts. Hopefully the diagram will look good with a monospaced font. 
je suis ______ nous sommes
tu es ________ vous êtes
il est _______ ils sont
elle est _____ elles sont

So horizontally there is singular and plural, and vertically there is first person, second person and then third person on each of two rows for two genders. So my designs are based on that layout. One square for singular, two squares horizontally side by side for plural. The location of the square or squares is then upper left corner for first person, middle left not in any corner for second person, and lower left corner for third person. Then there are a few additional lines for various third person personal pronouns so as to distinguish male from female and from both genders together, together with a slightly anomalous location of a square at lower middle not in any corner for the personal pronoun 'one'. William Overington Saturday 18 August 2018 From unicode at unicode.org Sat Aug 18 20:55:42 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 17:55:42 -0800 Subject: Tales from the Archives Message-ID: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html Back in 2000, William Overington asked about ligation for Latin and mentioned something about preserving older texts digitally. John Cowan replied with some information about ZWJ/ZWNJ and I offered a link to a Unicode-based font, Junicode, which had (at that time) coverage for archaic letters already encoded, and which used the PUA for unencoded ligatures. At that time, OpenType support was primitive and not generally available. If I'm not mistaken, the word "ligation" for typographic ligature forming had not yet been coined. IIRC John Hudson borrowed the medical word some time after that particular Unicode e-mail thread. (One poster in that thread called it "ligaturing".) Peter Constable replied and explained clearly how ligation was expected to work for Latin in Unicode. John Cowan posted again and augmented the information which Peter Constable had provided. 
The information from Peter and John was instructional and helpful and furthered the education of at least one neophyte. Back then, display issues were on everyone's mind. Many questions about display issues were posted to this list. Unicode provided some novel methods of encoding complex scripts, such as for Indic, but those methods didn't actually work anywhere in the real world, so users stuck to the "ASCII-hack" fonts that actually did work. When questions about display issues and other technical aspects of Unicode were posted, experts from everywhere quickly responded with helpful pointers and explanations. Eighteen years pass, display issues have mostly gone away, nearly everything works "out-of-the-box", and list traffic has dropped dramatically. Today's questions are usually either highly technical or emoji-related. Recent threads related to emoji included some questions and issues which remain unanswered in spite of the fact that there are list members who know the answers. This gives the impression that the Unicode public list has become passé. That's almost as sad as looking down the archive posts, seeing the names of the posters, and remembering colleagues who no longer post. So I'm wondering what changed, but I don't expect an answer. 
From unicode at unicode.org Sun Aug 19 01:37:44 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sat, 18 Aug 2018 22:37:44 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, > The designs that I have produced for abstract emoji of > personal pronouns could be drawn, whilst each retaining > enough of their shape information to still convey the > intended meaning, in, say, the style of the Comic Sans > font. So the designs that I produced are not necessarily > subject to that ruling; yet I do need to add that the > designs that I produced are somewhat constrained against > as much variation as is possible for many emoji. Yet the > designs that I produced have about as much flexibility as > to glyph design as do letters of the English alphabet. Exactly, except for the part about 'not necessarily subject to that ruling'. Quoting from, https://www.thoughtco.com/what-is-alphabet-1689080 ... which is quoting from Mitchell Stephens, The Rise of the Image, the Fall of the Word. Oxford University Press, 1998 ... "In about 1500 B.C., the world's first alphabet appeared among the Semites in Canaan. It featured a limited number of abstract symbols (at one point thirty-two, later reduced to twenty-two) out of which most of the sounds of speech could be represented. The Old Testament was written in a version of this alphabet. ..." (Of course, nobody called it "The Old Testament" back then.) Do you consider alphabetic letters to be anything other than abstract symbols? 
You've devised a set of abstract symbols to depict personal pronouns based on typical verb conjugation diagrams. It's my opinion that such symbols aren't emoji candidates, but I am not an emoji expert. From unicode at unicode.org Sun Aug 19 04:20:56 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Sun, 19 Aug 2018 01:20:56 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> <13396503.16197.1534601589320.JavaMail.defaultUser@defaultHost> Message-ID: My apologies for my last post. I realize now that William Overington was referring to "exact images" rather than "abstract symbols" exclusions. My opinion stands, though, FWIW. From unicode at unicode.org Sun Aug 19 09:25:47 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Sun, 19 Aug 2018 16:25:47 +0200 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <20180819162547.39117490@spixxi> William Overington wrote: > > I decided that trying to design emoji for 'I' and for 'You' seemed > interesting so I decided to have a go at designing some. > > However pictures of people with arrows seemed to be ambiguous in > meaning and also they seemed to need to be too detailed for rendering > in mobile telephone messages and in many situations in web pages and > emails generally. So eventually I decided that abstract designs would > be a good solution to the problem. 
I also played with a similar idea, which requires a new GSUB LookupType, let's call it 9: reader-dependent substitution. The idea is that the reader of the text will see a different glyph depending on whether he/she is the author of the text. For example, if you use the codepoint for ME, all other readers see the glyph for YOU and vice versa. This is for example usable in instant messaging and social networking services. In the attachment you find some ideas for the following emoji:

IDEOGRAM FOR ME / IDEOGRAM FOR YOU
IDEOGRAM FOR TWO OF US / IDEOGRAM FOR YOU TWO
IDEOGRAM FOR WE ALL / IDEOGRAM FOR YOU ALL
IDEOGRAM FOR ME AND ANOTHER PERSON / IDEOGRAM FOR YOU AND ANOTHER PERSON
IDEOGRAM FOR ME AND MULTIPLE OTHER PERSONS / IDEOGRAM FOR YOU AND MULTIPLE OTHER PERSONS
IDEOGRAM FOR YOU AND ME (the counterpart has no own codepoint, but is mirrored, as you may arrange other emoji to the left or right)

The following emoji may look equal independent of the reader:

IDEOGRAM FOR ANOTHER PERSON
IDEOGRAM FOR TWO OTHER PERSONS
IDEOGRAM FOR MULTIPLE OTHER PERSONS

The rendering engine requires a flag indicating whether the user is the author or not. I think it would be possible to implement. What about this idea? Regards, Marius Spix -------------- next part -------------- A non-text attachment was scrubbed... Name: youme.png Type: image/png Size: 3035 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 488 bytes Desc: Digital signature from OpenPGP URL: From unicode at unicode.org Sun Aug 19 10:01:28 2018 From: unicode at unicode.org (Alan Wood via Unicode) Date: Sun, 19 Aug 2018 15:01:28 +0000 (UTC) Subject: Tales from the Archives In-Reply-To: References: Message-ID: <2096117.2386243.1534690888302@mail.yahoo.com> James I think you have answered your own question: nearly everything works "out-of-the-box". Unicode is just there, and most computer users have probably never heard of it. 
I routinely produce web pages with English, French, Russian and Chinese text and a few symbols, and don't even think about whether other people can see everything displayed properly. Long ago, the response to the question "Why can't I see character x" was often to install a copy of the Code2000 font and send the fee ($10?) to James Kass by airmail. These days, Windows 10 can display all of the major living languages (and I expect Macs can too, but I can't afford one now that I have retired). Some of the frequent posters have probably passed away, while others (like me) have got older, and slowed down and/or developed new interests. Best regards Alan Wood http://www.alanwood.net (Unicode, special characters, pesticide names) On Sunday, 19 August 2018, 03:05:41 GMT+1, James Kass via Unicode wrote: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html Back in 2000, William Overington asked about ligation for Latin and mentioned something about preserving older texts digitally. John Cowan replied with some information about ZWJ/ZWNJ and I offered a link to a Unicode-based font, Junicode, which had (at that time) coverage for archaic letters already encoded, and which used the PUA for unencoded ligatures. At that time, OpenType support was primitive and not generally available. If I'm not mistaken, the word "ligation" for typographic ligature forming had not yet been coined. IIRC John Hudson borrowed the medical word some time after that particular Unicode e-mail thread. (One poster in that thread called it "ligaturing".) Peter Constable replied and explained clearly how ligation was expected to work for Latin in Unicode. John Cowan posted again and augmented the information which Peter Constable had provided. The information from Peter and John was instructional and helpful and furthered the education of at least one neophyte. Back then, display issues were on everyone's mind. Many questions about display issues were posted to this list.
Unicode provided some novel methods of encoding complex scripts, such as for Indic, but those methods didn't actually work anywhere in the real world, so users stuck to the "ASCII-hack" fonts that actually did work. When questions about display issues and other technical aspects of Unicode were posted, experts from everywhere quickly responded with helpful pointers and explanations. Eighteen years pass, display issues have mostly gone away, nearly everything works "out-of-the-box", and list traffic has dropped dramatically. Today's questions are usually either highly technical or emoji-related. Recent threads related to emoji included some questions and issues which remain unanswered in spite of the fact that there are list members who know the answers. This gives the impression that the Unicode public list has become passé. That's almost as sad as looking down the archive posts, seeing the names of the posters, and remembering colleagues who no longer post. So I'm wondering what changed, but I don't expect an answer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 19 14:03:19 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Sun, 19 Aug 2018 21:03:19 +0200 Subject: Tales from the Archives In-Reply-To: <2096117.2386243.1534690888302@mail.yahoo.com> References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: You and Alan both raise good issues and make good points. I'd mention a couple of others. When we started Unicode, there were not a lot of alternatives to a general-purpose discussion email list for internationalization, but now there are many. Often the technical discussions are moved to more specific forums. There are interesting discussions on the identification of Unicode spoofing (because of look-alikes) on a variety of forums dealing with security, for example.
I suspect many of the font rendering issues have widespread solutions now (as Alan notes) and that discussions of remaining issues have shifted to forums on OpenType. There are some very intense discussions of Mongolian model issues, but those also tend to be handled in different venues. Work on ICU / CLDR also tends to take place in many cases in the comments on particular tickets, rather than in email lists. The work of the consortium has also broadened significantly beyond encoding and issues closely related to encoding. Here's a slide to illustrate that. (The first 24 slides in the deck are to give people some context and perspective on what the Unicode Consortium does before focusing on a narrower issue.) https://docs.google.com/presentation/d/1QAyfwAn_0SZJ1yd0WiQgoJdG7djzDiq2Isb254ymDZc/edit#slide=id.g38b1fcd632_0_166 Mark On Sun, Aug 19, 2018 at 5:06 PM Alan Wood via Unicode wrote: > James > > I think you have answered your own question: nearly everything works > "out-of-the-box". > > Unicode is just there, and most computer users have probably never heard > of it. I routinely produce web pages with English, French, Russian and > Chinese text and a few symbols, and don't even think whether other people > can see everything displayed properly. > > Long ago, the response to the question "Why can't I see character x" was > often to install a copy of the Code2000 font and send the fee ($10?) to > James Kass by airmail. > > These days, Windows 10 can display all of the major living languages (and > I expect Macs can too, but I can't afford one now that I have retired). > > Some of the frequent posters have probably passed away, while others (like > me) have got older, and slowed down and/or developed new interests.
> > Best regards > > Alan Wood > http://www.alanwood.net (Unicode, special characters, pesticide names) > > > On Sunday, 19 August 2018, 03:05:41 GMT+1, James Kass via Unicode < > unicode at unicode.org> wrote: > > > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html > > Back in 2000, William Overington asked about ligation for Latin and > mentioned something about preserving older texts digitally. John > Cowan replied with some information about ZWJ/ZWNJ and I offered a > link to a Unicode-based font, Junicode, which had (at that time) > coverage for archaic letters already encoded, and which used the PUA > for unencoded ligatures. > > At that time, OpenType support was primitive and not generally > available. If I'm not mistaken, the word "ligation" for typographic > ligature forming had not yet been coined. IIRC John Hudson borrowed > the medical word some time after that particular Unicode e-mail > thread. (One poster in that thread called it "ligaturing".) > > Peter Constable replied and explained clearly how ligation was > expected to work for Latin in Unicode. John Cowan posted again and > augmented the information which Peter Constable had provided. The > information from Peter and John was instructional and helpful and > furthered the education of at least one neophyte. > > Back then, display issues were on everyone's mind. Many questions > about display issues were posted to this list. Unicode provided some > novel methods of encoding complex scripts, such as for Indic, but > those methods didn't actually work anywhere in the real world, so > users stuck to the "ASCII-hack" fonts that actually did work. > > When questions about display issues and other technical aspects of > Unicode were posted, experts from everywhere quickly responded with > helpful pointers and explanations. > > Eighteen years pass, display issues have mostly gone away, nearly > everything works "out-of-the-box", and list traffic has dropped > dramatically. 
Today's questions are usually either highly technical > or emoji-related. > > Recent threads related to emoji included some questions and issues > which remain unanswered in spite of the fact that there are list > members who know the answers. > > This gives the impression that the Unicode public list has become > passé. That's almost as sad as looking down the archive posts, seeing > the names of the posters, and remembering colleagues who no longer > post. > > So I'm wondering what changed, but I don't expect an answer. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Sun Aug 19 17:41:55 2018 From: unicode at unicode.org (Leo Broukhis via Unicode) Date: Sun, 19 Aug 2018 15:41:55 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On Fri, Aug 17, 2018 at 2:35 AM, William_J_G Overington via Unicode < unicode at unicode.org> wrote: > > I decided that trying to design emoji for 'I' and for 'You' seemed > interesting so I decided to have a go at designing some. > Why don't we just encode Blissymbolics, where pronouns are already expressible as abstract symbols, and emojify them? Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 08:08:46 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Mon, 20 Aug 2018 15:08:46 +0200 Subject: Tales from the Archives In-Reply-To: References: Message-ID: <20180820130846.h7mij%steffen@sdaoden.eu> James Kass via Unicode wrote in : ...
|Eighteen years pass, display issues have mostly gone away, nearly |everything works "out-of-the-box", and list traffic has dropped |dramatically. Today's questions are usually either highly technical |or emoji-related. | |Recent threads related to emoji included some questions and issues |which remain unanswered in spite of the fact that there are list |members who know the answers. | |This gives the impression that the Unicode public list has become |passé. That's almost as sad as looking down the archive posts, seeing |the names of the posters, and remembering colleagues who no longer |post. | |So I'm wondering what changed, but I don't expect an answer. I have the impression that many things which have been posted here some years ago are now only available via some Forums or other browser based services. What is posted here seems to be mostly a duplicate of the blog only. (And the website has its pitfalls too, for example [1] is linked from [2], but does not exist.) [1] http://www.unicode.org/resources/readinglist.html [2] http://www.unicode.org/publications/ --steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Aug 20 09:09:21 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 06:09:21 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Leo Broukhis responded to William Overington: >> I decided that trying to design emoji for 'I' and for 'You' seemed >> interesting so I decided to have a go at designing some.
> > Why don't we just encode Blissymbolics, where pronouns are already > expressible as abstract symbols, and emojify them? Emoji enthusiasts seeking to devise a universal pictographic set might be well-advised to build from existing work such as Blissymbolics. I think William Overington's designs are clever, though. Anyone who has ever studied a foreign language (or even their own language) would easily and quickly recognize the intended meanings of the symbols once they understand the derivation. From unicode at unicode.org Mon Aug 20 09:20:59 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Mon, 20 Aug 2018 07:20:59 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 09:30:12 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 06:30:12 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: There are enthusiasts who want to add many cool emoji to the set and who may be frustrated by the process and new character limits. There are other enthusiasts who apparently want to add even more emoji with the idea of producing some kind of universal pictographic system. They'd likely need personal pronouns for something like that and are probably even more frustrated. Then there's the corporate interests who also want to add more cool emoji, as long as they are cool enough, and within limits. 
There's some common ground there, but it's easy to understand that the enthusiasts are stymied by the pace. With a limit of sixty new emoji per year, it would take quite a while before the regular enthusiasts are satisfied and it would take decades to encode any kind of universal pictographic system. What the enthusiasts need is a large block of characters in which to experiment. A place where proposed and pending (or rejected) emoji could be sorted, stored, mapped, documented, and published without any lengthy delays. A range from which such emoji could be transmitted to other enthusiasts as computer plain text, and by prior agreement the recipient could display the emoji as the sender intended. Would two complete planes of Unicode be large enough for that? Deseret and Phaistos, as two examples, were being used in a Unicode environment way before they were added to The Unicode Standard. There were web pages published in Deseret before Deseret was accepted into Unicode, and newer pages weren't using "ASCII-hack" fonts. Enthusiasts could form their own ad-hoc committee and set up some form of registry for pre-Unicode emoji using the Private Use Planes of Unicode. Vendor support wouldn't be likely, at least not right away, but vendor support isn't happening any time soon for most proposed emoji anyway. Since emoji enthusiasts come from all walks of life, there's surely someone who can whip up an app or an add-on. Plus, conventional fonts can be made for the black and white fallback glyphs, and those would get things going while awaiting apps/add-ons. If usage of these new emoji snowballs as much as the enthusiasts expect it to, then the search engine trending might be tuned to the individual PUA character and give an *exact* reading of just how popular any particular proposed emoji is. And *those* figures would tend to support the promotion of specific candidates into regular Unicode if the figures were high enough. 
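For reference, the Private Use ranges in question are fixed by the standard and easy to enumerate; a minimal Python sketch (the helper function name is just for illustration):

```python
# Sketch: classify a code point against Unicode's three Private Use ranges,
# i.e. the BMP Private Use Area plus the two Private Use Planes (15 and 16).

PUA_RANGES = [
    (0xE000, 0xF8FF),      # BMP Private Use Area (6,400 code points)
    (0xF0000, 0xFFFFD),    # Plane 15, Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Plane 16, Supplementary Private Use Area-B
]

def is_private_use(cp: int) -> bool:
    """True if the code point lies in any Private Use range."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

# Total space available for private agreements: 137,468 code points.
total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)
print(total)
print(is_private_use(0x1F600))  # False: U+1F600 is a regularly encoded emoji
```

So the two Private Use Planes alone offer over 131,000 code points, far more than any plausible registry of pre-Unicode emoji would need.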
And if these new emoji turn out to be just a passing fad, no harm done. As The Universal Character Set, it should be able to support the needs of all users. And with the Private Use Areas, it does. As a caveat, some Unicode cognoscenti express disdain for the PUA, so there would be some people who would call a PUA solution either batty or crazy. But such PUA solutions have the advantage of getting things up-and-running and allowing specialists and enthusiasts to exchange exactly the kind of information they want to exchange, such as the anarchy symbol, without needing anybody's approval or permission. Which might explain the disdain. https://en.wikipedia.org/wiki/Private_Use_Areas From unicode at unicode.org Mon Aug 20 13:22:13 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 11:22:13 -0700 Subject: Tales from the Archives In-Reply-To: <20180820130846.h7mij%steffen@sdaoden.eu> References: <20180820130846.h7mij%steffen@sdaoden.eu> Message-ID: <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> Steffen, Are you looking for the Unicode list email archives? https://www.unicode.org/mail-arch/ Those contain list content going back all the way to 1994. --Ken On 8/20/2018 6:08 AM, Steffen Nurpmeso via Unicode wrote: > I have the impression that many things which have been posted here > some years ago are now only available via some Forums or other > browser based services. What is posted here seems to be mostly > a duplicate of the blog only.
From unicode at unicode.org Mon Aug 20 13:47:49 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 20 Aug 2018 11:47:49 -0700 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> James Kass wrote: > As a caveat, some Unicode cognoscenti express disdain for the PUA, so > there would be some people who would call a PUA solution either batty > or crazy. I'm concerned that the constant "health warnings" about avoiding the PUA may have scared everyone away from this primary use case. Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway. Yes, you will have to convert any existing data if the solution ever gets encoded in Unicode. That happened for Deseret and Shavian, and maybe others, and the sky didn't fall. People forget that it was the PUA in Shift-JIS, by Japanese mobile providers, that provided the platform for emoji to take off to such an extent that... well, we know the rest. If private-use is good enough for a legacy encoding, it ought to be good enough for Unicode. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Aug 20 14:12:42 2018 From: unicode at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via Unicode) Date: Mon, 20 Aug 2018 21:12:42 +0200 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> Message-ID: > ... some people who would call a PUA solution either batty > or crazy. I don't think it is either batty or crazy. 
People can certainly use the PUA to interchange text (assuming that they have downloaded fonts and keyboards or some other input method beforehand), and it can definitely serve as a proof of concept. Plain symbols, with no interactions between them (like changing shape with complex scripts), no combining/non-spacing marks, no case mappings, and so on, are the best possible case for PUA. The only caution I would give is that people shouldn't expect general purpose software to do anything with PUA text that depends on character properties. Mark On Mon, Aug 20, 2018 at 8:52 PM Doug Ewell via Unicode wrote: > James Kass wrote: > > > As a caveat, some Unicode cognoscenti express disdain for the PUA, so > > there would be some people who would call a PUA solution either batty > > or crazy. > > I'm concerned that the constant "health warnings" about avoiding the PUA > may have scared everyone away from this primary use case. > > Yes, you run the risk of someone else's PUA implementation colliding > with yours. That's why you create a Private Use Agreement, and make sure > it's prominently available to people who want to use your solution. It's > not like there are hundreds of PUA schemes anyway. > > Yes, you will have to convert any existing data if the solution ever > gets encoded in Unicode. That happened for Deseret and Shavian, and > maybe others, and the sky didn't fall. > > People forget that it was the PUA in Shift-JIS, by Japanese mobile > providers, that provided the platform for emoji to take off to such an > extent that... well, we know the rest. If private-use is good enough for > a legacy encoding, it ought to be good enough for Unicode. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Mon Aug 20 14:38:30 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Mon, 20 Aug 2018 12:38:30 -0700 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <20180820123830.665a7a7059d7ee80bb4d670165c8327d.2bbe13127f.wbe@email03.godaddy.com> Mark Davis wrote: > The only caution I would give is that people shouldn't expect general > purpose software to do anything with PUA text that depends on > character properties. Very true, and a good point. People with creative PUA ideas do sometimes expect this to magically work. I have anecdotes, if anyone is interested off-list. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Mon Aug 20 17:22:33 2018 From: unicode at unicode.org (Steffen Nurpmeso via Unicode) Date: Tue, 21 Aug 2018 00:22:33 +0200 Subject: Tales from the Archives In-Reply-To: <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> References: <20180820130846.h7mij%steffen@sdaoden.eu> <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> Message-ID: <20180820222233.iwl8c%steffen@sdaoden.eu> Terrible! Ken Whistler wrote in <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631 at att.net>: |Steffen, | |Are you looking for the Unicode list email archives? | |https://www.unicode.org/mail-arch/ | |Those contain list content going back all the way to 1994. Dear Ken Whistler, no, and yes, having an archive is very good, though your statement from 1997-07-16 ("Plan 9 (a Unix OS) uses UTF-8") i cannot agree with (it feels very different from Unix). It was just that i have read, on one of the mailing-lists i am subscribed to, a citation of a Unicode statement that i had never seen on the Unicode mailing-list. It is very awkward, but i _again_ cannot find what attracted my attention, even with the help of a search engine. I think "faith alone will reveal the true name of shuruq" (1997-07-18).
--steffen | |Der Kragenbaer, The moon bear, |der holt sich munter he cheerfully and one by one |einen nach dem anderen runter wa.ks himself off |(By Robert Gernhardt) From unicode at unicode.org Mon Aug 20 18:23:13 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 16:23:13 -0700 Subject: Tales from the Archives In-Reply-To: <20180820222233.iwl8c%steffen@sdaoden.eu> References: <20180820130846.h7mij%steffen@sdaoden.eu> <12e6ad91-89e4-ec87-85ad-8fc4ab3f6631@att.net> <20180820222233.iwl8c%steffen@sdaoden.eu> Message-ID: Steffen noted: On 8/20/2018 3:22 PM, Steffen Nurpmeso via Unicode wrote: > It was just that i have read on one of the mailing-lists i am > subscribed to a cite of a Unicode statement that i have never read > of anything on the Unicode mailing-list. It is very awkward, but > i_again_ cannot find what attracted my attention, even with the > help of a search machine. I think "faith alone will reveal the > true name of shuruq" (1997-07-18). > > --steffen Fortunately, since I collect everything, this one has not been lost to the mists of history yet. So here you go, another "tale from the archives", aka "every character has a story". --Ken =================================================================== From kenw Thu Sep 18 14:23 PDT 1997 Date: Thu, 18 Sep 1997 14:20:29 -0700 From: kenw (Kenneth Whistler) Message-Id: <9709182120.AA16670 at birdie.sybase.com> To: unicode at unicode.org Subject: War over 'shuruq' narrowly averted Cc: kenw Dateline: Geneva, Thursday, September 18, 1997 The ISOnominalists and the SInominalists met today at the bargaining table in their long-running dispute over whether the correct name of U+05BC should be: HEBREW POINT DAGESH OR MAPIQ (shuruq) or HEBREW POINT DAGESH OR MAPIQ OR SHURUQ After considerable posturing and threats by both sides, opposing camps reluctantly agreed that a compromise solution was preferable to open flamewar. 
Unnamed sources state that the new name to be revealed in a press conference this evening is: HEBREW POINT DAGESH OR MAPIQ (or shuruq) Both sides have also now agreed to focus their attention jointly at countering the antinomianist camp, which claims that no names can be imposed by human moral strictures, and that faith alone will reveal the true name of shuruq. ============================================================= From unicode at unicode.org Mon Aug 20 18:49:45 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 20 Aug 2018 19:49:45 -0400 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 08/20/2018 10:20 AM, Asmus Freytag via Unicode wrote: > On 8/20/2018 7:09 AM, James Kass via Unicode wrote: >> Leo Broukhis responded to William Overington: >> >>>> I decided that trying to design emoji for 'I' and for 'You' seemed >>>> interesting so I decided to have a go at designing some. >>> Why don't we just encode Blissymbolics, where pronouns are already >>> expressible as abstract symbols, and emojify them? >> Emoji enthusiasts seeking to devise a universal pictographic set might >> be well-advised to build from existing work such as Blissymbolics. >> >> I think William Overington's designs are clever, though. Anyone who >> has ever studied a foreign language (or even their own language) would >> easily and quickly recognize the intended meanings of the symbols once >> they understand the derivation. >> > What about languages that don't have or don't use personal pronouns. > Their speakers might find their use odd or awkward. 
> The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see. > Most of the emoji are heavily dependent on a presumed culture anyway. The smiley-faces maybe could be argued to be cross-cultural (facial expressions are the same for all people; well, mostly), though even then the styling is cultural. But a lot of the rest are culture-dependent, and that's fine and how it should be, IMO. That said, I think William Overington's designs are generally opaque and incomprehensible. James Kass says, "Anyone who has ever studied a foreign language (or even their own language) would easily and quickly recognize the intended meanings of the symbols *once they understand the derivation*." (emphasis added). Well, yeah, once you tell me what something means, I know what it means! The point of emoji is that they already make some sort of "obvious" sense; admittedly, to those who are in the covered culture. (You can't say the same would be true of pronoun emoji for linguists, because no linguist would ever look at those symbols and think, "Oh right! Pronouns!" Yes, they'll make sense *once explained* and once you're told they're pronouns, but that's not the same thing.) Moreover, they are once again an attempt to shoehorn Overington's pet project, "language-independent sentences/words," which are still generally deemed out of scope for Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 18:55:27 2018 From: unicode at unicode.org (Mark E.
Shoulson via Unicode) Date: Mon, 20 Aug 2018 19:55:27 -0400 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <0124f1a2-01e5-80c6-35b7-8143b71437da@kli.org> On 08/20/2018 10:30 AM, James Kass via Unicode wrote: > As The Universal Character Set, it should be able to support the needs > of all users. And with the Private Use Areas, it does. Here, I agree with you. This kind of experimentation is exactly what the PUA is for, especially for these putative "universal pictographic systems" which will need space to hold the whole system, since the individual signs won't mean much unless you understand the system (which I know I said was an argument against encoding them at all, but that's the point of the PUA: see if you can get some traction, if people really DO find it useful, etc. Then you can make me eat my words.) I think it's been suggested a few times. Go forth into the PUA, and make it yours, then! ~mark From unicode at unicode.org Mon Aug 20 19:04:34 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 20 Aug 2018 20:04:34 -0400 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> Message-ID: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote: > > ... some people who would call a PUA solution either batty > > or crazy. > > I don't think it is either batty or crazy.
People can certainly use > the PUA to interchange text (assuming that they have downloaded fonts > and keyboards or some other input method beforehand), and it can > definitely serve as a proof of concept. Plain symbols, with no > interactions between them (like changing shape with complex scripts), > no combining/non-spacing marks, no case mappings, and so on, are the > best possible case for PUA. It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right? ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 19:17:21 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Mon, 20 Aug 2018 17:17:21 -0700 Subject: Private Use areas In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > Is there a block of RTL PUA also? No. --Ken From unicode at unicode.org Mon Aug 20 19:53:18 2018 From: unicode at unicode.org (via Unicode) Date: Tue, 21 Aug 2018 08:53:18 +0800 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <444142b31601a3fbbdbb765e47cbd125@koremail.com> On 2018-08-21 08:04, Mark E.
Shoulson via Unicode wrote: > On 08/20/2018 03:12 PM, Mark Davis ☕️ via Unicode wrote: > >>> ... some people who would call a PUA solution either batty > or >> crazy. >> >> I don't think it is either batty or crazy. People can certainly use >> the PUA to interchange text (assuming that they have downloaded >> fonts and keyboards or some other input method beforehand), and >> it can definitely serve as a proof of concept. Plain symbols – with no interactions between them (like changing >> shape with complex scripts), no combining/non-spacing marks, no case >> mappings, and so on – are the best possible case for PUA. > > It is kind of a bummer, though, that you can't experiment (easily? or > at all?) in the PUA with scripts that have complex behavior, or even > not-so-complex behavior like accents & combining marks, or RTL > direction (here, also, am I speaking true? Is there a block of RTL > PUA also? I guess there's always RLO, but meh.) Still, maybe it > doesn't really matter much: your special-purpose font can treat any > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character PUA font, which supplements CJK ideographs, does not rotate characters 90 degrees when changing from RTL to vertical display of text. John Knightley > ~mark From unicode at unicode.org Mon Aug 20 19:53:58 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 16:53:58 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Mark E. Shoulson wrote, > ...
James Kass says, "Anyone who has ever studied a > foreign language (or even their own language) would > easily and quickly recognize the intended meanings > of the symbols once they understand the derivation." > ... Well, yeah, once you tell me what something > means, I know what it means! The point of emoji is > that they already make some sort of "obvious" > sense – admittedly, to those who are in the covered > culture. To be clear, I do not think William Overington's personal pronoun symbol designs would make valid emoji candidates. I'm only talking about the symbols as abstract symbols. Blissymbolics, as pointed out by Leo Broukhis, might be good candidates for "emojification". Emoji are pictographic. Abstract symbols are not. From unicode at unicode.org Mon Aug 20 20:47:30 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 17:47:30 -0800 Subject: Tales from the Archives In-Reply-To: <2096117.2386243.1534690888302@mail.yahoo.com> References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: Alan Wood wrote, > Long ago, the response to the question "Why can't I > see character x" was often to install a copy of the > Code2000 font and send the fee ($10 ?) to James Kass > by airmail. It was always only $5. (About twenty years ago, Alan was the first to register it. I still have the envelope.) > Some of the frequent posters have probably passed away, > while others (like me) have got older, and slowed down > and/or developed new interests. We don't get any younger, it's true. Both time and people move on.
Best regards, James Kass From unicode at unicode.org Mon Aug 20 20:57:42 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 17:57:42 -0800 Subject: Tales from the Archives In-Reply-To: References: <2096117.2386243.1534690888302@mail.yahoo.com> Message-ID: Mark Davis wrote, > https://docs.google.com/presentation/d/1QAyfwAn_0SZJ1yd0WiQgoJdG7djzDiq2Isb254ymDZc/edit#slide=id.g38b1fcd632_0_166 > That's an effective presentation. I especially liked the two Stephen Colbert clips. Mark makes a good point here about how specialized Unicode technical issues have their own special forums now. I'd really not taken that into account, and should have. The public list is geared towards being a forum for developers and Unicoders getting together to discuss various aspects of the Standard and its implementation, along with general public users with specific questions/concerns. Another feature of this list is that new character proposals can be vetted here, whether for a single new character or an entire new script. The display issues of yore no longer exist. Technical and specific aspects of Unicode have special forums. New character proposals are usually written by pros who have been through the process many times before. It doesn't really leave us much to discuss, does it? I'll be standing by in case anyone posts a question about how to display "character-x" on their Windows 98 system. Sigh. 
From unicode at unicode.org Mon Aug 20 22:20:48 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Mon, 20 Aug 2018 20:20:48 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On Mon, Aug 20, 2018 at 5:53 PM, James Kass via Unicode wrote: > Blissymbolics, as pointed out > by Leo Broukhis, might be good candidates for "emojification". > Why don't we just get Blissymbolics encoded as it is? -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 20 14:32:20 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 20 Aug 2018 20:32:20 +0100 (BST) Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) Message-ID: <18901449.46712.1534793540040.JavaMail.defaultUser@defaultHost> Doug Ewell wrote: > Yes, you run the risk of someone else's PUA implementation colliding with yours. That's why you create a Private Use Agreement, and make sure it's prominently available to people who want to use your solution. It's not like there are hundreds of PUA schemes anyway. Yes, that is generally true. However, a situation where that does not matter is if one just wishes to include some specially designed glyphs of one's own design in a PDF (Portable Document Format) document and one uses a Private Use Area encoding simply so that the PDF document with a subset of the glyphs of the font embedded in the PDF can be produced using a desktop publishing program. That is, one makes the font, one installs the font, one uses the font within the desktop publishing package. 
I have used that technique and the technique worked very well as the Windows operating system treated my font the same way as it did other fonts. With the desktop publishing package that I am using (Serif PagePlus version X7) that is only using the plane zero Private Use Area. Thus the providing of information to anyone reading the PDF document is as displayed glyphs rather than as code points. The availability of the Private Use Area allowed me to make such code point assignments for the glyphs that I had designed and then use those code points in a manner entirely compatible with The Unicode Standard. William Overington Monday 20 August 2018 From unicode at unicode.org Tue Aug 21 01:50:39 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 20 Aug 2018 22:50:39 -0800 Subject: Unicode 11 Georgian uppercase vs. fonts In-Reply-To: References: <20180726204652.39387370@JRWUBU2> <3C168136-067E-4FE3-AB25-F8CED964A035@evertype.com> Message-ID: (from 2018-07-27) > Michael Everson responded, > >>> If members of the Georgian user community want to consider this a stylistic difference, they are free to do so. >> >> It isn't a stylistic difference. It is a different use of capital letters than Latin, Cyrillic and other scripts use them. suppose that english was written with a bicameral script, but english users only used the upper case letters for emphasis. in other words, personal names (like bela lugosi), place names (like bechuanaland), and book titles (like "the bridge over the river kwai") would always be in lower case. if someone needed to emphasize something by SHOUTING, they would use all-caps to make this stylistic distinction. if english users called upper case "harcourt" and lower case "fenton", there would be no earthly reason for them to consider switching from fenton to harcourt to be anything other than a stylistic difference.
along comes a consortium with script experts and computer encoding experts who rightfully determine that the difference between harcourt and fenton is actually a casing difference, even though the english writing system does not actually use casing in a manner consistent with other bicameral scripts. so the consortium, tasked with breaking down elements of text for computer entry, exchange, and storage, encodes the english script as a casing script. would that action by the consortium alter my perception (as a typical member of the english user community) that the difference between harcourt and fenton is simply stylistic? HECK, no! the same applies to georgian. or any script. whatever the consortium does for computer text processing purposes should NEVER be interpreted as an effort to make the users change their perceptions of their OWN writing systems. we've been through this kind of thing before, with tamil as a notorious example. best regards, james kass From unicode at unicode.org Tue Aug 21 03:01:46 2018 From: unicode at unicode.org (Julian Bradfield via Unicode) Date: Tue, 21 Aug 2018 09:01:46 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 2018-08-20, Mark E. Shoulson via Unicode wrote: > Moreover, they [William's pronoun symbols] are once again an attempt to shoehorn Overington's pet > project, "language-independent sentences/words," which are still > generally deemed out of scope for Unicode. 
I find it increasingly hard to understand why William's project is out of scope (apart from the "demonstrate use first, then encode" principle, which is in any case not applied to emoji), when emoji are language-independent words - or even sentences: the GROWING HEART emoji is (I presume) supposed to be a language-independent way of saying "I love you more every day". Which seems rather more fatuous as a thing to put in a writing-systems standard than the things I think William would want. Not that I want to hear any more about William's unmentionables; I just wish emoji were equally unmentionable. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From unicode at unicode.org Tue Aug 21 05:01:56 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 21 Aug 2018 11:01:56 +0100 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <444142b31601a3fbbdbb765e47cbd125@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> Message-ID: <20180821110156.453c129a@JRWUBU2> On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote: > On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote: > > Still, maybe it > > doesn't really matter much: your special-purpose font can treat any > > codepoint any way it likes, right? > Not all properties come from the font. For example a Zhuang character > PUA font, which supplements CJK ideographs, does not rotate > characters 90 degrees, when change from RTL to vertical display of > text. Isn't that supposed to be treated by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ. Richard. 
From unicode at unicode.org Tue Aug 21 05:08:16 2018 From: unicode at unicode.org (Asmus Freytag via Unicode) Date: Tue, 21 Aug 2018 03:08:16 -0700 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <1317044b-7de2-f4b2-9baf-18ccc85a475e@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 08:17:21 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Tue, 21 Aug 2018 05:17:21 -0800 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: Rebecca Bettencourt wrote, > Why don't we just get Blissymbolics encoded as it is? Blissymbols remain in the Pipeline, which still links the Everson proposal from 1998. The Scripts Encoding Initiative ( http://linguistics.berkeley.edu/sei/ ) page, http://linguistics.berkeley.edu/sei/scripts-not-encoded.html shows Blissymbols and links the same proposal. Blissymbolics Communication International, http://www.blissymbolics.org/ will likely produce the next proposal. Both Scripts Encoding Initiative and Blissymbolics Communication International depend upon funding.
From unicode at unicode.org Tue Aug 21 09:56:51 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 21 Aug 2018 16:56:51 +0200 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <20180821145651.75orx5kfrtlzhfel@angband.pl> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. Perhaps there should be? What about designating a part of the PUA to have a specific property? Only certain properties matter enough: * wide * RTL * combining as most others are better represented in the font itself. This could be done either by parceling one of existing PUA ranges: planes 15 and 16 are virtually unused thus any damage would be negligible; or perhaps by allocating a new range elsewhere. Meow! -- What Would Jesus Do, MUD/MMORPG edition: * multiplay with an admin char to benefit your mortal [Mt3:16-17] * abuse item cloning bugs [Mt14:17-20, Mt15:34-37] * use glitches to walk on water [Mt14:25-26] From unicode at unicode.org Tue Aug 21 12:21:43 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Tue, 21 Aug 2018 19:21:43 +0200 Subject: Private Use areas In-Reply-To: <20180821145651.75orx5kfrtlzhfel@angband.pl> (Adam Borowski via Unicode's message of "Tue, 21 Aug 2018 16:56:51 +0200") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> Message-ID: <86h8jnab4o.fsf@mimuw.edu.pl> On Tue, Aug 21 2018 at 16:56 +0200, unicode at unicode.org writes: > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >> On 8/20/2018 5:04 PM, Mark E.
Shoulson via Unicode wrote: >> > Is there a block of RTL PUA also? >> >> No. > > Perhaps there should be? > > What about designating a part of the PUA to have a specific property? Only > certain properties matter enough: > * wide > * RTL > * combining > as most others are better represented in the font itself. > > This could be done either by parceling one of existing PUA ranges: planes 15 > and 16 are virtually unused thus any damage would be negligible; or perhaps > by allocating a new range elsewhere. I don't think it's a good idea. I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Tue Aug 21 12:36:04 2018 From: unicode at unicode.org (Steven R. Loomis via Unicode) Date: Tue, 21 Aug 2018 10:36:04 -0700 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: 2011 Thread: https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0124.html Please read in particular these two: - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0174.html - https://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0212.html (tl;dr: 1. the PUA set is fixed, 2. being private, the properties may be overridable by conformant implementations.) On Mon, Aug 20, 2018 at 5:17 PM Ken Whistler via Unicode < unicode at unicode.org> wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > Is there a block of RTL PUA also? > > No. > > --Ken > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Tue Aug 21 13:03:41 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Tue, 21 Aug 2018 11:03:41 -0700 Subject: Private Use areas In-Reply-To: <20180821145651.75orx5kfrtlzhfel@angband.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> Message-ID: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: >>> Is there a block of RTL PUA also? >> No. > Perhaps there should be? This is a periodic suggestion that never goes anywhere--for good reason. (You can search the email archives and see that it keeps coming up.) Presuming that this question was asked in good faith... > > What about designating a part of the PUA to have a specific property? The problem with that is that assigning *any* non-default property to any PUA code point would break existing implementations' assumptions about PUA character properties and potentially create havoc with existing use. > Only certain properties matter enough: That is an un-demonstrated assertion that I don't think you have thought through sufficiently. > * wide > * RTL RTL is not some binary counterpart of LTR. There are 23 values of Bidi_Class, and anyone who wanted to implement a right-to-left script in PUA might well have to make use of multiple values of Bidi_Class. Also, there are two major types of strong right-to-leftness: Bidi_Class=R and Bidi_Class=AL. Should a "RTL PUA" zone favor Arabic type behavior or non-Arabic type behavior? > * combining Also not a binary switch. Canonical_Combining_Class is a numeric value, and any value but ccc=0 for a PUA character would break normalization. 
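[Editor's note: Ken's point about default property assumptions is easy to verify. A minimal sketch using Python's standard-library unicodedata module (the specific code points are arbitrary, one from each Private Use range) shows the immutable defaults that conformant implementations bake in:]

```python
import unicodedata

# One code point from each Private Use range: BMP, Plane 15, Plane 16.
for cp in (0xE000, 0xF0000, 0x100000):
    ch = chr(cp)
    print("U+%05X" % cp,
          unicodedata.category(ch),       # gc=Co (Private_Use)
          unicodedata.bidirectional(ch),  # Bidi_Class=L, i.e. strong left-to-right
          unicodedata.combining(ch))      # ccc=0, i.e. not a combining mark
```

[Any scheme that assigned a PUA code point a different Bidi_Class or a nonzero combining class would contradict what this library, and every other conformant implementation, already reports, which is exactly the breakage described above.]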
Then for the General_Category, there are three types of "marks" that count as combining: gc=Mn, gc=Mc, gc=Me. Which of those would be favored in any PUA assignment? > as most others are better represented in the font itself. Really? Suppose someone wants to implement a bicameral script in PUA. They would need case mappings for that, and how would those be "better represented in the font itself"? Or how about digits? Would numeric values for digits be "better represented in the font itself"? How about implementation of punctuation? Would segmentation properties and behavior be "better represented in the font itself"? > > This could be done either by parceling one of existing PUA ranges: planes 15 > and 16 are virtually unused thus any damage would be negligible; That is simply an assertion -- and not the kind of assertion that the UTC tends to accept on spec. I rather suspect that there are multiple participants on this email list, for example, who *do* have implementations making extensive use of Planes 15/16 PUA code points for one thing or another. > or perhaps > by allocating a new range elsewhere. See: https://www.unicode.org/policies/stability_policy.html The General_Category property value Private_Use (Co) is immutable: the set of code points with that value will never change. That guarantee has been in place since 1996, and is a rule that binds the UTC. So nope, sorry, no more PUA ranges. > Meow! Grrr! ;-) As I see it, the only feasible way for people to get specialized behavior for PUA ranges involves first ceasing to assume that somehow they can jawbone the UTC into *standardizing* some ranges for some particular use or another. That simply isn't going to happen. 
People who assume this is somehow easy, and that the UTC are a bunch of boneheads who stand in the way of obvious solutions, do not -- I contend -- understand the complicated interplay of character properties, stability guarantees, and implementation behavior baked into system support libraries for the Unicode Standard. The way forward for folks who want to do this kind of thing is: 1. Define a *protocol* for reliable interchange of custom character property information about PUA code points. 2. Convince more than one party to actually *use* that protocol to define sets of interchangeable character property definitions. 3. Convince at least one implementer to support that protocol to create some relevant interchangeable *behavior* for those PUA characters. And if the goal for #3 is to get some *system* implementer to support the protocol in widespread software, then before starting any of #1, #2, or #3, you had better start instead with: 0. Create a consortium (or other ongoing organization) with a 10-year time horizon and participation by at least one major software implementer, to define, publicize, and advocate for support of the protocol. (And if you expect a major software implementer to participate, you might need to make sure you have a business case defined that would warrant such a 10-year effort!) --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 13:23:48 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 21 Aug 2018 11:23:48 -0700 Subject: Private Use areas In-Reply-To: <86h8jnab4o.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień
via Unicode < unicode at unicode.org> wrote: > I think PUA users should provide the > properties of the characters used in a form analogical to the Unicode > itself, and the software should be able to use this additional > information. > I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.:

PUAB -> Blocks.txt
PUAC -> CaseFolding.txt
PUAW -> EastAsianWidth.txt
PUAL -> LineBreak.txt
PUAD -> UnicodeData.txt

I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.:

PUA[ -> BidiBrackets.txt
PUA] -> BidiMirroring.txt

That alone may get some people to think twice about this idea. :P -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 15:08:49 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Tue, 21 Aug 2018 21:08:49 +0100 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <20180821210849.56aef231@JRWUBU2> On Tue, 21 Aug 2018 11:03:41 -0700 Ken Whistler via Unicode wrote: > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > Really? Suppose someone wants to implement a bicameral script in PUA. > They would need case mappings for that, and how would those be > "better represented in the font itself"? Or how about digits?
Would > numeric values for digits be "better represented in the font itself"? > How about implementation of punctuation? Would segmentation > properties and behavior be "better represented in the font itself"? The least intrusive way of defining the meaning of a graphic (sensu lato) character is by a font, in a very wide sense that would interpret a Unicode code chart as a font. Without a font in this sense, normal characters in the PUA have no meaning. If one insists on a font to have an interpretation, then: (1) PUA characters in plain text are meaningless - I believe that's pretty much the position now. (2) Different schemes can co-exist, even within the same formatted document, by having different formats. This is the case now. It then makes sense to store the properties in the font, which needs to be saved with or in the document for the document to continue to make sense. Casing and digits are luxuries. Are we not told that searching should be done by collation? We then do not need case-folding! Interpreting the preferred representation of Roman numerals does not use Unicode properties beyond the approximate principle of one character, one codepoint. As to segmentation, my understanding was that there were no characters available to indicate word boundaries in scriptio continua; the closest one has is line-breaking suggestions. If my memory serves me right, SIL Graphite fonts can hold line-breaking information. Richard. 
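[Editor's note: the kind of machine-readable PUA property data discussed above, "in a form analogical to the Unicode itself", could be as simple as a UnicodeData.txt-style record file for the private-use assignments. A minimal Python sketch; the code points, names, and property values below are hypothetical, invented purely for illustration:]

```python
# Hypothetical UnicodeData.txt-style records for PUA assignments.
# Fields (as in the real UCD): code;name;gc;ccc;bidi;...
PUA_DATA = """\
E000;MY SCRIPT LETTER KA;Lo;0;R;;;;;N;;;;;
E001;MY SCRIPT VOWEL SIGN I;Mn;0;NSM;;;;;N;;;;;
E002;MY SCRIPT DIGIT ONE;Nd;0;R;;;1;1;N;;;;;
"""

def parse_pua_data(text):
    """Map code point -> (name, general_category, ccc, bidi_class)."""
    props = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        props[cp] = (fields[1], fields[2], int(fields[3]), fields[4])
    return props

props = parse_pua_data(PUA_DATA)
print(props[0xE001])  # -> ('MY SCRIPT VOWEL SIGN I', 'Mn', 0, 'NSM')
```

[The hard part, as the thread notes, is not parsing such a file but getting more than one implementation to honor it.]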
From unicode at unicode.org Tue Aug 21 15:15:35 2018 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Tue, 21 Aug 2018 22:15:35 +0200 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <20180821201535.mfgzkrszsqweps23@angband.pl> On Tue, Aug 21, 2018 at 11:03:41AM -0700, Ken Whistler via Unicode wrote: > > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: > > On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: > > > On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: > > > > Is there a block of RTL PUA also? > > > No. > > Perhaps there should be? > > This is a periodic suggestion that never goes anywhere--for good reason. > (You can search the email archives and see that it keeps coming up.) > > Presuming that this question was asked in good faith... Oif, looks like mere months of inattentive lurking are not enough (the thread I got pointed to was from 2011). Apologies. > > or perhaps by allocating a new range elsewhere. > See: > > https://www.unicode.org/policies/stability_policy.html > > The General_Category property value Private_Use (Co) is immutable: the set > of code points with that value will never change. > > That guarantee has been in place since 1996, and is a rule that binds the > UTC. So nope, sorry, no more PUA ranges. Right. > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character property > information about PUA code points. [...] > And if the goal for #3 is to get some *system* implementer to support the > protocol in widespread software, then before starting any of #1, #2, or #3, > you had better start instead with: > > 0. 
Create a consortium (or other ongoing organization) with a 10-year time > horizon and participation by at least one major software implementer, to > define, publicize, and advocate for support of the protocol. Heh, good point. I wonder, perhaps a long-lived consortium tasked with assigning properties to characters already exists? So your answer _does_ provide a way to go: any PUA use that's no longer private, or any problem someone has with character properties, should go through official channels here instead of inventing one's own standard. With my existing hats on (Debian fonts team member, and someone who messes with terminals in general) I already have two such itches to scratch. Thus, it sounds like I should do the research, prepare a write-up, and then come back to harass you folks with inane questions. Inventing new solutions that work around instead of with you is a bad idea... Meow! -- From unicode at unicode.org Tue Aug 21 16:59:19 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Tue, 21 Aug 2018 14:59:19 -0700 Subject: Private Use areas Message-ID: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> Ken Whistler wrote: > The way forward for folks who want to do this kind of thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project.
-- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Aug 21 17:23:00 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Tue, 21 Aug 2018 15:23:00 -0700 Subject: Private Use areas In-Reply-To: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> References: <20180821145919.665a7a7059d7ee80bb4d670165c8327d.5eca04c37c.wbe@email03.godaddy.com> Message-ID: On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: > Ken Whistler wrote: > > > The way forward for folks who want to do this kind thing is: > > > > 1. Define a *protocol* for reliable interchange of custom character > > property information about PUA code points. > > I've often thought that would be a great idea. You can't get to steps 2 > and 3 without step 1. I'd gladly participate in such a project. > As would I. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 21 18:45:10 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Tue, 21 Aug 2018 19:45:10 -0400 Subject: Private Use areas In-Reply-To: <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> Message-ID: <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: > > > On 8/21/2018 7:56 AM, Adam Borowski via Unicode wrote: >> On Mon, Aug 20, 2018 at 05:17:21PM -0700, Ken Whistler via Unicode wrote: >>> On 8/20/2018 5:04 PM, Mark E. Shoulson via Unicode wrote: >>>> Is there a block of RTL PUA also? >>> No. >> Perhaps there should be? > > This is a periodic suggestion that never goes anywhere--for good > reason. (You can search the email archives and see that it keeps > coming up.) 
> > Presuming that this question was asked in good faith... Yeah, I know there has been talk about such things, and I also knew that whether or not there was an RTL block (which I did not remember for certain), there weren't going to be any *changes* in the PUA, and we were going to have to make do with what there was. There's no way to anticipate all the possible properties people would want in the PUA, though I remember thinking it was probably wrong to make the PUA *strongly* LTR; I know there's a not-strongly flavor too. Best we can do is shout loudly at OpenType tables and hope to cram in behavior (or at least appearance, which is more likely all we can get) that vaguely resembles what we're after. And that's not SO awful, given what we're dealing with. > > As I see it, the only feasible way for people to get specialized > behavior for PUA ranges involves first ceasing to assume that somehow > they can jawbone the UTC into *standardizing* some ranges for some > particular use or another. That simply isn't going to happen. People > who assume this is somehow easy, and that the UTC are a bunch of > boneheads who stand in the way of obvious solutions, do not -- I > contend -- understand the complicated interplay of character > properties, stability guarantees, and implementation behavior baked > into system support libraries for the Unicode Standard. The whole point of the PUA is that it *isn't* standardized (by the UTC). It might have been nice to make some more varied choices of things that couldn't be left unspecified, but you're still going to wind up with "but there aren't any PUA codepoints that are JUST what I need!" And, as said, it's too late now.
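Mark's aside about the PUA being *strongly* LTR is easy to check against stock Unicode property data. The snippet below is illustrative only, using Python's standard `unicodedata` module:

```python
import unicodedata

# The PUA's default property values can be inspected directly: PUA code
# points carry Bidi_Class=L (strong left-to-right) and
# General_Category=Co ("other, private use") in the UCD.
pua_char = '\uE000'          # first BMP private-use code point
print(unicodedata.bidirectional(pua_char))  # 'L'  -> strongly LTR
print(unicodedata.category(pua_char))       # 'Co' -> private use

# This is why an RTL use of the PUA has to fight the defaults: the Bidi
# Algorithm treats the run as left-to-right unless it is wrapped in
# explicit directional controls, e.g. an RLI...PDI isolate.
rtl_forced = '\u2067' + pua_char + '\u2069'  # U+2067 RLI, U+2069 PDI
```

Wrapping every run in isolates is exactly the kind of workaround an RTL private-use script is forced into, since the default properties cannot be changed.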
~mark From unicode at unicode.org Tue Aug 21 21:50:18 2018 From: unicode at unicode.org (Andrew Cunningham via Unicode) Date: Wed, 22 Aug 2018 12:50:18 +1000 Subject: Private Use areas In-Reply-To: <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: On Wednesday, 22 August 2018, Mark E. Shoulson via Unicode < unicode at unicode.org> wrote: > On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: > >> >> > Best we can do is shout loudly at OpenType tables and hope to cram in > behavior (or at least appearance, which is more likely all we can get) that > vaguely resembles what we're after. And that's not SO awful, given what > we're dealing with. > >> >> At the moment I am looking at implementing three unencoded Arabic characters in the PUA. For the foreseeable future OpenType is a non-starter, so I will look at implementing them in Graphite tables in a font. Andrew -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Wed Aug 22 04:58:58 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Wed, 22 Aug 2018 11:58:58 +0200 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: Maybe this debate could find an end if there was a way to encode "private use variants", so that we can override an existing character with correct properties by creating a custom variant, which would immediately inherit the properties of the base character on which it is encoded. But for now there are no private use variant codes (PUV). I think that a small block of 16 codes (maybe even less) would be largely enough (given that it would be used only in pairs after any standard character). They could be used after any base character, possibly even after a combining character (so the default combining class for these PUV should be 0). For now there's still no way to have variant sequences unless they are registered and standardized by Unicode, but registration should not be needed (forbidden) for sequences containing PUV. I think there's a usage pattern for such schemes. Their default (spacing) glyph could be a dotted circle with a single hex digit inside; it would be itself non-joining, it would be itself bidi-neutral and used only after a base character from which it would inherit the directionality (so the glyph would appear automatically on the correct side). Actual fonts implementing these PUV sequences would treat the PUV sequences as distinct unbreakable entities mapped to their own abstract character, and subject to common ligation. On Wed, 22 Aug 2018 at 04:58, Andrew Cunningham via Unicode < unicode at unicode.org> wrote: > > > On Wednesday, 22 August 2018, Mark E.
Shoulson via Unicode < > unicode at unicode.org> wrote: > >> On 08/21/2018 02:03 PM, Ken Whistler via Unicode wrote: >> >>> >>> >> Best we can do is shout loudly at OpenType tables and hope to cram in >> behavior (or at least appearance, which is more likely all we can get) that >> vaguely resembles what we're after. And that's not SO awful, given what >> we're dealing with. >> >>> >>> > At the moment I am looking at implementing three unencoded Arabic > characters in the PUA. > > For the foreseeable future OpenType is a non-starter, so I will look at > implementing them in Graphite tables in a font. > > Andrew > > > > -- > Andrew Cunningham > lang.support at gmail.com > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 04:31:34 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 10:31:34 +0100 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> Message-ID: <20180823103134.00645f90@JRWUBU2> On Wed, 22 Aug 2018 11:58:58 +0200 Philippe Verdy via Unicode wrote: > For now there's still no way to have variant sequences unless they are > registered and standardized by Unicode but registration should not be > needed (forbidden) for sequences containing PUV. I believe this scheme is no worse than hack encodings that use Latin character codes for other characters. These schemes often work. (Indeed, the currently best method of getting Tai Tham displayed as rich text that I can find is to use a transliteration-type encoding and a special font, though I can now get pretty close using the proper character codes in the order laid down in the proposals.)
The major problems I can see with appropriating variation sequences are: (1) It might be restricted to base characters - I have no experimental evidence on whether this would happen. Fonts can happily convert base characters to combining characters, though this works best if Latin line-breaking rules take effect. (2) The appropriated variation sequence might be assigned a meaning - but this is no worse than the general ambiguity of PUA characters. (3) Some base characters get special treatment. For example, I had to change my transliteration scheme because hyphen-minus is treated specially by MS Edge - I was using it as a digraph disjunctor - and so clusters were not being formed. In this case, I would have come unstuck as soon as line-wrapping started, so it was a bad choice anyway. Or are there significant renderers that deliberately ignore variation selectors in unregistered, unstandardised variation sequences? I don't recall any problems from when we were discussing variation sequences for chess pieces. For supplementing a script, it might be best to start at VARIATION-SELECTOR-256, and work down if need be with specialist characters. Richard. From unicode at unicode.org Thu Aug 23 05:28:00 2018 From: unicode at unicode.org (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?= via Unicode) Date: Thu, 23 Aug 2018 12:28:00 +0200 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Aug 23 05:48:52 2018 From: unicode at unicode.org (Asmus Freytag (c) via Unicode) Date: Thu, 23 Aug 2018 03:48:52 -0700 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: On 8/23/2018 3:28 AM, "Jörg Knappen" wrote: > Asmus, > I know your style of humor, but to keep it straight: > All known human languages, even Pirahã, have pronouns for "I" and "you". And languages like Japanese tend to use them - mostly not. Even if the concepts are known, and can be named, there are deep differences across languages concerning the need or conventions for demarcating them with words in any given context. Replacing words by symbols is not going to fix this - the only way to get a 'universal' system of symbolic expression is to invent a new language, with its own conventions for use of these symbols in any given context. A./ > --Jörg Knappen > *Sent:* Monday, 20 August 2018 at 16:20 > *From:* "Asmus Freytag via Unicode" > *To:* unicode at unicode.org > *Subject:* Re: Thoughts on working with the Emoji Subcommittee (was > Re: Thoughts on Emoji Selection Process) > > What about languages that don't have or don't use personal pronouns? > Their speakers might find their use odd or awkward. > > The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see.
> > A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 07:10:35 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 14:10:35 +0200 Subject: Private Use areas In-Reply-To: (Rebecca Bettencourt via Unicode's message of "Tue, 21 Aug 2018 11:23:48 -0700") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: <86ftz5cmh0.fsf@mimuw.edu.pl> On Tue, Aug 21 2018 at 11:23 -0700, unicode at unicode.org writes: > On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bień via Unicode wrote: > > I think PUA users should provide the > properties of the characters used in a form analogous to the Unicode > data itself, and the software should be able to use this additional > information. > > I already provide this myself for my uses of the PUA as well as the > CSUR and any vendor-specific agreements I can find: > > http://www.kreativekorp.com/charset/PUADATA/ I would prefer to see the data in a repository, so others can comment and contribute. As for "any vendor-specific agreements", do MUFI and LINCUA qualify? https://folk.uib.no/hnooh/mufi/ http://andron-typeforum.xobor.de/t10f13-Towards-a-linguistic-corporate-use-area-LINCUA.html > > Of course there is no way to get software to use this information. What kind of software do you have in mind? I'm primarily interested in the locally developed programs https://bitbucket.org/jsbien/unihistext/ https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ and in Emacs - to my disappointment, it looks like the Unicode data are set at compile time, but perhaps this can be negotiated with the developers. Best regards Janusz -- , Janusz S.
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 10:39:15 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 23 Aug 2018 17:39:15 +0200 Subject: Private Use areas In-Reply-To: <20180823103134.00645f90@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> Message-ID: You are confusing things: I do not propose "hacking" existing codes, but instead adding new codes for private variations. It's then up to PUV sequence authors to choose an appropriate base character that can have the properties they want to be inherited by the private-use variation sequence, or to choose a base character that will provide some reasonable reading if rendered as is (by renderers or fonts not implementing the private variation sequence, given that they will also append a symbol for the PUV itself after the standard character). Also I do not want to change anything about any existing variation sequences (using VS1 and so on) and their encoding policies, requiring a prior registration and standardisation. On Thu, 23 Aug 2018 at 11:42, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Wed, 22 Aug 2018 11:58:58 +0200 > Philippe Verdy via Unicode wrote: > > > For now there's still no way to have variant sequences unless they are > > registered and standardized by Unicode but registration should not be > > needed (forbidden) for sequences containing PUV. > > I believe this scheme is no worse than hack encodings that use Latin > character codes for other characters. These schemes often work.
> (Indeed, the currently best method of getting Tai Tham displayed as rich > text that I can find is to use a transliteration-type encoding and a > special font, though I can now get pretty close using the proper > character codes in the order laid down in the proposals.) > > The major problems I can see with appropriating variation sequences > are: > (1) It might be restricted to base characters - I have no > experimental evidence on whether this would happen. Fonts can happily > convert base characters to combining characters, though this works > best if Latin line-breaking rules take effect. > > (2) The appropriated variation sequence might be assigned a meaning - > but this is no worse than the general ambiguity of PUA characters. > > (3) Some base characters get special treatment. For example, I had > to change my transliteration scheme because hyphen-minus is treated > specially by MS Edge - I was using it as a digraph disjunctor - and > so clusters were not being formed. In this case, I would have come > unstuck as soon as line-wrapping started, so it was a bad choice anyway. > > Or are there significant renderers that deliberately ignore variation > selectors in unregistered, unstandardised variation sequences? I don't > recall any problems from when we were discussing variation > sequences for chess pieces. > > For supplementing a script, it might be best to start at > VARIATION-SELECTOR-256, and work down if need be with specialist > characters. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Thu Aug 23 11:11:05 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 17:11:05 +0100 Subject: Private Use areas In-Reply-To: <86ftz5cmh0.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: <20180823171105.058ac317@JRWUBU2> On Thu, 23 Aug 2018 14:10:35 +0200 "Janusz S. Bień via Unicode" wrote: > What kind of software do you have in mind? > > I'm primarily interested in the locally developed programs > > https://bitbucket.org/jsbien/unihistext/ > > https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ It looks as though the security certificates are awry - has someone forgotten to pay the protection money to the right people? (Firefox objects with "The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.") > and in Emacs - to my disappointment, it looks like the Unicode data are > set at compile time, but perhaps this can be negotiated with the > developers. Can you be more specific? For Indic rearrangement I had to define syllables myself with definitions which I then added to composition-function-table. Unfortunately, I then hit the problem that I had to define Indic rearrangement myself, and OpenType fonts fall into several incompatible families, which is why I haven't released a general solution. My emacs kit for Tai Tham is given via http://www.wrdingham.co.uk/lanna/toolkit.html (a probable kinsman got the 'o'), but there are a lot of odds and ends that need sorting out. I would expect that you would be able to override any relevant 'compiler' settings via your Emacs start-up file - I expect Eli Zaretskii will be along soon with more details.
Of course, you could always revert to the old tradition and recompile Emacs yourself - though it may need something like MinGW to compile for Windows. Richard. From unicode at unicode.org Thu Aug 23 11:26:42 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 17:26:42 +0100 Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> Message-ID: <20180823172642.55f167a6@JRWUBU2> On Thu, 23 Aug 2018 17:39:15 +0200 Philippe Verdy via Unicode wrote: > You make a confusion: I do not propose "hacking" existing codes, but > instead adding new codes for private variations. It's then up to PUV > sequence authors to choose an appropropriate base character that can > have the properties they want to be inherited by the private-use > variation sequence, or to choose a base character that will provide > some reasonnable reading if rendererd as is (by renderers or fonts > not implementing the pricate viaration sequence, give nthat they will > also append a symbol for the PUV itself after the standard character). Variation sequences cannot be used to add new characters. Most PUA characters are used to represent new characters. A standard-conformant private variation sequence would generally achieve the same effect as could be achieved by a font feature (typically one of the cvxx, though possibly one of the ssxx), though using font features would be fiddlier and have more limited support, and variation sequences would facilitate data processing. Richard. 
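One concrete property behind Richard's remark that "variation sequences would facilitate data processing": variation selectors are default-ignorable, so software that does not understand a given sequence — registered or private — can strip the selector and fall back to the base character. A sketch, using only the Python standard library:

```python
# Variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF) are
# default-ignorable code points: a process that does not understand a
# particular variation sequence can drop the selector and operate on the
# base character, which is what makes variation sequences friendlier to
# plain data processing than PUA stand-ins.
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))

def strip_variation_selectors(text):
    return ''.join(
        ch for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)
    )

# A hypothetical private variation sequence: base letter + U+E01EF
# VARIATION SELECTOR-256, the selector Richard suggests starting from.
seq = 'A\U000E01EF'
assert strip_variation_selectors(seq) == 'A'   # searching/collation see 'A'
```

A PUA character in the same position would instead be opaque: there is no base character for uncomprehending software to fall back to.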
From unicode at unicode.org Thu Aug 23 11:46:50 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 23 Aug 2018 18:46:50 +0200 Subject: Private Use areas In-Reply-To: <20180823172642.55f167a6@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> Message-ID: On Thu, 23 Aug 2018 at 18:31, Richard Wordingham via Unicode < unicode at unicode.org> wrote: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > > > You are confusing things: I do not propose "hacking" existing codes, but > > instead adding new codes for private variations. It's then up to PUV > > sequence authors to choose an appropriate base character that can > > have the properties they want to be inherited by the private-use > > variation sequence, or to choose a base character that will provide > > some reasonable reading if rendered as is (by renderers or fonts > > not implementing the private variation sequence, given that they will > > also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Remember that I did not speak about existing variation sequences? Only about newly encoding private use variation sequences which do not have to obey the policy of existing VS, and whose purpose would be to inherit most properties (notably direction, breaking, spacing, general category of another existing character). > Most PUA > characters are used to represent new characters. Nor was I speaking about PUA characters. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 13:30:52 2018 From: unicode at unicode.org (Janusz S.
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 20:30:52 +0200 Subject: Private Use areas In-Reply-To: <20180823171105.058ac317@JRWUBU2> (Richard Wordingham via Unicode's message of "Thu, 23 Aug 2018 17:11:05 +0100") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> Message-ID: <86lg8x9bqb.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 17:11 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 14:10:35 +0200 > "Janusz S. Bień via Unicode" wrote: > >> What kind of software do you have in mind? >> >> I'm primarily interested in the locally developed programs >> >> https://bitbucket.org/jsbien/unihistext/ >> >> https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ > > It looks as though the security certificates are awry - has someone > forgotten to pay the protection money to the right people? (Firefox > objects with "The page you are trying to view cannot be shown because > the authenticity of the received data could not be verified.") I see no such problems with Firefox ESR 52.9.0 on Debian testing. Moreover the program reports that the certificate is valid till 04/21/2020. > >> and in Emacs - to my disappointment, it looks like the Unicode data are >> set at compile time, but perhaps this can be negotiated with the >> developers. > > Can you be more specific? I often search characters by name with C-x 8 Return. I would like to use it also for MUFI characters; I already have the name list (the example directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked very closely into the problem and don't remember the details now, but my impression was that it's not simple. Best regards Janusz -- , Janusz S.
Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 13:34:20 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 20:34:20 +0200 Subject: Private Use areas In-Reply-To: <20180823172642.55f167a6@JRWUBU2> (Richard Wordingham via Unicode's message of "Thu, 23 Aug 2018 17:26:42 +0100") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> Message-ID: <86h8jl9bkj.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 17:26 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 17:39:15 +0200 > Philippe Verdy via Unicode wrote: > >> You are confusing things: I do not propose "hacking" existing codes, but >> instead adding new codes for private variations. It's then up to PUV >> sequence authors to choose an appropriate base character that can >> have the properties they want to be inherited by the private-use >> variation sequence, or to choose a base character that will provide >> some reasonable reading if rendered as is (by renderers or fonts >> not implementing the private variation sequence, given that they will >> also append a symbol for the PUV itself after the standard character). > > Variation sequences cannot be used to add new characters. Most PUA > characters are used to represent new characters. A > standard-conformant private variation sequence would generally achieve > the same effect as could be achieved by a font feature (typically one > of the cvxx, though possibly one of the ssxx), This is a typical but IMHO obsolete perspective.
Fonts are for *rendering*; new characters and variants are more and more often needed for *input* of real-life old texts with sufficient precision. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 13:49:31 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Thu, 23 Aug 2018 11:49:31 -0700 Subject: Private Use areas In-Reply-To: <86ftz5cmh0.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can > comment and contribute. > That is actually my intent for the future. Though it's not quite ready yet: https://github.com/kreativekorp/charset/tree/master/puadata That's the data in a "pre-compiled" form; it's turned into a "proper" PUADATA directory using this script: https://github.com/kreativekorp/charset/blob/master/bin/build-public.py As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > I certainly do want to see MUFI and LINCUA provided in this form, but I put them in a different category along with CSUR. I basically have three categories of PUA agreements: Fonts - PUA assignments specific to a font family, e.g. Constructium, Fairfax, Nishiki-teki, Quivira, Junicode, etc. Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, MUFI, LINCUA, etc. Vendors - PUA assignments meant to be used by a single vendor or platform, e.g. Adobe, Apple, etc. but also Linux, MirOS, etc. Thank you for those links by the way.
I had tried to find charts for MUFI in the past but had somehow been unsuccessful. > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > Unicode-related utilities, text editors to start with. You pretty much hit the nail on the head with uniname and emacs as examples. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 14:17:15 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Thu, 23 Aug 2018 22:17:15 +0300 Subject: Private Use areas In-Reply-To: <86lg8x9bqb.fsf@mimuw.edu.pl> (unicode@unicode.org) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> Message-ID: <83r2ioao5g.fsf@gnu.org> > Date: Thu, 23 Aug 2018 20:30:52 +0200 > Cc: Richard Wordingham > From: "Janusz S. Bień via Unicode" > > >> and in Emacs - to my disappointment, it looks like the Unicode data are > >> set at compile time, but perhaps this can be negotiated with the > >> developers. > > > > Can you be more specific? > > I often search characters by name with C-x 8 Return. I would like to use > it also for MUFI characters; I already have the name list (the example > directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked > very closely into the problem and don't remember the details now, but my > impression was that it's not simple. What is "it" in the last sentence? IOW, what is not simple about that with Emacs? It is true that the Unicode-related data is produced at build time, but only some of that is actually recorded in the Emacs binary; the rest is loaded on demand. But all the data is stored in data structures that are mutable, given some Lisp programming.
(It is not clear to me which part of the Unicode data you would like to change; are you talking about adding characters to the list of those defined by Unicode? If you are using the PUA codepoints, it's possible that you will need to update Emacs's notion of PUA as well.) From unicode at unicode.org Thu Aug 23 14:47:03 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Thu, 23 Aug 2018 21:47:03 +0200 Subject: Private Use areas In-Reply-To: <83r2ioao5g.fsf@gnu.org> (Eli Zaretskii's message of "Thu, 23 Aug 2018 22:17:15 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> Message-ID: <86va80987c.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 22:17 +0300, eliz at gnu.org writes: >> Date: Thu, 23 Aug 2018 20:30:52 +0200 >> Cc: Richard Wordingham >> From: "Janusz S. Bień via Unicode" >> >> >> and in Emacs - to my disappointment, it looks like the Unicode data are >> >> set at compile time, but perhaps this can be negotiated with the >> >> developers. >> > >> > Can you be more specific? >> >> I often search characters by name with C-x 8 Return. I would like to use >> it also for MUFI characters; I already have the name list (the example >> directory at https://bitbucket.org/jsbien/unihistext/). I haven't looked >> very closely into the problem and don't remember the details now, but my >> impression was that it's not simple. > > What is "it" in the last sentence? IOW, what is not simple about that > with Emacs? I'm very glad you joined the discussion. My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with the code E010. I can provide the list of names and codes.
> It is true that the Unicode-related data is produced at build time, > but only some of that is actually recorded in the Emacs binary; the > rest is loaded on demand. But all the data is stored in data > structures that are mutable, given some Lisp programming. I was never fluent in Lisp programming and by now I have forgotten almost everything I knew, so it's not a task for me. I was thinking about submitting a feature request, but I have also forgotten the proper procedure for doing that. Moreover, I had the impression that I'm the only person who needs it... > > (It is not clear to me which part of the Unicode data you would like > to change; are you talking about adding characters to the list of > those defined by Unicode? If you are using the PUA codepoints, it's > possible that you will need to update Emacs's notion of PUA as well.) Yes, I would like the PUA codepoints to be handled analogously to the standard ones. What do you mean by Emacs's notion of PUA? Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Thu Aug 23 15:37:39 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 21:37:39 +0100 Subject: Private Use areas In-Reply-To: <86h8jl9bkj.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <618b0e57-5c2e-db89-d4ac-dcd7985c565d@att.net> <05c9667c-7ddc-6dcf-a0b6-9482661963c1@kli.org> <20180823103134.00645f90@JRWUBU2> <20180823172642.55f167a6@JRWUBU2> <86h8jl9bkj.fsf@mimuw.edu.pl> Message-ID: <20180823213739.24365c81@JRWUBU2> On Thu, 23 Aug 2018 20:34:20 +0200 "Janusz S. Bień via Unicode" wrote: > This is a typical but IMHO obsolete perspective. Fonts are for > *rendering*; new characters and variants are more and more often > needed for *input* of real-life old texts with sufficient precision.
If we're talking about glyphs which don't actually correspond to new characters, then that sounds like a good case for private use variation selectors. To quote Tully, "Abusus non tollit usum". Richard. From unicode at unicode.org Thu Aug 23 16:15:10 2018 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Thu, 23 Aug 2018 22:15:10 +0100 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <86va80987c.fsf@mimuw.edu.pl> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> Message-ID: <20180823221510.54c6c43f@JRWUBU2> On Thu, 23 Aug 2018 21:47:03 +0200 "Janusz S. Bień via Unicode" wrote: > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > the code E010. I can provide the list of names and codes. While it should obviously yield, if anything, or for 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE', it would probably be more important to recognise formal aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.
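The point about formal aliases can be illustrated with the two Lao corrections, which are real entries in NameAliases.txt. The sketch below uses toy tables rather than the full UCD and simply consults the alias table when the formal-name table misses:

```python
# Name lookup that also honours formal aliases, as suggested above.
# Toy tables; a real tool would read UnicodeData.txt and NameAliases.txt.

NAMES = {
    "LAO LETTER LO LING": 0x0EA3,   # formal name of U+0EA3
    "LAO LETTER LO LOOT": 0x0EA5,   # formal name of U+0EA5
}

# NameAliases.txt carries "correction" aliases for these two letters,
# whose formal names do not match the letters they actually encode.
ALIASES = {
    "LAO LETTER RO": 0x0EA3,
    "LAO LETTER LO": 0x0EA5,
}

def char_by_name(name):
    cp = NAMES.get(name)
    if cp is None:
        cp = ALIASES.get(name)
    if cp is None:
        raise KeyError(name)
    return chr(cp)

print(hex(ord(char_by_name("LAO LETTER LO"))))  # 0xea5
```

The design choice is simply lookup order: the formal name wins, and aliases are a fallback, so a corrected alias can never shadow a different character's formal name.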
From unicode at unicode.org Thu Aug 23 20:43:39 2018 From: unicode at unicode.org (Julian Wels via Unicode) Date: Fri, 24 Aug 2018 03:43:39 +0200 Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: I think Blissymbols could be a separate, well-defined script in Unicode because they are already more or less well defined by their respective groups. This community of interest can lobby for these implementations as a whole instead of multiple individuals separately. Emoji were born in quite a different way and are in no way as well defined as Blissymbols are, for example. There is no self-governing forum of people to discuss the future of emoji and forthcoming additions. Obviously, because they gained international attention just as they were added to the Unicode Standard, but also maybe because "working with the Emoji Subcommittee" is rather hard. The conversation about Blissymbols made me think about a way to solve the current communication problem, although it might be a bit radical: Why not remove the authority to propose new emojis from the ESC and give it to a dedicated, public Emoji Community? Such a community could formulate additional guidelines for upcoming emojis, draft roadmaps and send a quarterly proposal to the ESC for individual approval. Unicode Members could still express ideas and exercise power through participating in the community and appointing people to the ESC. [image: diagram.png] This change would remove pressure and workload from the ESC while retaining most of the control, especially the last word, but the emoji standard would benefit from a dedicated community. I'm just putting this out there. What are your thoughts on this?
Do you think this is unreasonable, or achievable? Julian ?? On Tue, Aug 21, 2018 at 3:25 PM James Kass via Unicode wrote: > Rebecca Bettencourt wrote, > > > Why don't we just get Blissymbolics encoded as it is? > > The Pipeline still has the Everson proposal from 1998, but Blissymbols > are still in the Pipeline. > > Scripts Encoding Initiative > ( http://linguistics.berkeley.edu/sei/ ) > page, > http://linguistics.berkeley.edu/sei/scripts-not-encoded.html > shows Blissymbols and links the same proposal. > > Blissymbolics Communication International, > http://www.blissymbolics.org/ > will likely produce the next proposal. > > Both Scripts Encoding Initiative and Blissymbolics Communication > International depend upon funding. > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: diagram.png Type: image/png Size: 52833 bytes Desc: not available URL: From unicode at unicode.org Thu Aug 23 20:58:11 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Thu, 23 Aug 2018 21:58:11 -0400 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <4000079c-7d37-7526-dad0-f955e50aaa76@kli.org> Still, pronouns may be universal, but their features aren't... Pronouns in Japanese are not a closed class, and it is not uncommon to use a person's name/title instead of "you".? Happens in English and other languages too, with extremely formal speech, even down to conjugating with 3rd-person verb forms.? (it's really cool to see the mid-sentence back-and-forth shifting in Biblical Hebrew, e.g. Genesis chapter 44.)? 
All of which is to say, as Asmus did, that even "I" and "you" are not interchangeable pieces between languages, easily symbolized by a single "fits-all-languages" placeholder. ~mark On 08/23/2018 06:28 AM, "Jörg Knappen" via Unicode wrote: > Asmus, > I know your style of humor, but to keep it straight: > All known human languages, even Pirahã, have pronouns for "I" and "you". > --Jörg Knappen > *Sent:* Monday, 20 August 2018 at 16:20 > *From:* "Asmus Freytag via Unicode" > *To:* unicode at unicode.org > *Subject:* Re: Thoughts on working with the Emoji Subcommittee (was > Re: Thoughts on Emoji Selection Process) > > What about languages that don't have or don't use personal pronouns? > Their speakers might find their use odd or awkward. > > The same for many other grammatical concepts: they work reasonably > well if used by someone from a related language, or for linguists > trained in general concepts, but languages differ so much in what they > express explicitly that if any native speaker transcribes the features > that are exposed (and not implied) in their native language it may not > be what a reader used to a different language is expecting to see. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Thu Aug 23 21:03:05 2018 From: unicode at unicode.org (Mark E.
Shoulson via Unicode) Date: Thu, 23 Aug 2018 22:03:05 -0400 Subject: Aw: Re: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: References: <20312152.16818.1534418061331.JavaMail.defaultUser@defaultHost> <2249937.3718.1534493111789.JavaMail.defaultUser@defaultHost> <28920539.8997.1534498554809.JavaMail.defaultUser@defaultHost> Message-ID: <7eac15eb-7342-1f97-e5a6-ef42d371423e@kli.org> On 08/23/2018 06:48 AM, Asmus Freytag (c) via Unicode wrote: > On 8/23/2018 3:28 AM, "Jörg Knappen" wrote: >> Asmus, >> I know your style of humor, but to keep it straight: >> All known human languages, even Pirahã, have pronouns for "I" and "you". > > And languages like Japanese, tend to use them - mostly not. > > Even if the concepts are known, and can be named, there are deep > differences across languages concerning the need or conventions for > demarcating them with words in any given context. > > Replacing words by symbols is not going to fix this - the only way to > get a 'universal' system of symbolic expression is to invent a new > language, with its own conventions for use of these symbols in any > given context. > It isn't like replacing words with symbols hasn't been tried... I think Francis Lodwick had a "universal symbology" like this in the works in the 1600s. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Aug 24 03:01:14 2018 From: unicode at unicode.org (Janusz S.
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 10:01:14 +0200 Subject: Private Use areas In-Reply-To: (Rebecca Bettencourt's message of "Thu, 23 Aug 2018 11:49:31 -0700") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> Message-ID: <864lfk2nxx.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 11:49 -0700, beckiergb at gmail.com writes: > On Thu, Aug 23, 2018 at 5:10 AM, Janusz S. Bień wrote: > > > I already provide this myself for my uses of the PUA as well as the > > CSUR and any vendor-specific agreements I can find: > > > > http://www.kreativekorp.com/charset/PUADATA/ > > I would prefer to see the data in a repository, so others can > comment and contribute. > > That is actually my intent for the future. Though it's not quite ready yet: > > https://github.com/kreativekorp/charset/tree/master/puadata Great! > > That's the data in a "pre-compiled" form; it's turned into a "proper" > PUADATA directory using this script: > > https://github.com/kreativekorp/charset/blob/master/bin/build-public.py > > As for "any vendor-specific agreements", do MUFI and LINCUA qualify? > > I certainly do want to see MUFI and LINCUA provided in this form, but > I put them in a different category along with CSUR. I basically have > three categories of PUA agreements: > > Fonts - PUA assignments specific to a font family, e.g. Constructium, Fairfax, Nishiki-teki, Quivira, Junicode, etc. You are probably aware that Junicode 1.000, released in September 2017, fully supports MUFI 4.0 (released in December 2015). I don't know whether Junicode now contains any PUA characters which are not in MUFI. > > Public - PUA agreements meant to be widely used, e.g. CSUR, UCSUR, > MUFI, LINCUA, etc. > > Vendors - PUA assignments meant to be used by a single vendor or > platform, e.g.
Adobe, Apple, etc. but also Linux, MirOS, etc. > > Thank you for those links by the way. I had tried to find charts for > MUFI in the past but had somehow been unsuccessful. Similar files, for a different purpose, have been created by Mikkel Eide Eriksen: https://github.com/mikkelee/mufi-latex An earlier version of MUFI was incorporated in the ENRICH Gaiji bank: http://v2.manuscriptorium.com/apps/gbank/ You can download the source but it doesn't seem useful. A version of MUFI is also available as a searchable character database created by the present single-person MUFI board, i.e. Tarrin Wills, as a part of the beta version of a new MUFI site: http://skaldic.abdn.ac.uk/m.php?p=mufi Some time ago I wrote on the mufi-fonts list: --8<---------------cut here---------------start------------->8--- On Sun, Dec 03 2017 at 6:55 +0100, jsbien at mimuw.edu.pl writes: [...] > I wanted the file quickly to get an overview of the recently released > corpus of 16th century Polish, and it seemed to me that the simplest > and fastest way is to convert the PDF recommendation in a semi-automatic > way. It was more cumbersome than I expected, but thanks to this approach > I've discovered a typo in the recommendation: letter I instead of digit > 1 in EAFI, the code for LATIN ENLARGED LETTER SMALL LIGATURE AE (p. 93 > in the code chart order version). > > For the planned extension of the program I need more info on MUFI > characters, preferably in the format of UnicodeData.txt. This time > however I intend to make haste slowly, so I have a question: > > Is it possible to make publicly available for download the database > underlying http://skaldic.abdn.ac.uk/db.php?if=mufi&table=mufi_char? --8<---------------cut here---------------end--------------->8--- Unfortunately I got no answer to the question. > > Of course there is no way to get software to use this information. > > What kind of software do you have in mind? > > Unicode-related utilities, text editors to start with.
You pretty much > hit the nail on the head with uniname and emacs as examples. :) Thanks! As for uniname by Bill Poser, I exchanged mails with him in 2011: --8<---------------cut here---------------start------------->8--- On Sun, Aug 28 2011 at 12:01 +0200, jsbien at mimuw.edu.pl writes: [...] > A student of mine wrote an alternative program according to my > specification. The program is GPLed and available with > > git clone http://students.mimuw.edu.pl/~findepi/unihistext unihistext Now https://bitbucket.org/jsbien/unihistext > > The source is ready for Debian packaging. > > I think the program is worth better distribution, but its author is no > longer interested in it. Would you be so kind as to consider including > either the program itself in your uniutils or extending your unidesc with > its features? > > Best regards > > Janusz On Sun, Aug 28 2011 at 16:03 -0700, billposer2 at gmail.com writes: > In principle, sure. I'll have a look at it. --8<---------------cut here---------------end--------------->8--- Unfortunately nothing happened, and I thought I should not press the point. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 08:12:15 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 24 Aug 2018 16:12:15 +0300 Subject: Private Use areas In-Reply-To: <86va80987c.fsf@mimuw.edu.pl> (jsbien@mimuw.edu.pl) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> Message-ID: <83d0u7aoy8.fsf@gnu.org> > From: jsbien at mimuw.edu.pl (Janusz S. Bień)
> Cc: unicode at unicode.org, richard.wordingham at ntlworld.com > Date: Thu, 23 Aug 2018 21:47:03 +0200 > > I'm very glad you join the discussion. I'm sorry for not joining sooner. In my defense, I missed the reference to Emacs, and the rest of the discussion is not really interesting for me, as using PUA for new characters is not something I have interest in or experience with. > My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER > A WITH MACRON AND BREVE [MUFI] should yield the character with the code > E010. I can provide the list of names and codes. So you'd like to extend "C-x 8 RET" to recognize names of additional characters and associate them with codepoints in the PUA area? That shouldn't be hard to add. But is that all? won't you also want to tell Emacs about the properties of those characters? or be able to set up fonts for displaying them? IOW, would it be okay to have these characters be "second-class citizens" in Emacs? > > It is true that the Unicode related data is produced at build time, > > but only some of that is actually recorded in the Emacs binary, the > > rest is loaded upon demand. But all the data is stored in data > > structures that are mutable, given some Lisp programming. > > I never was fluent in Lisp programming and by now I forgot almost > everything I knew, so it's not a task for me. I was thinking about > submitting a feature request, but I forgot also the proper procedures to > do it. The proper procedure is to type "M-x report-emacs-bug RET" and then describe the feature(s) you'd like to see added/improved. > Moreover I had the impression that I'm the only person who needs > it... That shouldn't stop you. Many a feature in Emacs started as a request from a single individual. > > (It is not clear to me which part of the Unicode data you would like > > to change; are you talking about adding characters to the list of > > those defined by Unicode? 
If you are using the PUA codepoints, it's > > possible that you will need to update Emacs's notion of PUA as well.) > > Yes, I would like the PUA codepoints to be handled analogically as the > proper ones. What do you mean by Emacs's notion of PUA? Emacs knows about the PUA regions of the Unicode code-space, and treats those codepoints specially. The features you request will probably need to affect the PUA region as well, because the codepoints you use should no longer be treated as PUA. From unicode at unicode.org Fri Aug 24 09:05:34 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Fri, 24 Aug 2018 17:05:34 +0300 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <20180823221510.54c6c43f@JRWUBU2> (message from Richard Wordingham via Unicode on Thu, 23 Aug 2018 22:15:10 +0100) References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <20180823221510.54c6c43f@JRWUBU2> Message-ID: <83a7pbamhd.fsf@gnu.org> > Date: Thu, 23 Aug 2018 22:15:10 +0100 > From: Richard Wordingham via Unicode > > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bień via Unicode" wrote: > > > My needs are very simple, for example C-x 8 Return LATIN CAPITAL > > LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with > > the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', it would probably be more important to recognise formal > aliases, such as 'LAO LETTER LO' for the input of the Lao letter lo > ling (U+0EA5 LAO LETTER LO LOOT), not to be confused with the Lao > letter lo lot (a.k.a. ro rot), U+0EA3 LAO LETTER LO LING.
> > For , I prefer to type "A\_M_X", but then I learnt > XSAMPA. The Emacs command "C-x 8 RET" expects the name of a single codepoint. It should be possible to extend it (or perhaps provide a separate command) to produce named sequences of codepoints, such as those in the above examples, but there's no such feature as of now. If this would be a useful addition, please suggest that on the Emacs issue tracker (using "M-x report-emacs-bug"), and please include with your request the sources where we could find such named sequences to support. Thanks. From unicode at unicode.org Fri Aug 24 10:09:02 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 24 Aug 2018 16:09:02 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> Message-ID: <17627212.30661.1535123342455.JavaMail.defaultUser@defaultHost> Hi An approach that you might like to consider in relation to fonts is that it is possible to have in a font a Description field that consists of plain text. It is stored twice in the font, in two different ways, one of which is just plain text, possibly just ASCII. So if you had text such as $$$PUAB and so on in that Description field then a software application could search for all occurrences of $$$ and gather information for each set of data in that way, without needing separate OpenType tables. As an example of how information can be stored in the Description field here is a link to a font that I made years ago. If you download the font and open it in WordPad, the text can be read. The direct link is as follows. www.users.globalnet.co.uk/~ngo/SPANGBLU.TTF The font is also linked from the following web page, about a quarter of the way down the page.
http://www.users.globalnet.co.uk/~ngo/fonts.htm The web pages encoded in the font are for three of the songs linked from the following page. http://www.users.globalnet.co.uk/~ngo/song0001.htm Best regards, William Overington Friday 24 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 19:23 (GMTDT) To : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 10:21 AM, Janusz S. Bie? via Unicode wrote: I think PUA users should provide the properties of the characters used in a form analogical to the Unicode itself, and the software should be able to use this additional information. I already provide this myself for my uses of the PUA as well as the CSUR and any vendor-specific agreements I can find: http://www.kreativekorp.com/charset/PUADATA/ Of course there is no way to get software to use this information. I have entertained the idea of being able to embed this information into the font itself as OpenType tables, e.g.: PUAB -> Blocks.txt PUAC -> CaseFolding.txt PUAW -> EastAsianWidth.txt PUAL -> LineBreak.txt PUAD -> UnicodeData.txt I've actually invented table names for the majority of UCD files, but those are probably the most relevant. The table names for the more obscure files get rather... creative, e.g.: PUA[ -> BidiBrackets.txt PUA] -> BidiMirroring.txt That alone may get some people to think twice about this idea. :P -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Aug 24 11:40:07 2018 From: unicode at unicode.org (Janusz S. 
=?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 18:40:07 +0200 Subject: Private Use areas In-Reply-To: <83d0u7aoy8.fsf@gnu.org> (Eli Zaretskii's message of "Fri, 24 Aug 2018 16:12:15 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <83d0u7aoy8.fsf@gnu.org> Message-ID: <86va7z90rc.fsf@mimuw.edu.pl> On Fri, Aug 24 2018 at 16:12 +0300, eliz at gnu.org writes: >> From: jsbien at mimuw.edu.pl (Janusz S. Bie?) >> Cc: unicode at unicode.org, richard.wordingham at ntlworld.com >> Date: Thu, 23 Aug 2018 21:47:03 +0200 >> >> I'm very glad you join the discussion. > > I'm sorry for not joining sooner. In my defense, I missed the > reference to Emacs, and the rest of the discussion is not really > interesting for me, as using PUA for new characters is not something I > have interest in or experience with. I don't think you missed anything important. > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL LETTER >> A WITH MACRON AND BREVE [MUFI] should yield the character with the code >> E010. I can provide the list of names and codes. > > So you'd like to extend "C-x 8 RET" to recognize names of additional > characters and associate them with codepoints in the PUA area? That > shouldn't be hard to add. I would prefer extensibility over efficiency, I don't mind loading PUA information from a source declared somehow in .emacs.d., so I can change/expand the list of characters from time to time. > But is that all? won't you also want to tell Emacs about the > properties of those characters? 
Personally I would like additionally to be able to change the case of a letter or string, and I am willing to prepare the necessary information for MUFI characters. Displaying other properties would be nice, but for me this is not crucial. Moreover, somebody has to prepare the data... > or be able to set up fonts for displaying them? It would be nice. I haven't asked for it because I typeset my texts with XeTeX or LuaTeX and the input is more important for me than rendering. > IOW, would it be okay to have these > characters be "second-class citizens" in Emacs? For me it would be acceptable. BTW, I just had a perhaps crazy idea: what about treating a PUA declaration (as you probably noticed, there may be conflicting ones) as a separate coding system? Of course some mechanism for escaping the standard PUA interpretation would be needed. > >> > It is true that the Unicode related data is produced at build time, >> > but only some of that is actually recorded in the Emacs binary, the >> > rest is loaded upon demand. But all the data is stored in data >> > structures that are mutable, given some Lisp programming. >> >> I never was fluent in Lisp programming and by now I forgot almost >> everything I knew, so it's not a task for me. I was thinking about >> submitting a feature request, but I forgot also the proper procedures to >> do it. > > The proper procedure is to type "M-x report-emacs-bug RET" and then > describe the feature(s) you'd like to see added/improved. I will definitely remember now :-) > >> Moreover I had the impression that I'm the only person who needs >> it... > > That shouldn't stop you. Many a feature in Emacs started as a request > from a single individual. > >> > (It is not clear to me which part of the Unicode data you would like >> > to change; are you talking about adding characters to the list of >> > those defined by Unicode? If you are using the PUA codepoints, it's
>> >> Yes, I would like the PUA codepoints to be handled analogically as the >> proper ones. What do you mean by Emacs's notion of PUA? > > Emacs knows about the PUA regions of the Unicode code-space, and > treats those codepoints specially. The features you request will > probably need to affect the PUA region as well, because the codepoints > you use should no longer be treated as PUA. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 12:10:02 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 24 Aug 2018 19:10:02 +0200 Subject: Emacs Verbose Character Entry (was Private Use Areas) In-Reply-To: <83a7pbamhd.fsf@gnu.org> (Eli Zaretskii via Unicode's message of "Fri, 24 Aug 2018 17:05:34 +0300") References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <20180821145651.75orx5kfrtlzhfel@angband.pl> <86h8jnab4o.fsf@mimuw.edu.pl> <86ftz5cmh0.fsf@mimuw.edu.pl> <20180823171105.058ac317@JRWUBU2> <86lg8x9bqb.fsf@mimuw.edu.pl> <83r2ioao5g.fsf@gnu.org> <86va80987c.fsf@mimuw.edu.pl> <20180823221510.54c6c43f@JRWUBU2> <83a7pbamhd.fsf@gnu.org> Message-ID: <86in3z8zdh.fsf@mimuw.edu.pl> On Thu, Aug 23 2018 at 22:15 +0100, unicode at unicode.org writes: > On Thu, 23 Aug 2018 21:47:03 +0200 > "Janusz S. Bie? via Unicode" wrote: > >> My needs are very simple, for example C-x 8 Return LATIN CAPITAL >> LETTER A WITH MACRON AND BREVE [MUFI] should yield the character with >> the code E010. I can provide the list of names and codes. > > While it should obviously yield, if anything, or > for 'LATIN CAPITAL LETTER A WITH MACRON AND > BREVE', In my opinion there is no question what 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE' should yield, because the name should be absent on the name list. 
My example concerns names like 'LATIN CAPITAL LETTER A WITH MACRON AND BREVE [MUFI]' 'COMBINING ABBREVIATION MARK SUPERSCRIPT UR ROUND R FORM [MUFI]' etc. [...] > The Emacs command "C-x 8 RET" expects the name of a single codepoint. It's OK and in my opinion it should stay this way. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Fri Aug 24 14:09:37 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 24 Aug 2018 20:09:37 +0100 (BST) Subject: Thoughts on working with the Emoji Subcommittee (was Re: Thoughts on Emoji Selection Process) In-Reply-To: <17939746.41561.1535136460751.JavaMail.root@webmail17.bt.ext.cpcloud.co.uk> References: <17939746.41561.1535136460751.JavaMail.root@webmail17.bt.ext.cpcloud.co.uk> Message-ID: <5778280.42338.1535137777588.JavaMail.defaultUser@defaultHost> Julian Bradfield wrote: > Not that I want to hear any more about William's unmentionables; I just wish emoji were equally unmentionable. Well, as you mention them perhaps the moderator will allow the following, particularly as it relates to Japanese and Japanese has been mentioned elsewhere in this thread. In Chapter 34 of my novel there is a poem and it is at one time described as being performed in Japanese. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_034.pdf I know almost nothing about Japanese, yet as Japanese script is so very different from Latin script I feel that it provides a good test to include in my research. I am trying to learn more about Japanese so replies to this post are welcome please. I wondered about round-tripping the poem from English to Japanese and back to English. So I tried two experiments, designed so that the round-tripping was specifically not using the same translation method in each of the two directions. Experiment one. 
English to Japanese in Bing Translate and then copy and paste so as to translate from Japanese to English using Google Translate. Experiment two. English to Japanese in Google Translate and then copy and paste so as to translate from Japanese to English using Bing Translate. These worked well. Experiment two had the additional benefit of a lady reading out the poem. I am wondering if Chapter 34 could be the basis for a short play as part of the evening entertainment at the Internationalization & Unicode® Conference (IUC) 42, with the parts played by various delegates to the conference. That could be great and maybe a video could be made of the performance and the video published. The performance of the poem in Japanese could be spectacular. Clearly, expert translation would be needed so as to have a good show. William Overington Friday 24 August 2018 From unicode at unicode.org Sun Aug 26 18:10:23 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Mon, 27 Aug 2018 00:10:23 +0100 (BST) Subject: Private Use areas In-Reply-To: <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> Message-ID: <792311568.784627.1535325024023@mail2.virginmedia.com> > On 21 August 2018 at 01:04 "Mark E. Shoulson via Unicode" wrote: > > It is kind of a bummer, though, that you can't experiment (easily? or at all?) in the PUA with scripts that have complex behavior, or even not-so-complex behavior like accents & combining marks, or RTL direction (here, also, am I speaking true? Is there a block of RTL PUA also? I guess there's always RLO, but meh.) Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?
> > ~mark > > Back in 2006, I was typing the Tai Tham script (then being proposed as the Lanna script) using the PUA and exploring the issue of selecting between what are now and based on the preceding character and between what are now and based on the preceding base character and its subscripts. I was also looking at using variation selectors to override the rules. I was using SIL Graphite fonts when they were getting intermittent support in OpenOffice and Firefox - my main display engine was WorldPad. Nowadays, SIL Graphite seems to be securely supported in LibreOffice and Firefox. Now, back then, Graphite was at least attempting to support RTL; I would expect the RTL support to work well by now. On the other hand, experimenting with OpenType is much harder. The best I've found is transcoding to a Latin range and using an ssxx feature to convert the Latin glyphs back to those for the complex script. I do that to render Tai Tham in Internet Explorer 11 on Windows 7; this complex scheme is a fallback for when the rendering engine fails. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 09:22:15 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 27 Aug 2018 14:22:15 +0000 Subject: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...)) In-Reply-To: <20180821110156.453c129a@JRWUBU2> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> Message-ID: Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics.
The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case). Cf. UAX 50.

Peter

-----Original Message-----
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Tuesday, August 21, 2018 3:02 AM
To: unicode at unicode.org
Subject: Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

On Tue, 21 Aug 2018 08:53:18 +0800 via Unicode wrote:

> On 2018-08-21 08:04, Mark E. Shoulson via Unicode wrote:
> > Still, maybe it doesn't really matter much: your special-purpose font can treat any codepoint any way it likes, right?
> Not all properties come from the font. For example a Zhuang character PUA font, which supplements CJK ideographs, does not rotate characters 90 degrees when changing from RTL to vertical display of text.

Isn't that supposed to be handled by an OpenType feature such as 'vert'? Or does the rendering stack get in the way? However, one might need reflowing text to be about 40% WJ.

Richard.

From unicode at unicode.org Mon Aug 27 03:59:43 2018
From: unicode at unicode.org (William_J_G Overington via Unicode)
Date: Mon, 27 Aug 2018 09:59:43 +0100 (BST)
Subject: Private Use areas
In-Reply-To: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>
References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>
Message-ID: <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost>

Hi

How about the following method.

In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context.

http://www.unicode.org/charts/PDF/U2460.pdf

Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 ..
U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 27 15:20:31 2018 From: unicode at unicode.org (Peter Constable via Unicode) Date: Mon, 27 Aug 2018 20:20:31 +0000 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: This was meant to go to the list. From: Peter Constable Sent: Monday, August 27, 2018 12:33 PM To: wjgo_10009 at btinternet.com; jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Subject: RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode > On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Cc: unicode at unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. 
The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode > wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 15:31:08 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 27 Aug 2018 12:31:08 -0800 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: Peter Constable wrote, > That sounds like a non-conformant use of characters in the U+24xx block. Non-conformant? Well, it's probably overkill anyway. 
A simpler method of identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA00001] or to have the file name be something like MYFILE.TXTPUA00001. Where "PUA00001" equals the CSUR. Other numbers (PUA00002, PUA00003, etc.) for other PUA conventions. If a user has thousands of files using PUA characters, and all the files are using the same PUA convention, why would each file need to contain metadata for each PUA character used within? (Rhetorical) The "prior agreement" part about PUA usage means the user would know in advance how to display the text properly. From unicode at unicode.org Mon Aug 27 15:44:39 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 27 Aug 2018 21:44:39 +0100 (BST) Subject: Private Use areas In-Reply-To: <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> Message-ID: <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Here is the reply that I sent to Peter Constable and to the other people to whom he wrote. Unlike for Mr Constable and for many other people, all of my posts have to be passed by the moderator, and I know why that is the situation. Though that situation was not imposed by a named official of Unicode Inc. acting in a stated official capacity. So my opportunities to defend my ideas are conditional. William Overington Monday 27 August 2018 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 2018/08/27 - 21:18 (GMTDT) To : beckiergb at gmail.com, verdy_p at wanadoo.fr, petercon at microsoft.com, wjgo_10009 at btinternet.com, mark at kli.org, kenwhistler at att.net, richard.wordingham at ntlworld.com, jameskasskrv at gmail.com Subject : Re: Private Use areas Well, it is a pity that you did not send your reply to the Unicode mailing list. 
> That sounds like a non-conformant use of characters in the U+24xx block.

Well, you are an expert on these things and I do not understand with what it would be non-conformant.

It seems to me that for many years some people have wanted a way to convey information about the meaning of Private Use Area characters used in a document in an unobtrusive way within the document. The format that I am suggesting could be the basis of a way to do that. I really do not understand the problem.

Ken Whistler wrote:

>>> > 1. Define a *protocol* for reliable interchange of custom character property information about PUA code points.

Some people use XML for things where two characters are used in a different manner.

A quick downbeat quip comment about my ideas with no explanation is not helpful and might, because of your standing, cause some people not to consider the idea even-handedly for fear of offending you.

I am reminded of a British film from 1955 called The Colditz Story. It used to be one of the regular films on the television years ago. I do not know whether it was ever shown in America, maybe, or maybe it is just a British thing.

https://www.youtube.com/results?search_query=The+Colditz+story

https://en.wikipedia.org/wiki/The_Colditz_Story

The reason why I am reminded of that film is that one of the British prisoners devises a plan for a group of British prisoners to escape from Colditz disguised as German officers and just walk out of the gate. This is ridiculed as impossible because it has been tried before at various prisoner of war camps and the people have always been detected as British prisoners. The man suggesting the scheme then points out that the detection happens because there is clearly something questionable about the direction from which the disguised prisoners arrive, such as from a prisoners' hut; that is the problem, not the quality of the disguises or the basic soundness of the idea.
The man then suggests that they walk out of the German Officers' mess building. Please bear in mind that walking out of the door of the mess building does not mean actually being in the mess, it is a matter of going down the flight of stairs from a storage area, (the stairs having been accessed from under the stage of the castle theatre) walking past the entrance to the dining room and then out of the door, supposedly on their way back, after dinner, to their billets in the village. This done while a concert put on by some others of the prisoners, and attended by the senior German officers, is going on in the castle theatre. So, it is the bit about an idea coming from the wrong direction that reminds me of the film. https://www.youtube.com/watch?v=0eeSYvxVFUw https://www.youtube.com/watch?v=iY8jMkIbwDM https://www.youtube.com/watch?v=QxHsElyFsTI William Overington Monday 27 August 2018 ----Original message---- >From : petercon at microsoft.com Date : 2018/08/27 - 20:33 (GMTDT) To : wjgo_10009 at btinternet.com, jameskasskrv at gmail.com, richard.wordingham at ntlworld.com, mark at kli.org, beckiergb at gmail.com, verdy_p at wanadoo.fr Subject : RE: Private Use areas That sounds like a non-conformant use of characters in the U+24xx block. Peter From: Unicode On Behalf Of William_J_G Overington via Unicode Sent: Monday, August 27, 2018 2:00 AM To: jameskasskrv at gmail.com; richard.wordingham at ntlworld.com; mark at kli.org; beckiergb at gmail.com; verdy_p at wanadoo.fr Cc: unicode at unicode.org Subject: Re: Private Use areas Hi How about the following method. In a text file that contains text that uses Private Use Area characters, start the file with a sequence of Enclosed Alphanumeric characters from regular Unicode, that sequence containing the metadata relating to those Private Use Area characters as used in their present context. http://www.unicode.org/charts/PDF/U2460.pdf Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters U+24B6 .. U+24E9. 
Use U+2473 as if it were a circled space. The use of 20 to mean a space often occurs in web addresses. I know that there it is hexadecimal and here it is decimal but it has the same look of being an encoded space and so that is why I am suggesting using it. Start the sequence with PUAINFO encoded using seven circled Latin letters and any character other than a carriage return or a line feed shows that the sequence has ended. The use of PUAINFO encoded using seven circled Latin letters at the start of the sequence is so that text using enclosed alphanumeric characters for another purpose would not become disrupted. Then a suitable software application can read the text file and then, either automatically or after the clicking of a button, extract metadata information from the sequence of enclosed alphanumeric characters and not display the sequence of enclosed alphanumeric characters. Maybe other circled numbers in the range 10 through to 19 would have special meanings. This method would keep everything within plane zero. William Overington Monday 27 August 2018 ----Original message---- >From : unicode at unicode.org Date : 2018/08/21 - 23:23 (GMTDT) To : doug at ewellic.org Cc : unicode at unicode.org Subject : Re: Private Use areas On Tue, Aug 21, 2018 at 3:02 PM Doug Ewell via Unicode wrote: Ken Whistler wrote: > The way forward for folks who want to do this kind thing is: > > 1. Define a *protocol* for reliable interchange of custom character > property information about PUA code points. I've often thought that would be a great idea. You can't get to steps 2 and 3 without step 1. I'd gladly participate in such a project. As would I. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Mon Aug 27 16:18:31 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Mon, 27 Aug 2018 13:18:31 -0800 Subject: Private Use areas In-Reply-To: <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: William Overington wrote, On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington wrote: > Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters > U+24B6 .. U+24E9. > > Use U+2473 as if it were a circled space. ?????????????????????????????? ?????????????????????? From unicode at unicode.org Mon Aug 27 16:20:26 2018 From: unicode at unicode.org (Rebecca Bettencourt via Unicode) Date: Mon, 27 Aug 2018 14:20:26 -0700 Subject: Private Use areas In-Reply-To: <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Message-ID: > > > That sounds like a non-conformant use of characters in the U+24xx block. > > Well, you are an expert on these things and I do not understand as to with > what it would be non-conformant. > > A conformant process must interpret ??????? as the characters ??????? and not as a signal to process what follows as anything other than plain text. What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "???????". Also, seeing a bunch of circled alphanumeric characters in a document ???????????????????????. 
There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Mon Aug 27 16:26:14 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Mon, 27 Aug 2018 22:26:14 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <20519295.43788.1535405174739.JavaMail.defaultUser@defaultHost> James Kass wrote: > If a user has thousands of files using PUA characters, and all the files are using the same PUA convention, why would each file need to contain metadata for each PUA character used within? (Rhetorical) Because each such file would then be self-contained and free-standing. Such metadata need not necessarily be a huge quantity of data. William Overington Monday 27 August 2018 From unicode at unicode.org Mon Aug 27 19:09:17 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Mon, 27 Aug 2018 20:09:17 -0400 Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <6a83a5f9-5127-cfe4-9ca2-dc4f25d9b1dd@kli.org> On 08/27/2018 05:18 PM, James Kass via Unicode wrote: > William Overington wrote, > > > > On Mon, Aug 27, 2018 at 12:59 AM, William_J_G Overington > wrote: > >> Use circled digits U+24EA, U+2460 .. U+2469, and Circled Latin letters >> U+24B6 .. U+24E9. >> >> Use U+2473 as if it were a circled space. > ?????????????????????????????? > ?????????????????????? And what's wrong with the ASCII digits? 
~mark

From unicode at unicode.org Mon Aug 27 19:44:57 2018
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Mon, 27 Aug 2018 20:44:57 -0400
Subject: Private Use areas
In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost>
Message-ID: 

But there's nothing wrong with proposing a higher-level protocol; indeed, that's what Ken Whistler was saying: you need a protocol to transmit this information. It's metadata, so it will perforce be a higher-level protocol of some kind, whether transmitting actually out-of-band or reserving a piece of the file for metadata. That's fine. I'm not sure what the advantage is of using circled characters instead of plain old ascii. You have to set off your reserved area somehow, and I don't think using circled chars is the least obtrusive way to do it. You could use XML; that would be pretty well-suited to the task, but maybe it's overkill. If all you need is to reference some "standard" PUA interpretation (per James Kass' take on this, not William Overington's), then just a header like "[PUA00001]" would work just fine. (Compare emacs with things like "-*- encoding: utf-8 -*-" or whatever.)

For larger chunks of meta-info, XML might be a good choice, but even then, it could be an XML *header* to an otherwise ordinary text file. Yes, you'd have to delimit it somehow, and probably have a top header (a "magic number") to signal the protocol, but that's doable. For applications not supporting this protocol, such a setup is probably easier for the eye to skip past (even if it's long) than a bunch of circled letters.

A protocol like that is outside of Unicode's scope (just like XML is), but it's certainly something you could write up and try to standardize and get used, with or without the support of ISO.
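[The first-line "magic number" idea discussed in this thread can be sketched in a few lines. The "[PUAnnnnn]" tag format and the function name below are hypothetical illustrations, not anything defined by the thread or by Unicode: a reader peels the convention identifier off the first line and treats everything else as ordinary plain text.]

```python
import re

# Hypothetical first-line tag: "[PUA" followed by five digits and "]",
# optionally terminated by the first line break.
PUA_TAG = re.compile(r"^\[PUA(\d{5})\]\r?\n?")

def split_pua_convention(text):
    """Return (convention_number, remaining_text).

    convention_number is None when the file carries no tag, in which
    case the text is returned untouched. An application that does not
    know the protocol simply sees one odd-looking first line.
    """
    match = PUA_TAG.match(text)
    if match is None:
        return None, text
    return int(match.group(1)), text[match.end():]
```

[A file beginning "[PUA00001]" would thus map to registry entry 1 (e.g. the CSUR, in James Kass's numbering above), while untagged files pass through unchanged.]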
People are coming up with file formats all the time (and if you really want to use circled characters, go ahead. That's something for you to consider in the design phase of the project).

~mark

On 08/27/2018 05:20 PM, Rebecca Bettencourt via Unicode wrote:
> > That sounds like a non-conformant use of characters in the U+24xx block.
>
> Well, you are an expert on these things and I do not understand with what it would be non-conformant.
>
> A conformant process must interpret ??????? as the characters ??????? and not as a signal to process what follows as anything other than plain text.
>
> What you are proposing is a higher-level protocol, whether you realize it or not. Unfortunately your higher-level protocol has a serious flaw in that it cannot represent the string "???????". Also, seeing a bunch of circled alphanumeric characters in a document ???????????????????????.
>
> There are plenty of already-existing higher-level protocols (you mentioned one: XML) that could be used to provide information about PUA characters, and they are all much better suited to that purpose than what you are proposing.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From unicode at unicode.org Tue Aug 28 05:27:28 2018
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Tue, 28 Aug 2018 03:27:28 -0700
Subject: Private Use areas
In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost>
Message-ID: 

An HTML attachment was scrubbed...
URL: From unicode at unicode.org Tue Aug 28 05:44:58 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 13:44:58 +0300 Subject: Line wrapping of mixed LTR/RTL text Message-ID: Hello everyone, I'm having a bit of trouble implementing line wrapping with bidi and I would like to ask for some advice or hints on what is the proper way to do this. UAX#9 section 3.4 says that bidi reordering should be done after line wrapping. But in order to do line wrapping correctly I need to be able to visually ignore some whitespace, and I'm not sure exactly which whitespace must be ignored. There is this sentence in UAX#9 which provides a clue: "[...] trailing whitespace will appear at the visual end of the line (in the paragraph direction).". I'm not sure what that means, but by doing some tests with fribidi and libunibreak I noticed that the whitespace always sticks to the logical end of the word (so visually to the right for LTR runs and to the left for RTL runs), regardless of the base paragraph direction. Is it safe to use this assumption and always remove the whitespace at the logical end of the last word of the line? Or is it more complicated than that? Quick example showing the problem. The following text: ??????? ABC DEF with RTL base direction would wrap (for a certain line width) as: ABC ??????? DEF with two spaces between the Latin and Arabic text, one from the Latin text and one from the Arabic text. Since the line logically ends with the "C" and LTR direction, I should have to probably remove the space after the "C" (and, as a rule, just remove the whitespace at the logical end of the word, regardless of paragraph's direction or word's direction). Is this the right way to do it? Screenshots attached. Thanks! -------------- next part -------------- A non-text attachment was scrubbed... Name: 1.png Type: image/png Size: 12005 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 2.png Type: image/png Size: 14359 bytes Desc: not available URL: From unicode at unicode.org Tue Aug 28 03:26:12 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 09:26:12 +0100 (BST) Subject: Private Use areas In-Reply-To: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> References: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> Message-ID: <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> Hi Mark E. Shoulson wrote: > I'm not sure what the advantage is of using circled characters instead of plain old ascii. My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 10:24:28 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 16:24:28 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <20755305.5008.1535359804198.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> <7994272.5428.1535360383556.JavaMail.defaultUser@defaultHost> Message-ID: <31723478.26849.1535469868137.JavaMail.defaultUser@defaultHost> James Kass wrote: > Non-conformant? Well, it's probably overkill anyway. 
A simpler method of identifying which PUA convention is being used for a file would be to either have the first line of the file being something like [PUA00001] or to have the file name be something like MYFILE.TXTPUA00001. Where "PUA00001" equals the CSUR. Other numbers (PUA00002, PUA00003, etc.) for other PUA conventions. The problem that then arises is that a registry is needed for what those numbers mean, such as PUA01728. So what if someone writes explaining his designs for glyphs for the language of the people who live in the northern part of the fifth planet from the sun in the science fiction novel he is writing? Is registration granted instantly upon request or is there a threshold of some sort? What if lots of people do that, including some people wanting a registry code number for the various emoji that they want? If there is a threshold of proving usage and so on, or of showing that the designs have been produced AT a business or AT a college or whatever, then the system will only work for some users of the Private Use Areas. My opinion is that the system needs to be free-standing, with each usage possibly self-contained or with an external reference to a document that is available. Care would need to be taken to send a copy of any such document to deposit libraries such as The British Library so as to ensure long-term conservation. 
William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 10:58:54 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Tue, 28 Aug 2018 16:58:54 +0100 (BST) Subject: Private Use areas In-Reply-To: References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> Message-ID: <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> Asmus Freytag wrote: > There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. I am thinking of such an ad-hoc special purpose markup language. I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. 
A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property. It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I am wondering how many PUA property variables there would need to be set for the system to be useful. The sequence could start with all of those PUA property values set at their default values so only those that needed changing need be explicitly set, though others could be explicitly set to the default values if a record were desired. William Overington Tuesday 28 August 2018 From unicode at unicode.org Tue Aug 28 11:28:25 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 28 Aug 2018 19:28:25 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: (message from Cosmin Apreutesei via Unicode on Tue, 28 Aug 2018 13:44:58 +0300) References: Message-ID: <834lfe4frq.fsf@gnu.org> > Date: Tue, 28 Aug 2018 13:44:58 +0300 > From: Cosmin Apreutesei via Unicode > > There is this sentence in UAX#9 which provides a clue: "[...] trailing > whitespace will appear at the visual end of the line (in the paragraph > direction).". I'm not sure what that means, but by doing some tests > with fribidi and libunibreak I noticed that the whitespace always > sticks to the logical end of the word (so visually to the right for > LTR runs and to the left for RTL runs), regardless of the base > paragraph direction. 
That is not so if the line ends after the whitespace: in that case the whitespace is trailing, and will appear at the visual end of the line. Only if you add some character after the whitespace will the whitespace "jump" to the other side of the word.

> Quick example showing the problem. The following text:
>
> ??????? ABC DEF
>
> with RTL base direction would wrap (for a certain line width) as:
>
> ABC ???????
> DEF
>
> with two spaces between the Latin and Arabic text, one from the Latin text and one from the Arabic text.

No, it should show the space after ABC to the left of ABC, i.e. immediately before the line end. What UAX#9 tells you is that you need to decide that the line will wrap after the space that follows "ABC", then reorder the line as if it ended after that space, which will produce this:

??????? ABC 

(with the trailing space to the left of "ABC"). Then you should display "DEF" on the next line.

IOW, the correct order is:

. find levels
. wrap in logical order
. reorder wrapped lines

From unicode at unicode.org Tue Aug 28 11:43:01 2018
From: unicode at unicode.org (Doug Ewell via Unicode)
Date: Tue, 28 Aug 2018 09:43:01 -0700
Subject: Private Use areas
Message-ID: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com>

On August 23, 2011, Asmus Freytag wrote:

> On 8/23/2011 7:22 AM, Doug Ewell wrote:
>> Of all applications, a word processor or DTP application would want
>> to know more about the properties of characters than just whether
>> they are RTL. Line breaking, word breaking, and case mapping come to
>> mind.
>>
>> I would think the format used by standard UCD files, or the XML
>> equivalent, would be preferable to making one up:
>
> The right answer would follow the XML format of the UCD.
>
> That's the only format that allows all necessary information contained
> in one file, and it would leverage any effort that users of the
> main UCD have made in parsing the XML format.
> > An XML format should also be flexible in that you can add/remove not > just characters, but properties as needed. > > The worst thing to do, other than designing something from scratch, > would be to replicate the UnicodeData.txt layout with its random, but > fixed collection of properties and insanely many semi-colons. None of > the existing UCD txt files carries all the needed data in a single > file. I don't know if or how I responded 7 years ago, but at least today, I think this is an excellent suggestion. If the goal is to encourage vendors to support PUA assignments, using an exceedingly well-defined format (UAX #42) sitting atop one of the most widely used base formats ever (XML), with all property information in a single repository (per PUA scheme), would be great encouragement. I've devised lots of novel file formats, and I think this is one use case where doing that would be a real hindrance. Storing this information in a font, by hook or crook, would lock users of those PUA characters into that font. At that rate, you might as well use ASCII-hacked fonts, as we did 25 years ago. -- Doug Ewell | Thornton, CO, US | ewellic.org From unicode at unicode.org Tue Aug 28 12:07:51 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Tue, 28 Aug 2018 19:07:51 +0200 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: References: Message-ID: The space encoded just before the logical end of line or linewrap (in the middle of the displayed line) has to be moved to the end of the physical line (in the paragraph direction); it should not be kept in the middle.
If you need to force a linewrap on a non-breaking space (because there's no other break opportunity to wrap the line elsewhere), then treat that non-breaking space as a regular breaking space which will also be moved to the end of the row (after the margin on the ending side of the paragraph), and choose the last non-breaking space on the row; usually, all spaces present at linewraps (including non-breaking spaces) are compacted. But there are other style policies that will force the linewrap preferably after a trailing punctuation or a separator punctuation, or before a leading punctuation, or just after the last unbreakable cluster that can fit the row (including in the middle of words at arbitrary positions if there's no hyphenation process or the script does not support hyphenation, such as sinograms and kanas). Where to insert linewraps is very fuzzy and depends on the rendering context and capabilities of the target device (you cannot scroll a piece of printed paper, but you can scroll a display with a scrollbar or using navigation cursors in a width-restricted input field). On Tue, 28 Aug 2018 at 16:34, Cosmin Apreutesei via Unicode < unicode at unicode.org> wrote: > Hello everyone, > > I'm having a bit of trouble implementing line wrapping with bidi and I > would like to ask for some advice or hints on what is the proper way > to do this. > > UAX#9 section 3.4 says that bidi reordering should be done after line > wrapping. But in order to do line wrapping correctly I need to be able > to visually ignore some whitespace, and I'm not sure exactly which > whitespace must be ignored. > > There is this sentence in UAX#9 which provides a clue: "[...] trailing > whitespace will appear at the visual end of the line (in the paragraph > direction).".
I'm not sure what that means, but by doing some tests > with fribidi and libunibreak I noticed that the whitespace always > sticks to the logical end of the word (so visually to the right for > LTR runs and to the left for RTL runs), regardless of the base > paragraph direction. Is it safe to use this assumption and always > remove the whitespace at the logical end of the last word of the line? > Or is it more complicated than that? > > Quick example showing the problem. The following text: > > ??????? ABC DEF > > with RTL base direction would wrap (for a certain line width) as: > > ABC ??????? > DEF > > with two spaces between the Latin and Arabic text, one from the Latin > text and one from the Arabic text. Since the line logically ends with > the "C" and LTR direction, I should probably remove the space > after the "C" (and, as a rule, just remove the whitespace at the > logical end of the word, regardless of paragraph's direction or word's > direction). Is this the right way to do it? > > Screenshots attached. > > Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 28 12:13:49 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Tue, 28 Aug 2018 18:13:49 +0100 (BST) Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> Message-ID: <1421005745.806742.1535476429135@mail2.virginmedia.com> > > On 27 August 2018 at 15:22 Peter Constable via Unicode wrote: > > Layout engines that support CJK vertical layout do not rely on the 'vert' feature to rotate glyphs for CJK ideographs, but rather rotate the glyph 90° and switch to using vertical glyph metrics.
The 'vert' feature is used to substitute vertical alternate glyphs as needed, such as for punctuation that isn't automatically rotated (and would probably need a differently-positioned alternate in any case). > > Cf. UAX 50. > There have been some pretty confused statements. I believe the observed problem is that PUA characters for Zhuang CJK ideographs get rotated when displayed vertically rather than left-to-right. Unicode is doing what it can in this matter: (a) Zhuang PUA characters are being made individually obsolete. (b) By default, PUA characters have the value of Vertical_orientation=upright as do CJK ideographs. For CJK ideographs, it is not clear to me when the vert feature (if present) would be applied. Is it only for some codepoints (vo=tu), or is it for all that the engine expects to be displayed 'upright' in vertical text? The vrtr feature (if present) would be applied when glyphs are to be rotated. Is it for all such glyphs, or only those for which rotation is expected to be inadequate (vo=tr)? It seems that feature vrt2 is to be applied to all glyphs; perhaps rotation is the default behaviour when there is no look-up value for a glyph that the engine expects to be rotated. The truly difficult case would be when there is no attempt to apply a look-up, possibly vrtr would not apply to \p{vo=R}. I would expect that defining the lookup vrt2 or vrtr to map Zhuang glyphs to themselves (or something prerotated) would cure the problem. This would not work for sequences of Zhuang ideographs treated as RTL text - but that is unlikely to happen. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Tue Aug 28 13:28:58 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 21:28:58 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: <834lfe4frq.fsf@gnu.org> References: <834lfe4frq.fsf@gnu.org> Message-ID: Hi Eli, thanks for answering!
I think I'm getting closer. Just a few more clarifications if you please. > That is not so if the line ends after the whitespace: in that case the > whitespace is trailing, and will appear at the visual end of the > line. So only if it's a soft break should I remove the last logical space; if it's before a hard break, I should leave it alone. > Only if you add some character after the whitespace will the > whitespace "jump" to the other side of the word. ... because the hard break just turned into a soft break and the newly typed character will appear on the next line with a hard line break after it, right? > No, it should show the space after ABC to the left of ABC, > i.e. immediately before the line end. Just to make sure, this moving of the last space to the visual end of the line can only be experienced with a moving cursor, right? I mean as far as displaying goes (and as far as line width computation for the purposes of line wrapping goes), that space is just removed, right? I'm trying to infer the purpose of moving that space to the end of the line instead of just removing it: is the idea to always provide a cursor at the visual end of the line so that typing can continue there or is there more to it? > What UAX#9 tells you is that you need to decide that the line will > wrap after the space that follows "ABC" ... but when computing the line width I should not include the width of that space, right? since it will not take space in the box in the end. >, then reorder the line as if it > ended after that space, which will produce this: > > ??????? ABC > > (with the trailing space to the left of "ABC"). Then you should > display "DEF" on the next line. You mean it will produce this: " ABC ???????"
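The wrap-then-reorder order discussed above (find levels, wrap in logical order, reorder wrapped lines) can be sketched in a few lines of Python. This is a minimal illustration, not a conformant UAX #9 implementation: the embedding levels are assumed to be already resolved, and as a stand-in for real bidi resolution, lowercase letters here represent RTL characters (level 1) and uppercase letters LTR characters (level 2). Rule L1 (reset trailing whitespace to the paragraph embedding level) is exactly what makes the trailing space "jump" to the visual end of the line; rule L2 then reverses the level runs.

```python
# Toy reordering of one wrapped line per UAX #9 rules L1 and L2.
# Assumes levels were resolved earlier; this only shows the reorder step.

def reorder_line(chars, levels, para_level):
    chars, levels = list(chars), list(levels)
    # Rule L1: trailing whitespace takes the paragraph embedding level.
    i = len(chars) - 1
    while i >= 0 and chars[i] == " ":
        levels[i] = para_level
        i -= 1
    # Rule L2: from the highest level down to the lowest odd level,
    # reverse every contiguous run of characters at that level or higher.
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = chars[i:j][::-1]
                i = j
            else:
                i += 1
    return "".join(chars)

# The wrapped first line of the example, in logical order: an RTL word
# ("www" standing in for the Arabic), a space, "ABC", and the trailing
# space after "ABC" where the line wraps.  Paragraph direction is RTL.
line = "www ABC "
levels = [1, 1, 1, 1, 2, 2, 2, 2]
print(reorder_line(line, levels, para_level=1))  # -> " ABC www"
```

The trailing space ends up at the far left, which is the visual end of an RTL line, matching the behaviour Eli describes; without the L1 reset it would stay glued to "ABC" in the middle of the line.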
From unicode at unicode.org Tue Aug 28 13:33:14 2018 From: unicode at unicode.org (Cosmin Apreutesei via Unicode) Date: Tue, 28 Aug 2018 21:33:14 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: References: Message-ID: Hi Philippe, > The space encoded just before the logical end of line or linewrap (in the middle of the displayed line) has to be moved at end of the physical line (in the paragraph direction), it should not be kept in the middle. OK, that seems to confirm what Eli is saying and it clarifies that sentence from UAX#9. Thanks! From unicode at unicode.org Tue Aug 28 13:48:10 2018 From: unicode at unicode.org (Eli Zaretskii via Unicode) Date: Tue, 28 Aug 2018 21:48:10 +0300 Subject: Line wrapping of mixed LTR/RTL text In-Reply-To: (message from Cosmin Apreutesei on Tue, 28 Aug 2018 21:28:58 +0300) References: <834lfe4frq.fsf@gnu.org> Message-ID: <83tvne2uqd.fsf@gnu.org> > From: Cosmin Apreutesei > Date: Tue, 28 Aug 2018 21:28:58 +0300 > Cc: unicode at unicode.org > > > That is not so if the line ends after the whitespace: in that case the > > whitespace is trailing, and will appear at the visual end of the > > line. > > So only if it's a soft break I should indeed remove the last logical > space, if it's before a hard break then leave it alone. Actually, you don't have to remove it; you could leave it. It's only an aesthetic issue. > > No, it should show the space after ABC to the left of ABC, > > i.e. immediately before the line end. > > Just to make sure, this moving of the last space at the visual end of > the line can only be experienced with a moving cursor, right? I mean > as far as displaying goes (and as far as line width computation for > the purposes of line wrapping goes), that space is just removed, > right? As I said, not necessarily. But it is definitely there when you reorder characters for display.
> I'm trying to infer the purpose of moving that space to the > end of the line instead of just removing it If you remove trailing space, then you need to see it being trailing before you remove it. That is the purpose of moving it. > > What UAX#9 tells you is that you need to decide that the line will > > wrap after the space that follows "ABC" > > ... but when computing the line width I should not include the width > of that space, right? since it will not take space in the box in the > end. If you will remove the space, then yes. > You mean it will produce this: > > " ABC ???????" Yes. From unicode at unicode.org Tue Aug 28 23:04:31 2018 From: unicode at unicode.org (via Unicode) Date: Wed, 29 Aug 2018 12:04:31 +0800 Subject: Private Use areas - Vertical Text In-Reply-To: <1421005745.806742.1535476429135@mail2.virginmedia.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> Message-ID: <970787d82640279efdd541f02e39a1bd@koremail.com> Dear Richard and Peter, apologies for the lack of clarity. Let me try to explain below. On 2018-08-29 01:13, WORDINGHAM RICHARD via Unicode wrote: >> On 27 August 2018 at 15:22 Peter Constable via Unicode >> wrote: >> >> Layout engines that support CJK vertical layout do not rely on the >> 'vert' feature to rotate glyphs for CJK ideographs, but rather >> rotate the glyph 90? and switch to using vertical glyph metrics. >> The 'vert' feature is used to substitute vertical alternate glyphs >> as needed, such as for punctuation that isn't automatically rotated >> (and would probably need a differently-positioned alternate in any >> case). >> >> Cf. UAX 50. > > There have been some pretty confused statements. 
I believe the > observed problem is that PUA characters for Zhuang CJK ideographs get > rotated when displayed vertically rather than left-to-right. > Yes: as Richard says, when CJK Zhuang text is displayed vertically, the Zhuang characters encoded in Unicode remain upright, but those with PUA code points are rotated 90°. This is because the PUA characters are treated like English text, which is correctly rotated 90°. The orientation of the CJK characters in this case appears to depend on which block they belong to. As Peter points out, this does not seem to match UAX 50. > Unicode is doing what it can in this matter: > > (a) Zhuang PUA characters are being made individually obsolete. > Yes and no. Whilst a thousand Zhuang characters have been encoded and two thousand have been submitted via the IRG, the number of PUA Zhuang characters is about the same or increasing. In 2006, when this started, just under 6k PUA code points were used; presently there are over 8k, over 6k of which have not been submitted, and the earliest any future submissions can be encoded is 2026. That being said, the number of more common Zhuang characters needing PUA support is coming down. So whilst individual characters are being resolved, the need for PUA Zhuang characters remains, and will do so for decades to come. > (b) By default, PUA characters have the value of > Vertical_orientation=upright as do CJK ideographs. > Noted above. Regards John > For CJK ideographs, it is not clear to me when the vert feature (if > present) would be applied. Is it only for some codepoints (vo=tu), or > is it for all that the engine expects to be displayed 'upright' in > vertical text? The vrtr feature (if present) would be applied when > glyphs are to be rotated. Is it for all such glyphs, or only those > for which rotation is expected to be inadequate (vo=tr)?
It seems > that feature vrt2 is to be applied to all glyphs; perhaps rotation is > the default behaviour when there is no look-up value for a glyph that > the engine expects to be rotated. The truly difficult case would be > when there is no attempt to apply a look-up - possibly vrtr would not > apply to /p{vo=r}. > > I would expect that defining the lookup vrt2 or vrtr to map Zhuang > glyphs to themselves (or something prerotated) would cure the problem. > This would not work for sequences of Zhuang ideographs treated as RTL > text - but that is unlikely to happen. > > Richard. From unicode at unicode.org Wed Aug 29 00:47:47 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Wed, 29 Aug 2018 07:47:47 +0200 Subject: Private Use areas References: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com> Message-ID: <86h8jdaflo.fsf@mimuw.edu.pl> On Tue, Aug 28 2018 at 9:43 -0700, unicode at unicode.org writes: > On August 23, 2011, Asmus Freytag wrote: > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: >>> Of all applications, a word processor or DTP application would want >>> to know more about the properties of characters than just whether >>> they are RTL. Line breaking, word breaking, and case mapping come to >>> mind. >>> >>> I would think the format used by standard UCD files, or the XML >>> equivalent, would be preferable to making one up: Right. I was not so quick to state this so early, but 2 years ago I wrote to the MUFI list: --8<---------------cut here---------------start------------->8--- On Sat, Jan 02 2016 at 12:35 CET, odd.haugen at uib.no writes: [...] > Note the permanent URI at the University Library in Bergen. This will > in all likelihood be the last recommendation of its kind (and > certainly the last edited by the undersigned), so please look out for > new solutions (databases or the like) on the MUFI web site! 
I think that one of the forms, perhaps even the primary one, should follow the original Unicode Character Database and the output of Unibook (http://www.unicode.org/unibook/). The idea can be tested by converting the present recommendation to this form. Unfortunately I'm unable to contribute myself to this task. One of the advantages would be that the various character browsers can be adapted relatively easily to provide info about the MUFI characters. A simpler variant of this idea is to use a Unibook-like format to document fonts. A quick-and-dirty tool for this purpose has been prepared by a student of mine: https://bitbucket.org/jsbien/fntsample-fork-with-ucd-comments/ https://bitbucket.org/jsbien/unicode-ucd-parser A sample output of the tools is available at https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf (the font is also quick-and-dirty and unfinished work). --8<---------------cut here---------------end--------------->8--- Unfortunately there was no reaction. >> >> The right answer would follow the XML format of the UCD. >> >> That's the only format that allows all necessary information contained >> in one file, For me, the comments and cross-references contained in NamesList.txt are also necessary. Do I understand correctly that only "ISO Comment properties" are included in the file? >> and it would leverage any effort that users of the >> main UCD have made in parsing the XML format. >> >> An XML format should also be flexible in that you can add/remove not >> just characters, but properties as needed. >> >> The worst thing to do, other than designing something from scratch, >> would be to replicate the UnicodeData.txt layout with its random, but >> fixed collection of properties and insanely many semi-colons. None of >> the existing UCD txt files carries all the needed data in a single >> file. > > I don't know if or how I responded 7 years ago, but at least today, I > think this is an excellent suggestion.
> > If the goal is to encourage vendors to support PUA assignments, using an > exceedingly well-defined format (UAX #42) sitting atop one of the most > widely used base formats ever (XML), with all property information in a > single repository (per PUA scheme), would be great encouragement. I think we also need the data in a format acceptable to UniBook. > I've devised lots of novel file formats and I think this is one use > case where that would be a real hindrance. > Storing this information in a font, by hook or crook, would lock users > of those PUA characters into that font. At that rate, you might as well > use ASCII-hacked fonts, as we did 25 years ago. Storing the information in a font is inappropriate not only for technical reasons, as I wrote recently (on Thu, Aug 23 2018): > Fonts are for *rendering*, new characters and variants are more and > more often needed for *input* of real life old texts with sufficient > precision. Best regards Janusz -- , Janusz S. Bień emeryt (emeritus) https://sites.google.com/view/jsbien From unicode at unicode.org Wed Aug 29 03:06:36 2018 From: unicode at unicode.org (James Kass via Unicode) Date: Wed, 29 Aug 2018 00:06:36 -0800 Subject: Private Use areas - Vertical Text In-Reply-To: <970787d82640279efdd541f02e39a1bd@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: John Knightley wrote, > Yes, as Richard says when CJK Zhuang text is displayed > vertically whilst the Zhuang characters in Unicode remain > upright, but those with PUA codepoints are rotated 90°. > This is because the PUA characters are treated like English > text, which are correctly rotated 90°. ... > > ... > ...
the need for PUA Zhuang characters remains, and will > so for decades to come. A possible work-around would be to have two fonts for PUA Zhuang, one for horizontal text and one for vertical. The one for the vertical text would have the glyphs in the font pre-rotated 90° anti-clockwise. This would require font switching when switching from horizontal to vertical layout, of course. From unicode at unicode.org Wed Aug 29 03:25:43 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 09:25:43 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: <1421005745.806742.1535476429135@mail2.virginmedia.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> Message-ID: On Tue, 28 Aug 2018 at 18:15, WORDINGHAM RICHARD via Unicode wrote: > > Unicode is doing what it can in this matter: > > (a) Zhuang PUA characters are being made individually obsolete. Not by a nebulous entity called "Unicode", or even by the Unicode Consortium per se, but by the hard work over many years of individual experts such as John Knightley.
Andrew From unicode at unicode.org Wed Aug 29 03:32:57 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 09:32:57 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: <970787d82640279efdd541f02e39a1bd@koremail.com> References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: > > Yes, as Richard says when CJK Zhuang text is displayed vertically whilst > the Zhuang characters in Unicode remain upright, but those with PUA > codepoints are rotated 90?. John, you did not explain by what mechanism you were trying to display vertical PUA Zhuang text. I can display vertically-oriented PUA-encoded CJKVZ ideographs in vertical layout in web pages using CSS, as demonstrated in this test page: http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html The PUA characters display with correct orientation under Windows 10 on the Edge, Chrome and Firefox browsers. The test page only fails under IE, but we are not meant to use IE anymore anyway. Andrew From unicode at unicode.org Wed Aug 29 05:18:19 2018 From: unicode at unicode.org (via Unicode) Date: Wed, 29 Aug 2018 18:18:19 +0800 Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: Dear Andrew, I was using a change horizontal to vertical text feature in office, the PUA characters being from plane 15. 
Regards John On 2018-08-29 16:32, Andrew West via Unicode wrote: > On Wed, 29 Aug 2018 at 05:07, via Unicode wrote: >> >> Yes, as Richard says when CJK Zhuang text is displayed vertically >> whilst >> the Zhuang characters in Unicode remain upright, but those with PUA >> codepoints are rotated 90?. > > John, you did not explain by what mechanism you were trying to display > vertical PUA Zhuang text. > > I can display vertically-oriented PUA-encoded CJKVZ ideographs in > vertical layout in web pages using CSS, as demonstrated in this test > page: > > http://www.babelstone.co.uk/Fonts/PUA_Vertical_Test.html > > The PUA characters display with correct orientation under Windows 10 > on the Edge, Chrome and Firefox browsers. The test page only fails > under IE, but we are not meant to use IE anymore anyway. > > Andrew From unicode at unicode.org Wed Aug 29 07:05:31 2018 From: unicode at unicode.org (Andrew West via Unicode) Date: Wed, 29 Aug 2018 13:05:31 +0100 Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: On Wed, 29 Aug 2018 at 11:18, wrote: > > I was using a change horizontal to vertical text feature in office, the > PUA characters being from plane 15. I tested with Word 2007, and normal PUA characters from my font were displayed with vertical orientation in a vertical text box, but Plane 15 PUA characters were rotated. 
I also tested with Word 2016, and both normal PUA characters and Plane 15 PUA characters were displayed with vertical orientation in a vertical text box, as you want, although there were vertical spacing issues with the Plane 15 PUA characters which suggest that the vertical metrics tables (vhea and vmtx) in the font are not being applied for Plane 15 characters (or it could be a problem with my font). Andrew From unicode at unicode.org Wed Aug 29 15:33:18 2018 From: unicode at unicode.org (WORDINGHAM RICHARD via Unicode) Date: Wed, 29 Aug 2018 21:33:18 +0100 (BST) Subject: Private Use areas - Vertical Text In-Reply-To: References: <20180820114749.665a7a7059d7ee80bb4d670165c8327d.597c9f0c42.wbe@email03.godaddy.com> <30d1d69e-85e6-a956-c486-8757eba1a996@kli.org> <444142b31601a3fbbdbb765e47cbd125@koremail.com> <20180821110156.453c129a@JRWUBU2> <1421005745.806742.1535476429135@mail2.virginmedia.com> <970787d82640279efdd541f02e39a1bd@koremail.com> Message-ID: <910040764.839307.1535574798093@mail2.virginmedia.com> > > On 29 August 2018 at 13:05 Andrew West via Unicode wrote: > > I tested with Word 2007, and normal PUA characters from my font were > > displayed with vertical orientation in a vertical text box, but Plane > 15 PUA characters were rotated. > And then the original question is whether a font can suppress this rotation. For example, it is entirely possible that the rotation could be eliminated by the vrt2 OpenType feature mapping a Zhuang PUA glyph to an identical glyph. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Wed Aug 29 16:42:57 2018 From: unicode at unicode.org (Andrew Glass via Unicode) Date: Wed, 29 Aug 2018 21:42:57 +0000 Subject: Tamil Brahmi Short Mid Vowels In-Reply-To: <20180721085026.6aa07876@JRWUBU2> References: <20180721020131.4b22887b@JRWUBU2> <20180721085026.6aa07876@JRWUBU2> Message-ID: Thank you Richard and Shriramana for bringing up this interesting problem. 
I agree we need to fix this. I don't want to fix this with a font hack or change to USE cluster rules or properties. I think the right place to fix this is in the encoding. This might be either a new character for Tamil Brahmi Puḷḷi (as Shriramana has proposed in L2/12-226) or separate characters for Tamil Brahmi Short E and Tamil Brahmi Short O in independent and dependent forms (4 characters total). I'm inclined to think that a visible virama, Tamil Brahmi Puḷḷi, is the right approach. Cheers, Andrew -----Original Message----- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Saturday, July 21, 2018 12:50 AM To: unicode at unicode.org Subject: Re: Tamil Brahmi Short Mid Vowels On Sat, 21 Jul 2018 07:55:51 +0530 Shriramana Sharma via Unicode > wrote: > This is a unique problem because this is probably the only case where > the same script produces conjuncts for one language and not for > another. There are and have been similar cases. Reformed (a.k.a. 'typewriter') Malayalam v. traditional Malayalam comes immediately to mind. Pre-5.0 Myanmar script was similar, with Pali stacking and Burmese mostly not, though that gives you the precedent of disunifying the invisible stacker and the vowel killer, which I've always considered a bad unification inherited from ISCII. 'Pure' Tai and Pali use stacking quite differently in the Tai Tham script, but some Tai languages use a lot of Pali-style spellings. > I had asked for a separate Tamil Brahmi virama to be encoded which > would obviate this problem but that was shot down. Maybe that case > should be reopened? Could be messy. Are you saying that people are relying on fonts being free of conjuncts? One could use a keyboard with a 'pulli' key that produced - I don't know if people do. Richard. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Wed Aug 29 19:27:33 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) Subject: Private Use areas Message-ID: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> On 29/08/18 07:55, Janusz S. Bień via Unicode wrote: > > On Tue, Aug 28 2018 at 9:43 -0700, unicode at unicode.org writes: > > On August 23, 2011, Asmus Freytag wrote: > > > >> On 8/23/2011 7:22 AM, Doug Ewell wrote: > >>> Of all applications, a word processor or DTP application would want > >>> to know more about the properties of characters than just whether > >>> they are RTL. Line breaking, word breaking, and case mapping come to > >>> mind. > >>> > >>> I would think the format used by standard UCD files, or the XML > >>> equivalent, would be preferable to making one up: [...] > >> > >> The right answer would follow the XML format of the UCD. > >> > >> That's the only format that allows all necessary information contained > >> in one file, > > For me necessary are also comments and crossreferences contained in > NamesList.txt. Do I understand correctly that only "ISO Comment > properties" are included in the file? Even that comment field is obsoleted. But it's unclear to me what exactly it was providing from ISO. > > >> and it would leverage any effort that users of the > >> main UCD have made in parsing the XML format. > >> > >> An XML format should also be flexible in that you can add/remove not > >> just characters, but properties as needed. > >> > >> The worst thing to do, other than designing something from scratch, > >> would be to replicate the UnicodeData.txt layout with its random, but > >> fixed collection of properties and insanely many semi-colons. None of > >> the existing UCD txt files carries all the needed data in a single > >> file. Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It's like the file was made up for dumb parsers unable to handle comment delimiters, and never to be upgraded to do so. But I like the format, and that's why at some point I submitted feedback asking for an extension. Indeed we could use more information than what is yielded by UCD \setminus NamesList.txt (that we may not parse, as per file header). Given NamesList.txt / Code Charts comments are kept minimal by design, one couldn't simply pop them into XML or whatever, as the result would be disappointing and call for completion in the aftermath. Yet another task competing with the CLDR survey. Reviewing CLDR data is IMO top priority. There are many flaws to be fixed in many languages, including in English. A lot of useful digest charts are extracted from XML there, and we really need to go through the data and correct the many, many errors, please. Unlike XML, human readability of CSV may not be immediate. Yes, you simply cannot always count the semicolons and remember the property name from the value position if it isn't obvious by itself. But we use spreadsheets. At least some people do. That's where the magic works. Looking up things in a spreadsheet is a good way to find out about wrong property values. Looks like handling files only programmatically gets everything screwed up. Marcel From unicode at unicode.org Thu Aug 30 13:27:30 2018 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Thu, 30 Aug 2018 12:27:30 -0600 Subject: Unicode Digest, Vol 56, Issue 20 In-Reply-To: Message-ID: <201808301827.w7UIRbqF028462@unicode.org> UnicodeData.txt was devised long before any of the other UCD data files.
Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file. So Unicode can't add a header to this file, and that is the reason the format can never be changed (e.g. with more columns). That is why new files keep getting created instead. The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know. That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives. --Doug Ewell | Thornton, CO, US | ewellic.org

-------- Original message -------- Message: 3 Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) From: Marcel Schneider via Unicode

Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible. I always wondered why the header line is missing, probably because, compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It’s as if the file had been made up for dumb parsers unable to handle comment delimiters, and were never to be upgraded to do so. But I like the format, and that’s why at some point I submitted feedback asking for an extension. [...]

-------------- next part -------------- An HTML attachment was scrubbed...
URL: From unicode at unicode.org Thu Aug 30 16:26:36 2018 From: unicode at unicode.org (Philippe Verdy via Unicode) Date: Thu, 30 Aug 2018 23:26:36 +0200 Subject: Unicode Digest, Vol 56, Issue 20 In-Reply-To: <201808301827.w7UIRbqF028462@unicode.org> References: <201808301827.w7UIRbqF028462@unicode.org> Message-ID: Well, an alternative to XML is JSON, which is more compact and faster/simpler to process; however, JSON has no explicit schema unless the schema is made part of the data itself, complicating its structure (with many levels of arrays of arrays, in which case it becomes less easy for humans to read, but better adapted to automated processes for fast processing). I'd say that the XML alone is enough to generate any JSON-derived dataset that will conform to the schema an application expects to process fast (and with just the data it can process, excluding various extensions still not implemented). But the fastest implementations are also based on data tables encoded in code (such as DLLs or Java classes), or custom database formats (such as Berkeley DB), also generated automatically from the XML, without the processing cost of decompression schemes and parsers. Still today, even if XML is not the usual format used by applications, it is still the most interoperable format that allows building all sorts of applications in all sorts of languages: the cost of parsing is left to an application builder/compiler. Some apps embed the compilers themselves and use a stored cache for faster processing: this approach allows easy updates by detecting changes in the XML source and then downloading them.
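Philippe's point that the XML alone suffices to derive compact application datasets can be sketched in a few lines of Python (the toy fragment and the two-column [cp, gc] schema are hypothetical, chosen only for illustration):

```python
import json
import xml.etree.ElementTree as ET

# Parse the interoperable XML once, then emit a compact JSON dataset
# holding only what this (hypothetical) application needs: the general
# category per code point, under the implicit schema [cp, gc].
fragment = (
    '<repertoire>'
    '<char cp="0024" na="DOLLAR SIGN" gc="Sc"/>'
    '<char cp="0020" na="SPACE" gc="Zs"/>'
    '</repertoire>'
)
root = ET.fromstring(fragment)
dataset = [[c.get("cp"), c.get("gc")] for c in root.iter("char")]
print(json.dumps(dataset))  # [["0024", "Sc"], ["0020", "Zs"]]
```

As noted above, the JSON side carries no schema of its own; the meaning of the two columns lives entirely in the generating code.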
But in CLDR such updates are generally not automated: the general scheme evolves over time, and there are complex dependencies to check before some data becomes usable. Frequently you need to implement new algorithms to follow the processing rules documented in CLDR, or to use data that is not completely validated, or to allow applications to provide their own overrides for insufficiently complete datasets in CLDR. Even though CLDR provides a root locale, and applications are supposed to follow the BCP 47 fallback resolution rules, applications also have their own needs regarding which language codes they use, and CLDR provides many locales that many applications are still not prepared to render correctly; many application users complain if an application is partly translated and contains too many fallbacks to another language, or worse, to another script. On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > UnicodeData.txt was devised long before any of the other UCD data files. > Though it might seem like a simple enhancement to us, adding a header > block, or even a single line, would break a lot of existing processes that > were built long ago to parse this file. > > So Unicode can't add a header to this file, and that is the reason the > format can never be changed (e.g. with more columns). That is why new files > keep getting created instead. > > The XML format could indeed be expanded with more attributes and more > subsections. Any process that can parse XML can handle unknown stuff like > this without misinterpreting the stuff it does know. > > That's why the only two reasonable options for getting UCD data are to > read all the tab- and semicolon-delimited files, and be ready for new > files, or just read the XML. Asking for changes to existing UCD file > formats is kind of a non-starter, given these two alternatives.
> > > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------- Original message -------- > Message: 3 > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > From: Marcel Schneider via Unicode > > Curiously, UnicodeData.txt is lacking the header line. That makes it > inflexible. > I always wondered why the header line is missing, probably because, compared > to the other UCD files, the file looks really odd without a file header > showing at least the version number and datestamp. It’s as if the file had been made up > for dumb parsers unable to handle comment delimiters, and were never to be upgraded > to do so. > > But I like the format, and that’s why at some point I submitted feedback > asking for an extension. [...] > > -------------- next part -------------- An HTML attachment was scrubbed... URL:

From unicode at unicode.org Thu Aug 30 17:33:42 2018 From: unicode at unicode.org (Wordingham Richard via Unicode) Date: Thu, 30 Aug 2018 23:33:42 +0100 (BST) Subject: Private Use areas In-Reply-To: <86h8jdaflo.fsf@mimuw.edu.pl> References: <20180828094301.665a7a7059d7ee80bb4d670165c8327d.32c1b975e2.wbe@email03.godaddy.com> <86h8jdaflo.fsf@mimuw.edu.pl> Message-ID: <1730684724.866931.1535668422391@mail2.virginmedia.com> > > On 29 August 2018 at 06:47 "Janusz S. Bień via Unicode" wrote: > > > > > > Storing this information in a font, by hook or crook, would lock users > > of those PUA characters into that font. At that rate, you might as well > > use ASCII-hacked fonts, as we did 25 years ago. > > > > > I don't see that at all. The obvious way in the sfnt format, used by OpenType, is as a table consisting entirely of the XML file. It is quite easy to add a table to an unsigned sfnt font, and even easier to extract a table consisting entirely of UTF-8 text (ASCII would be easier still) from a font file.
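As a rough illustration of how little machinery the sfnt route needs, here is a sketch; the 'PUAp' table tag and its XML payload are invented for this example, not a registered table:

```python
import struct

# Build a minimal one-table sfnt container in memory, then locate and
# extract the table again by walking the table directory.
def make_sfnt(tag, payload):
    # sfnt header: version 1.0, numTables=1, searchRange/entrySelector/rangeShift
    header = struct.pack(">IHHHH", 0x00010000, 1, 16, 0, 0)
    offset = len(header) + 16                      # one 16-byte directory entry
    entry = struct.pack(">4sIII", tag, 0, offset, len(payload))
    return header + entry + payload

def read_table(data, wanted):
    num_tables = struct.unpack(">H", data[4:6])[0]
    for i in range(num_tables):
        tag, _checksum, off, length = struct.unpack_from(">4sIII", data, 12 + 16 * i)
        if tag == wanted:
            return data[off:off + length]
    return None

font = make_sfnt(b"PUAp", b"<pua>example</pua>")   # UTF-8 text as table payload
print(read_table(font, b"PUAp").decode("utf-8"))   # <pua>example</pua>
```

A real font would of course also carry the required glyf/cmap/name tables and correct checksums; the table-directory walk is the point here.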
> > Storing the information in a font is inappropriate not only for the technical reasons, as I wrote recently (on Thu, Aug 23 2018) > > > > > > Fonts are for *rendering*; new characters and variants are more and > > more often needed for *input* of real-life old texts with sufficient > > precision. > > > > > 1. There are existing methods of associating a font with a text. Not using a font needs a new scheme for associating a set of PUA properties with a portion of a file. The font also serves as a code chart. It can also hold information on how characters combine, which is notoriously beyond the capability of code charts. 2. Registries can vanish. 3. In practice, a file needs to retain an association with a specialist font. Preserving the font should preserve its content, but there are pruning techniques (e.g. WOFF2) that may remove this content. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL:

From unicode at unicode.org Thu Aug 30 18:14:41 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 01:14:41 +0200 (CEST) Subject: Unicode Digest, Vol 56, Issue 20 Message-ID: <957858186.11079.1535670881197.JavaMail.www@wwinf1d25> Thank you for looking into this. First, I’m unable to retrieve the publication you are citing, but a February thread had nearly the same subject, referring to Vol. 50. How did you compute these figures? Is that a code phrase to say: “The same questions over and over again; let’s settle this on the record, as a reference for later inquiries”? Also, "unicode-request at unicode.org" doesn’t appear to be a valid e-mail address. Does that mean that I’d better send a proposal with an enhancement request to docsubmit at unicode.org, rather than contribute to the topic while it is being discussed on the Unicode Public Mail List?
OK, I’ll try to get something out of this, because many people really want things to get better: On 30/08/18 20:37 Doug Ewell via Unicode wrote: > > UnicodeData.txt was devised long before any of the other UCD data files. I can’t think of any era in the computer age when file headers were uncommon, or when a parser able to process semicolons couldn’t be directed to make sense of crosshatches. If releasing a headerless file was ever a mistake, implementers would have been able to anticipate that it might be corrected at some point. Implementations are to be updated at every single Unicode release; that is what I am able to tell, while ignoring the arcana of frozen APIs. > Though it might seem like a simple enhancement to us, adding a header block, or even a single line, > would break a lot of existing processes that were built long ago to parse this file. They are hopelessly outdated anyway, and most of them would have been replaced with something better long since. The remainder might not be worth bothering the rest of the world with headerless files. > So Unicode can't add a header to this file, and that is the reason the format can never be changed > (e.g. with more columns). That is why new files keep getting created instead. I figured out something like that rationale, and I can also understand that Unicode isn’t going to keep releasing headerless files while waiting for somebody to tell them not to do so, and then suddenly add the missing header. Also, I didn’t really ask for that, but suggested adding yet another *new* file, not changing the data structure of the existing UnicodeData.txt. As for the reference, a Google search for "unicodedataextended.txt" just brought it up: http://www.unicode.org/review/pri297/ Having said that, I still think that while not parsing a header line is a reasonable position when the field structure is known to be stable, not being able to *skip* a header is sort of odd.
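To make the point concrete, here is a minimal sketch (with made-up sample records) of a UnicodeData.txt reader for which a comment header, present or absent, is a non-event: skipping it costs one guard clause.

```python
# A tolerant reader for UnicodeData.txt-style records: split on
# semicolons, and skip blank lines and '#' comment lines, so a future
# header would be ignored rather than break the parse.
def parse_ucd_lines(lines):
    records = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # the one-line header guard
            continue
        fields = line.split(";")
        records[fields[0]] = fields[1:]        # code point -> remaining fields
    return records

sample = [
    "# Hypothetical header a dumb parser would choke on",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]
data = parse_ucd_lines(sample)
print(data["0041"][0])  # LATIN CAPITAL LETTER A
```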
> The XML format could indeed be expanded with more attributes and more subsections. > Any process that can parse XML can handle unknown stuff like this without misinterpreting > the stuff it does know. Agreed. I’m not questioning XML. But I’m using spreadsheets. I don’t know how many computer scientists use spreadsheets. Perhaps not many of us look up UnicodeData.txt that way (I use it in raw text, too, and I look up ucd.nounihan.flat.xml). Generating code in a spreadsheet is considered quick-and-dirty. I don’t agree that it’s dirty, but it’s quick. And above all, it appears that doing certain research in spreadsheets is the most efficient way to check whether character properties match character identity. Using spreadsheet software is trivial, so it might be disregarded and left to non-scientists, while it is closer to human experience and allows one to do research in nearly no time, by adding columns, filters and formulae that one would probably spend weeks coding in C, Lisp, Perl or Python (which I cannot do, so I’m biased). > That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, > and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, > given these two alternatives. Given the above, one can easily understand why I do not agree with being limited to these two alternatives. Given that a process must be updatable to grab a newly added small file from the UCD, it can just as well be updated to skip file comments, and even to parse a new *large* file from the UCD. On the other hand, given that Unicode is ready to add new small semicolon-delimited files, it might as well add a new *large* semicolon-delimited file to the UCD. That large file would have a file header and a header line, and would be specified as being flexible. It might have one hundred fields delimited by 99 semicolons.
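Such a wide, headered, semicolon-delimited file is exactly what stock CSV tooling and spreadsheets consume directly. A sketch with hypothetical column names (the field inventory of the proposed file is not specified here):

```python
import csv
import io

# With a header line, columns can be addressed by property name rather
# than by counting semicolons; a spreadsheet would open the same file as-is.
text = io.StringIO(
    "cp;na;gc;sc;Dash;WSpace\n"
    "002D;HYPHEN-MINUS;Pd;Zyyy;Y;N\n"
    "0020;SPACE;Zs;Zyyy;N;Y\n"
)
rows = list(csv.DictReader(text, delimiter=";"))
dashes = [r["cp"] for r in rows if r["Dash"] == "Y"]
print(dashes)  # ['002D']
```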
These 5 million semicolons would still be more lightweight than 5 million attribute names plus the XML tags. The added value is that people using spreadsheets would have a handy file to import, rather than each individual having to convert a large XML file to a large CSV file, for lack of the latter being readily provided by Unicode. If this discussion has a positive echo, I or somebody else may submit an appropriate proposal. But I’d prefer not to repeat the mistake of not discussing a topic on Unicode Public prior to submitting a proposal that is then kindly put on the agenda, but discussed in disfavor and dismissed in disgrace twice at UTC meetings. And why didn’t I wish for upstream discussion here? Because I was naively afraid that the unveiled mistakes could reflect badly on some people. It turned out that nothing reflects badly on anybody. (So UnicodeData.txt could as well get its missing header, BTW.) Regards, Marcel

From unicode at unicode.org Thu Aug 30 23:58:37 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Subject: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> Message-ID: <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > Well, an alternative to XML is JSON, which is more compact and faster/simpler to process; Thanks for pointing out the problem and the solution alike. Indeed, the main drawback of the XML format of the UCD is that it results in an “insane” filesize. “Insane” was applied to the number of semicolons in UnicodeData.txt, but that is irrelevant. What is really insane is the filesize of the XML versions of the UCD. Even without Unihan, it may take up to a minute or so to load in a text editor.
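Loading time aside, the flat XML never has to be held in memory whole: a streaming parse visits one <char> element at a time. A rough Python sketch, shown on an inline fragment (the namespace URI is the one the published UCD XML files declare):

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"
fragment = (
    '<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0"><repertoire>'
    '<char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>'
    '<char cp="0042" na="LATIN CAPITAL LETTER B" gc="Lu"/>'
    '</repertoire></ucd>'
)
names = {}
# iterparse streams the document; clearing each element after use keeps
# memory flat even for a very large file.
for _event, elem in ET.iterparse(io.StringIO(fragment), events=("end",)):
    if elem.tag == NS + "char":
        names[elem.get("cp")] = elem.get("na")
        elem.clear()
print(names["0041"])  # LATIN CAPITAL LETTER A
```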
> however JSON has no explicit schema, unless the schema is made part of the data itself, > complicating its structure (with many levels of arrays of arrays, in which case it becomes > less easy for humans to read, but better adapted to automated processes for fast processing). > > I'd say that the XML alone is enough to generate any JSON-derived dataset that will conform > to the schema an application expects to process fast > (and with just the data it can process, excluding various extensions still not implemented). > But the fastest implementations are also based on data tables encoded in code > (such as DLLs or Java classes), or custom database formats (such as Berkeley DB), > also generated automatically from the XML, without the processing cost of decompression schemes > and parsers. > > Still today, even if XML is not the usual format used by applications, it is still > the most interoperable format that allows building all sorts of applications > in all sorts of languages: the cost of parsing is left to an application builder/compiler. I’ve tried an online tool to convert ucd.nounihan.flat.xml to CSV. The tool is great and offers a lot of options, but given the “insane” file size, my browser was in trouble for over two hours until I shut down the computer manually. From what I could see in the result field, there are many bogus values, meaning that their presence in the tags of most characters is useless. And while many attributes have cryptic names in order to keep the file size minimal, some attributes have overlong values, i.e. the design is inconsistent. E.g. in every character we read: jg="No_Joining_Group" That is bogus. One would need to take it off the tags of most characters, and even for the characters where it is relevant, the value could simply be "No". What’s the use of abbreviating "Joining Group" to "jg" in the attribute name if in the value it is written out? And I’m quoting from U+0000.
Further, many values are set to a crosshatch instead of simply being removed from the characters where they are empty. Then there are the many instances of "undetermined script", resulting in *two* attributes with the "Zyyy" value. Then, in almost every character, we’re told that it is not a whitespace, not a dash, not a hyphen, and not a quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t tell that the UCD actually benefits from the flexibility of XML, given that many attributes are systematically present even where they are useless. Perhaps the ucd-*.xml files would be two thirds, half, or one third their actual size if they were properly designed. > Some apps embed the compilers themselves and use a stored cache for faster processing: > this approach allows easy updates by detecting changes in the XML source, and then > downloading them. > > But in CLDR such updates are generally not automated: the general scheme evolves over time > and there are complex dependencies to check so that some data becomes usable Should probably read *un*usable. > (frequently you need to implement some new algorithms to follow the processing rules > documented in CLDR, or to use data not completely validated, or to allow applications > to provide their overrides from insufficiently complete datasets in CLDR, > even if CLDR provides a root locale and applications are supposed to follow the BCP47 > fallback resolution rules; > applications also have their own needs about which language codes they use or need, > and CLDR provides many locales that many applications are still not prepared to render correctly, > and many application users complain if an application is partly translated and contains > too many fallbacks to another language, or worse to another script). So the case is even worse than what I could see when looking into CLDR.
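Marcel's size estimate above can be tried mechanically: drop every attribute that merely restates an assumed default and compare serialized sizes. A rough sketch (the default table below is illustrative, not the UCD's actual defaulting rules):

```python
import xml.etree.ElementTree as ET

# Attributes that, in this sketch, only restate an assumed default value.
DEFAULTS = {"Dash": "N", "WSpace": "N", "Hyphen": "N", "QMark": "N",
            "jg": "No_Joining_Group"}

def strip_defaults(root):
    for char in root.iter("char"):
        for key, default in DEFAULTS.items():
            if char.get(key) == default:
                del char.attrib[key]   # omit the attribute; readers assume the default
    return root

verbose = ET.fromstring(
    '<repertoire><char cp="0041" gc="Lu" Dash="N" WSpace="N" Hyphen="N" '
    'QMark="N" jg="No_Joining_Group"/></repertoire>'
)
before = len(ET.tostring(verbose))
after = len(ET.tostring(strip_defaults(verbose)))
print(before, after)  # the stripped serialization is markedly smaller
```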
Many countries, including France, don’t care about the data of their own locale in CLDR, but I’m not going to vent about that on Unicode Public, because it involves language offices and authorities, and would have political entanglements. Staying technical: about the file header of UnicodeData.txt, I can see zero technical reasons not to add it. Processes using the file to generate an overview of Unicode also use other files and are thus able to process comments correctly, whereas processes using UnicodeData to look up character properties provided in the file would simply start searching at the code point. (Perhaps there are compilers building DLLs from the file.) On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > UnicodeData.txt was devised long before any of the other UCD data files. Though it might seem like a simple enhancement to us, adding a header block, or even a single line, would break a lot of existing processes that were built long ago to parse this file. > So Unicode can't add a header to this file, and that is the reason the format can never be changed (e.g. with more columns). That is why new files keep getting created instead. > The XML format could indeed be expanded with more attributes and more subsections. Any process that can parse XML can handle unknown stuff like this without misinterpreting the stuff it does know. > That's why the only two reasonable options for getting UCD data are to read all the tab- and semicolon-delimited files, and be ready for new files, or just read the XML. Asking for changes to existing UCD file formats is kind of a non-starter, given these two alternatives. > > > -- Doug Ewell | Thornton, CO, US | ewellic.org > -------- Original message -------- Message: 3 Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > From: Marcel Schneider via Unicode > > Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I always wondered why the header line is missing, probably because, compared to the other UCD files, the file looks really odd without a file header showing at least the version number and datestamp. It’s as if the file had been made up for dumb parsers unable to handle comment delimiters, and were never to be upgraded to do so. But I like the format, and that’s why at some point I submitted feedback asking for an extension. [...]

From unicode at unicode.org Fri Aug 31 00:20:34 2018 From: unicode at unicode.org (Janusz S. =?utf-8?Q?Bie=C5=84?= via Unicode) Date: Fri, 31 Aug 2018 07:20:34 +0200 Subject: CLDR (was: Private Use areas) In-Reply-To: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> (Marcel Schneider via Unicode's message of "Thu, 30 Aug 2018 02:27:33 +0200 (CEST)") References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> Message-ID: <86sh2v3ye5.fsf@mimuw.edu.pl> On Thu, Aug 30 2018 at 2:27 +0200, unicode at unicode.org writes: [...] > Given NamesList.txt / Code Charts comments are kept minimal by design, > one couldn’t simply pop them into XML or whatever, as the result would be > disappointing and call for completion in the aftermath. Yet another task > competing with CLDR survey. Please elaborate. It's not clear to me what you mean. > Reviewing CLDR data is IMO top priority. > There are many flaws to be fixed in many languages including in English. > A lot of useful digest charts are extracted from XML there, Which XML? Where? > and we really > need to go through the data and correct the many many errors, please. Some time ago I tried to have a close look at the Polish locale and found the CLDR site prohibitively confusing. Best regards Janusz -- , Janusz S. Bien emeryt (emeritus) https://sites.google.com/view/jsbien

From unicode at unicode.org Fri Aug 31 01:19:53 2018 From: unicode at unicode.org (Marius Spix via Unicode) Date: Fri, 31 Aug 2018 08:19:53 +0200 Subject: UCD in XML or in CSV?
(was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> Message-ID: <20180831081953.68476d36@spixxi> A good compromise between human readability, machine processability and filesize would be YAML. Unlike JSON, YAML supports comments, anchors and references, multiple documents in a file, and several other features. Regards, Marius Spix On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode wrote: > On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > > > Well, an alternative to XML is JSON, which is more compact and > > faster/simpler to process; > > Thanks for pointing out the problem and the solution alike. Indeed, the > main drawback of the XML format of the UCD is that it results in an > “insane” filesize. “Insane” was applied to the number of semicolons > in UnicodeData.txt, but that is irrelevant. What is really insane is > the filesize of the XML versions of the UCD. Even without Unihan, it > may take up to a minute or so to load in a text editor. > > > however JSON has no explicit schema, unless the schema is > > made part of the data itself, complicating its structure (with many > > levels of arrays of arrays, in which case it becomes less easy for > > humans to read, but better adapted to automated processes for fast > > processing). > > > > I'd say that the XML alone is enough to generate any JSON-derived > > dataset that will conform to the schema an application expects to > > process fast (and with just the data it can process, excluding > > various extensions still not implemented). But the fastest > > implementations are also based on data tables encoded in code (such > > as DLLs or Java classes), or custom database formats (such as > > Berkeley DB), also generated automatically from the XML, without the > > processing cost of decompression schemes and parsers.
> [...]

-------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 488 bytes Desc: OpenPGP digital signature URL:

From unicode at unicode.org Fri Aug 31 03:27:12 2018 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Fri, 31 Aug 2018 10:27:12 +0200 Subject: CLDR (was: Private Use areas) In-Reply-To: <86sh2v3ye5.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> Message-ID: The XML files in these folders: https://unicode.org/repos/cldr/tags/latest/common/ But I agree. I spent an extreme amount of time getting somewhat used to cldr.unicode.org and the data repo, and still I have no clue where to find a concrete piece of information without digging into the site. On Fri, 31 Aug 2018 at 07:22, Janusz S. Bień via Unicode wrote: > > On Thu, Aug 30 2018 at 2:27 +0200, unicode at unicode.org writes: > > [...] > > > Given NamesList.txt / Code Charts comments are kept minimal by design, > > one couldn’t simply pop them into XML or whatever, as the result would be > > disappointing and call for completion in the aftermath. Yet another task > > competing with CLDR survey. > > Please elaborate. It's not clear to me what you mean. > > > Reviewing CLDR data is IMO top priority. > > There are many flaws to be fixed in many languages including in English. > > A lot of useful digest charts are extracted from XML there, > > Which XML? Where? > > > and we really > > need to go through the data and correct the many many errors, please. > > Some time ago I tried to have a close look at the Polish locale and > found the CLDR site prohibitively confusing. > > Best regards > > Janusz > > -- > , > Janusz S. Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien >

From unicode at unicode.org Fri Aug 31 03:36:45 2018 From: unicode at unicode.org (Manuel Strehl via Unicode) Date: Fri, 31 Aug 2018 10:36:45 +0200 Subject: UCD in XML or in CSV?
(was: Re: Unicode Digest, Vol 56, Issue 20) In-Reply-To: <20180831081953.68476d36@spixxi> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: To handle the UCD XML file, a streaming parser like Expat is necessary. For codepoints.net I use that data to stuff everything into a MySQL database. If anyone is interested, the code for that is open source: https://github.com/Codepoints/unicode2mysql/ The example for handling the large XML file can be found here: https://github.com/Codepoints/unicode2mysql/blob/master/bin/ucd_to_sql.py For me it's currently much easier to have all the data in a single place, e.g. a large XML file, than spread over a multitude of files _with different ad-hoc syntaxes_. The situation would possibly be different, though, if the UCD data were split into several files of the same format. (Be it JSON, CSV, YAML, XML, TOML, whatever. Just be consistent.) Nota bene: that is also true for the emoji data, which as of now consists of five plain text files with similar but not identical formats. Cheers, Manuel On Fri, 31 Aug 2018 at 08:19, Marius Spix via Unicode wrote: > > A good compromise between human readability, machine processability and > filesize would be YAML. > > Unlike JSON, YAML supports comments, anchors and references, multiple > documents in a file and several other features. > > Regards, > > Marius Spix > > > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode > wrote: > > > On 30/08/18 23:34 Philippe Verdy via Unicode wrote: > > > > > > Well, an alternative to XML is JSON, which is more compact and > > > faster/simpler to process; > > > > Thanks for pointing out the problem and the solution alike. Indeed, the > > main drawback of the XML format of the UCD is that it results in an > > “insane” filesize. “Insane” was applied to the number of semicolons > > in UnicodeData.txt, but that is irrelevant.
What is really insane is > > the filesize of the XML versions of the UCD. Even without Unihan, it > > may take up to a minute or so to load in a text editor. > > > > > however JSON has no explicit schema, unless the schema is being > > > made part of the data itself, complicating its structure (with many > > > levels of arrays of arrays, in which case it becomes less easy to > > > read by humans, but more adapted to automated processes for fast > > > processing). > > > > > > I'd say that the XML alone is enough to generate any JSON-derived > > > dataset that will conform to the schema an application expects to > > > process fast (and with just the data it can process, excluding > > > various extensions still not implemented). But the fastest > > > implementations are also based on data tables encoded in code (such > > > as DLLs or Java classes), or custom database formats (such as > > > Berkeley DB) also generated automatically from the XML, without the > > > processing cost of decompression schemes and parsers. > > > > > > Still today, even if XML is not the usual format used by > > > applications, it is still the most interoperable format that allows > > > building all sorts of applications in all sorts of languages: the > > > cost of parsing is left to an application builder/compiler. > > > > I've tried an online tool to get ucd.nounihan.flat.xml converted to > > CSV. The tool is great and offers a lot of options, but given the > > "insane" file size, my browser was up for over two hours of trouble > > until I shut down the computer manually. From what I could see in the > > result field, there are many bogus values, meaning that their > > presence is useless in the tags of most characters. And while many > > attributes have cryptic names in order to keep the file size minimal, > > some attributes have overlong values, i.e. the design is inconsistent. > > E.g. in every character we read: jg="No_Joining_Group" That is bogus.
> > One would need to take them off the tags of most characters, and even > > in the characters where they are relevant, the value would be simply > > "No". What's the use of abbreviating "Joining Group" to "jg" in the > > attribute name if in the value it is written out? And I'm quoting from > > U+0000. Further, many values are set to a crosshatch, instead of > > simply being removed from the characters where they are empty. Then > > the many instances of "undetermined script", resulting in *two* > > attributes with the "Zyyy" value. Then in almost each character we're told > > that it is not a whitespace, not a dash, not a hyphen, and not a > > quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn't > > tell that the UCD actually benefits from the flexibility of XML, > > given that many attributes are systematically present even where they > > are useless. Perhaps ucd-*.xml would be two thirds, half, or one > > third their actual size if they were properly designed.
> > > > (frequently you need to implement some new algorithms to follow the > > > processing rules documented in CLDR, or to use data not completely > > > validated, or to allow applications to provide their overrides from > > > insufficiently complete datasets in CLDR, even if CLDR provides a > > > root locale and applications are supposed to follow the BCP47 > > > fallback resolution rules; applications also have their own needs > > > regarding which language codes they use or need, and CLDR provides many > > > locales that many applications are still not prepared to render > > > correctly, and many application users complain if an application is > > > partly translated and contains too many fallbacks to another > > > language, or worse, to another script). > > > > So the case is even worse than what I could see when looking into > > CLDR. Many countries, including France, don't care about the data of > > their own locale in CLDR, but I'm not going to vent about that on > > Unicode Public, because that involves language offices and > > authorities, and would have political entanglements. > > > > Staying technical, I can tell so far about the file header of > > UnicodeData.txt that I can see zero technical reasons not to add it. > > Processes using the file to generate an overview of Unicode also use > > other files and are thus able to process comments correctly, whereas > > those processes using UnicodeData to look up character properties > > provided in the file would start by searching for the code point. (Perhaps > > there are compilers building DLLs from the file.) > > > > On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote: > > > > > > > > > UnicodeData.txt was devised long before any of the other UCD data > > files. Though it might seem like a simple enhancement to us, adding a > > header block, or even a single line, would break a lot of existing > > processes that were built long ago to parse this file.
> > > > > So Unicode can't add a header to this file, and that is the reason > > the format can never be changed (e.g. with more columns). That is why > > new files keep getting created instead. > > > > > The XML format could indeed be expanded with more attributes and more > > subsections. Any process that can parse XML can handle unknown stuff > > like this without misinterpreting the stuff it does know. > > > > > That's why the only two reasonable options for getting UCD data are > > to read all the tab- and semicolon-delimited files, and be ready for > > new files, or just read the XML. Asking for changes to existing UCD > > file formats is kind of a non-starter, given these two alternatives. > > > > -- > > Doug Ewell | Thornton, CO, US | ewellic.org > > > > -------- Original message -------- > > Message: 3 > > Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST) > > From: Marcel Schneider via Unicode > > > > Curiously, UnicodeData.txt is lacking the header line. That makes it > > inflexible. I never wondered why the header line is missing, probably > > because compared to the other UCD files, the file looks really odd > > without a file header showing at least the version number and > > datestamp. It's like the file was made up for dumb parsers unable to > > handle comment delimiters, and never to be upgraded to do so. > > > > But I like the format, and that's why at some point I submitted > > feedback asking for an extension. [...] > From unicode at unicode.org Fri Aug 31 05:17:41 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Fri, 31 Aug 2018 12:17:41 +0200 (CEST) Subject: CLDR (was: Private Use areas) In-Reply-To: <86sh2v3ye5.fsf@mimuw.edu.pl> References: <328015574.12420.1535588853098.JavaMail.www@wwinf1d25> <86sh2v3ye5.fsf@mimuw.edu.pl> Message-ID: <1825242108.55692.1535710661909.JavaMail.www@wwinf1d37> On 31/08/18 07:27 Janusz S. Bień
via Unicode wrote: [...] > > Given NamesList.txt / Code Charts comments are kept minimal by design, > > one couldn't simply pop them into XML or whatever, as the result would be > > disappointing and call for completion in the aftermath. Yet another task > > competing with the CLDR survey. > > Please elaborate. It's not clear to me what you mean. These comments are designed for the Code Charts and as such must not be disproportionate in exhaustivity. E.g. we have lists of related languages ending in an ellipsis. Once this is popped into XML, i.e. extracted from NamesList.txt to be fed into an extensible and unconstrained format (without any constraint as to available space, number and length of comments, and so on), any lack is felt as a discriminating neglect, and there will be a huge rush to add data. Yet Unicode hasn't set up products where that data could be published: not in the Code Charts (for the abovementioned reason), not in ICU insofar as the additional information involved does not match a known demand on the user side (localizing software does not mean providing scholarly exhaustive information about supported characters). The use will be in character pickers providing all available information about a given character. That is why Unicode is to prioritize CLDR for CLDR users, rather than extra information for the web. > > > Reviewing CLDR data is IMO top priority. > > There are many flaws to be fixed in many languages including in English. > > A lot of useful digest charts are extracted from XML there, > > Which XML? Where? More precisely it is LDML, the CLDR-specific XML. What I called "digest charts" are the charts found here: http://www.unicode.org/cldr/charts/34/ The access is via this page: http://cldr.unicode.org/index/downloads where the charts are in the Charts column, while the raw data is under SVN Tag. > > > and we really > > need to go through the data and correct the many many errors, please.
> > Some time ago I tried to have a close look at the Polish locale and > found the CLDR site prohibitively confusing. I experienced some trouble too, mainly because "SVN Tag" is counter-intuitive as the access point to the XML data (except when knowing about SubVersioN). Polish data is found here: https://www.unicode.org/cldr/charts/34/summary/pl.html The access is via the top of the "Summary" index page (showing root data): https://www.unicode.org/cldr/charts/34/summary/root.html You may wish to particularly check the By-Type charts: https://www.unicode.org/cldr/charts/34/by_type/index.html Here I'd suggest first focusing on alphabetic information and on punctuation. https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html Under Latin (table caption, without anchor) we find out what punctuation Polish has compared to other locales using the same script. The exact character appears when hovering over the header row. E.g. U+2011 NON-BREAKING HYPHEN is systematically missing, which is an error in almost every locale using the hyphen. The TC is about to correct that. Further you will see that while Polish uses the apostrophe https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish CLDR does not have the correct apostrophe for Polish, as opposed e.g. to French. You may wish to note that from now on, both U+0027 APOSTROPHE and U+0022 QUOTATION MARK are ruled out in almost all locales, given the preferred characters in publishing are U+2019 and, for Polish, the U+201E and U+201D that are already found in CLDR pl. Note however that according to the information provided by English Wikipedia: https://en.wikipedia.org/wiki/Quotation_mark#Polish Polish also uses single quotes, which by contrast are still missing in CLDR. Now you might understand what I meant when pointing out that there are still many errors in many languages in CLDR, including in English. Best regards, Marcel > > Best regards > > Janusz > > -- > Janusz S.
Bien > emeryt (emeritus) > https://sites.google.com/view/jsbien > > From unicode at unicode.org Fri Aug 31 12:50:08 2018 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 31 Aug 2018 10:50:08 -0700 Subject: UCD in XML or in CSV? In-Reply-To: References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> Message-ID: <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote: > For codepoints.net I use that data to stuff everything in a MySQL > database. Well, for some sense of "everything", anyway. ;-) People having this discussion should keep in mind a few significant points. First, the UCD proper isn't "everything", extensive as it is. There are also other significant sets of data that the UTC maintains about characters in other formats, as well, including the data files associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, etc.), UTS #10 (collation), UTR #25 (a set of math-related property values), and UTS #51 (emoji-related). The emoji-related data has now strayed into the CLDR space, so a significant amount of the information about emoji characters is now carried as CLDR tags. And then there is various other information about individual characters (or small sets of characters) scattered in the core spec -- some in tables, some not, as well as mappings to dozens of external standards. There is no actual definition anywhere of what "everything" actually is. Further, it is a mistake to assume that every character property just associates a simple attribute with a code point. There are multiple types of mappings, complex relational and set properties, and so forth. 
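[Even the "simple" semicolon-delimited UCD files illustrate the point that properties are not just one value per code point: a single record may name one code point or a whole range like 0041..005A. The following is a minimal sketch, assuming the Scripts.txt layout, and is not the UTC's own tooling.]

```python
# Minimal sketch (not the UTC's tooling): parse a UCD-style
# semicolon-delimited property file such as Scripts.txt, where each
# record covers either a single code point or a range "XXXX..YYYY",
# and "#" starts a comment.
def parse_ucd_lines(lines):
    """Yield (first, last, value) tuples for each data record."""
    for line in lines:
        line = line.split('#', 1)[0].strip()   # strip comments and blanks
        if not line:
            continue
        fields = [f.strip() for f in line.split(';')]
        cp, value = fields[0], fields[1]
        if '..' in cp:
            first, last = (int(c, 16) for c in cp.split('..'))
        else:
            first = last = int(cp, 16)
        yield first, last, value

sample = [
    "# Scripts-11.0.0.txt (excerpt)",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..Z",
    "00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR",
]
for first, last, value in parse_ucd_lines(sample):
    print(f"{first:04X}..{last:04X} -> {value}")
```

Note that this handles only the single-file, single-value case; properties with multiple values per code point, or mappings between code points, need a different record shape, which is exactly Ken's point.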
The UTC attempts to keep a fairly clear line around what constitutes the "UCD proper" (including Unihan.zip), in part so that it is actually possible to run the tools that create the XML version of the UCD, for folks who want to consume a more consistent, single-file format version of the data. But be aware that that isn't everything -- nor would there be much sense in trying to keep expanding the UCD proper to actually represent "everything" in one giant DTD. Second, one of the main obligations of a standards organization is *stability*. People may well object to the ad hoc nature of the UCD data files that have been added over the years -- but it is a *stable* ad-hockery. The worst thing the UTC could do, IMO, would be to keep tweaking formats of data files to meet complaints about one particular parsing inconvenience or another. That would create multiple points of discontinuity between versions -- worse than just having to deal with the ongoing growth in the number of assigned characters and the occasional addition of new data files and properties to the UCD. Keep in mind that there is more to processing the UCD than just "latest". People who just focus on grabbing the very latest version of the UCD and updating whatever application they have are missing half the problem. There are multiple tools out there that parse and use multiple *versions* of the UCD. That includes the tooling that is used to maintain the UCD (which parses *all* versions), and the tooling that creates UCD in XML, which also parses all versions. Then there is tooling like unibook, to produce code charts, which also has to adapt to multiple versions, and bidi reference code, which also reads multiple versions of UCD data files. Those are just examples I know off the top of my head. I am sure there are many other instances out there that fit this profile. 
And none of the applications already built to handle multiple versions would welcome having to permanently build in tracking particular format anomalies between specific versions of the UCD. Third, please remember that folks who come here complaining about the complications of parsing the UCD are a very small percentage of a very small percentage of a very small percentage of interested parties. Nearly everybody who needs UCD data should be consuming it as a secondary source (e.g. for reference via codepoints.net), or as a tertiary source (behind specialized API's, regex, etc.), or as an end user (just getting behavior they expect for characters in applications). Programmers who actually *need* to consume the raw UCD data files and write parsers for them directly should actually be able to deal with the format complexity -- and, if anything, slowing them down to make them think about the reasons for the format complexity might be a good thing, as it tends to put the lie to the easy initial assumption that the UCD is nothing more than a bunch of simple attributes for all the code points. --Ken From unicode at unicode.org Fri Aug 31 15:02:27 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 31 Aug 2018 16:02:27 -0400 Subject: Private Use areas In-Reply-To: <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> Message-ID: <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote: > Asmus Freytag wrote: > >> There are situations where an ad-hoc markup language seems to fulfill a need that is not well served by the existing full-fledged markup languages. 
You find them in internet "bulletin boards" or services like GitHub, where pure plain text is too restrictive but the required text styles purposefully limited - which makes the syntactic overhead of a full-featured mark-up language burdensome. > I am thinking of such an ad-hoc special purpose markup language. > > I am thinking of something like a special purpose version of the FORTH computer language being used but with no user definitions, no comparison operations and no loops and no compiler. Just a straight run through as if someone were typing commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces between commands. For example, circled R might mean use Right-to-left text display. That starts to sound no longer "ad-hoc", but that is not a well-defined term anyway. You're essentially describing a special-purpose markup language or protocol, or perhaps even a programming language. Which is quite reasonable; you should (find some other interested people and) work out some of the details and start writing up parsers and such. > I am thinking that there could be three stacks, one for code points and one for numbers and one for external reference strings such as for accessing a web page or a PDF (Portable Document Format) document or listing an International Standard Book Number and so on. Code points could be entered by circled H followed by circled hexadecimal characters followed by a circled character to indicate Push onto the code point stack. Numbers could be entered in base 10, followed by a circled character to mean Push onto the number stack. A later circled character could mean to take a certain number of code points (maybe just 1, or maybe 0) from the character stack and a certain number of numbers (maybe just 1, or maybe just 0) from the number stack and use them to set some property.
> > It could all be very lightweight software-wise, just reading the characters of the sequence of circled characters and obeying them one by one just one time only on a single run through, with just a few, such as the circled digits, each having its meaning dependent upon a state variable such as, for a circled digit, whether data entry is currently hexadecimal or base 10. I still don't see why you're fixated on using circled characters. You're already dealing with a markup-language type setup, why not do what other markup schemes do? You reserve three or four characters and use them to designate when other characters are not being used in their normal sense but are being used as markup. In XML, when characters are inside '<>' tags, they are not "plain text" of the document, but they mean other things - perhaps things like "right-to-left" or "reference this web page" and so forth, which are exactly the kinds of things you're talking about here. If you don't want to use plain ascii characters because then you couldn't express plain ascii in your text, you're left with exactly the same problem with circled characters: you can't express circled characters in your text. While that is a smaller problem, it can be eliminated altogether by various schemes used by XML or RTF or lightweight markup languages. Reserve a few special characters to give meanings to the others, and arrange for ways to escape your handful of reserved characters so you can express them. More straightforward to say "you have to escape <, >, and & characters" than to say "you have to escape all circled characters." Anyway, this is clearly a whole new high-level protocol you need (or want) to work out, which would *use* Unicode (just like XML and JSON do), but doesn't really affect or involve it (Unicode is all about the "plain text"). Kind of getting off-topic, but get some people interested and start a mailing list to discuss it. Good luck!
~mark From unicode at unicode.org Fri Aug 31 15:11:44 2018 From: unicode at unicode.org (Mark E. Shoulson via Unicode) Date: Fri, 31 Aug 2018 16:11:44 -0400 Subject: Private Use areas In-Reply-To: <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> References: <4826651.5138.1535444498189.JavaMail.root@webmail11.bt.ext.cpcloud.co.uk> <19054743.5414.1535444772290.JavaMail.defaultUser@defaultHost> Message-ID: On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote: > Hi > > Mark E. Shoulson wrote: > >> I'm not sure what the advantage is of using circled characters instead of plain old ascii. > > My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters. What if circled characters are used in the text encoded in the file? They're characters too, people use them and all. Whenever you designate some characters to be used in a way outside their normal meaning, you have the problem of how to use them *with* their normal meaning. So there are various escaping schemes and all. So in XML, all characters have their normal meanings - except <, >, and &, which mean something special and change the interpretations of other nearby characters (so "bold" is a word in English that appears in the text, but "<b>" is part of an instruction to the renderer that doesn't appear in the text.) And the price is that those three characters have to be expressed differently (&lt;, &gt;, and &amp;).
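[The reserve-and-escape approach Mark describes can be sketched with Python's standard library, which escapes exactly those three characters; a minimal illustration, not part of any proposal in this thread.]

```python
# Sketch of the escaping trade-off: reserve a few characters (<, >, &)
# and escape them, instead of reserving a large block like the circled
# characters. Uses only the Python standard library.
from xml.sax.saxutils import escape, unescape

text = 'Set direction with <rtl> & friends'
markup_safe = escape(text)
print(markup_safe)   # Set direction with &lt;rtl&gt; &amp; friends
assert unescape(markup_safe) == text   # the escaping is reversible
```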
I don't really see what you gain by branding some large swath of Unicode ("circled characters") as "special" and not meaning their usual selves, and for that matter making these hard-to-type characters *necessary* for using your scheme, when you could do something like what XML does, and say "everything between < and > is to be interpreted specially, and there, these characters have the following meanings" and then have some other way of expressing those two reserved characters. (Not saying you need to do it XML's way, but something like that: reserve a small number of characters that have to be escaped, not some huge chunk.) > > My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format. That's another way of saying that this is a markup format which accepts a large variety of plain texts. Because you ARE talking about making a "particular markup format," just a different and new one. I guess there's not even any reason for me to argue the point, though, since it is up to you how to design your markup language, and you can take advice (or not) from anyone you like. Draw up some design, find some interested people, start a discussion, and work it out. (But not here; this list is for discussing Unicode.)
~mark From unicode at unicode.org Fri Aug 31 15:43:49 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 31 Aug 2018 21:43:49 +0100 (BST) Subject: Private Use areas In-Reply-To: <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> Message-ID: <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> Hi Thank you for your posts from earlier today. Actually I learned about JSON yesterday and I am thinking that using JSON could well be a good idea. I found a helpful page with diagrams. http://www.json.org/ Although I hope that a format of recording information about the properties of particular uses of Private Use Area characters will become implemented as a practicality, and that that format can be applied in practice where desired, and indeed I would be happy to participate in a group project, I do not know enough about Unicode properties to play a major role or to lead such a project. 
William Overington Friday 31 August 2018 From unicode at unicode.org Fri Aug 31 15:59:06 2018 From: unicode at unicode.org (William_J_G Overington via Unicode) Date: Fri, 31 Aug 2018 21:59:06 +0100 (BST) Subject: Private Use areas In-Reply-To: <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> References: <2869165.40702.1535399404894.JavaMail.root@webmail27.bt.ext.cpcloud.co.uk> <18031518.41865.1535401109738.JavaMail.defaultUser@defaultHost> <3180558.42611.1535402679543.JavaMail.defaultUser@defaultHost> <17197721.29445.1535471934478.JavaMail.defaultUser@defaultHost> <8426ec70-6548-15c4-de89-ab67b044c3be@kli.org> <21973754.43151.1535748229258.JavaMail.defaultUser@defaultHost> Message-ID: <16515804.43443.1535749146105.JavaMail.defaultUser@defaultHost> Hi I have now found the following document. http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf William Overington Friday 31 August 2018 From unicode at unicode.org Fri Aug 31 23:18:32 2018 From: unicode at unicode.org (Marcel Schneider via Unicode) Date: Sat, 1 Sep 2018 06:18:32 +0200 (CEST) Subject: UCD in XML or in CSV?
In-Reply-To: <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> References: <201808301827.w7UIRbqF028462@unicode.org> <39633706.52606.1535691517838.JavaMail.www@wwinf1d37> <20180831081953.68476d36@spixxi> <76cea59f-f676-8bb8-0380-bb52bdb081fd@att.net> Message-ID: <1369576173.49.1535775512871.JavaMail.www@wwinf1d31> On 31/08/18 19:59 Ken Whistler via Unicode wrote: [...] > Second, one of the main obligations of a standards organization is > *stability*. People may well object to the ad hoc nature of the UCD data > files that have been added over the years -- but it is a *stable* > ad-hockery. The worst thing the UTC could do, IMO, would be to keep > tweaking formats of data files to meet complaints about one particular > parsing inconvenience or another. That would create multiple points of > discontinuity between versions -- worse than just having to deal with > the ongoing growth in the number of assigned characters and the > occasional addition of new data files and properties to the UCD. I did not want to make trouble asking for moving conventions back and forth. I'd like to learn why UnicodeData.txt was released as a draft without a header or anything, given Unicode knew well in advance that the scheme adopted at first release would be kept stable for decades or forever. Then I'd like to learn how Unicode came to not devise a consistent scheme for all the UCD files, if any such could be devised, so that people could assess whether complaints about inconsistencies are well-founded or not. It is not enough for me that a given ad-hockery is stable; IMO it should also be well-designed, as a standards body is answerable to history. That is not what one can say about UnicodeData.txt, although it is the only effectively formatted file in the UCD for streamlined processing. Was there not enough time to think about a header line and a file header?
With the header line it would be flexible, and all the problems would be solved by specifying that parsers should start by counting the number of fields before creating storage arrays. We are lacking a real history of Unicode, explaining why everybody was in a hurry. "Authors falling like flies" is the only hint that comes to mind. And given that Unicode appears to have missed that opportunity, I'd like to discuss whether it would be time to add a more accomplished file for better usability. > > Keep in mind that there is more to processing the UCD than just > "latest". People who just focus on grabbing the very latest version of > the UCD and updating whatever application they have are missing half the > problem. There are multiple tools out there that parse and use multiple > *versions* of the UCD. That includes the tooling that is used to > maintain the UCD (which parses *all* versions), and the tooling that > creates UCD in XML, which also parses all versions. Then there is > tooling like unibook, to produce code charts, which also has to adapt to > multiple versions, and bidi reference code, which also reads multiple > versions of UCD data files. Those are just examples I know off the top > of my head. I am sure there are many other instances out there that fit > this profile. And none of the applications already built to handle > multiple versions would welcome having to permanently build in tracking > particular format anomalies between specific versions of the UCD. That point is clear to me, and even when suggesting to make changes to BidiMirrored.txt, I had alternatives with a stable existing file and a new enhanced file. But what is totally unclear to me is what role old versions play in compiling the latest data. Deltas are OK, research on a particular topic in old data is OK, but what does it mean to need to parse *all* versions to get the newest products?
> > Third, please remember that folks who come here complaining about the > complications of parsing the UCD are a very small percentage of a very > small percentage of a very small percentage of interested parties. > Nearly everybody who needs UCD data should be consuming it as a > secondary source (e.g. for reference via codepoints.net), or as a > tertiary source (behind specialized API's, regex, etc.), or as an end > user (just getting behavior they expect for characters in applications). > Programmers who actually *need* to consume the raw UCD data files and > write parsers for them directly should actually be able to deal with the > format complexity -- and, if anything, slowing them down to make them > think about the reasons for the format complexity might be a good thing, > as it tends to put the lie to the easy initial assumption that the UCD > is nothing more than a bunch of simple attributes for all the code points. That makes no sense to me. UCD raw data is and remains a primary source; I see no way to consume it as a secondary source or as a tertiary source. Do you mean to consume it via secondary or tertiary sources? Then we actually appear to consume those sources instead of the UCD raw data. These sources are fine for the purpose of getting information about some particular code points, but most of the tools I remember don't allow filtering values and computing overviews, nor adding data, as we can do in spreadsheet software. Honestly, are we so few people using Excel for Unicode data? Even Excel Starter, which I have, is a great tool helping to perform tasks I fail to get done with other tools, even spreadsheet software. So I beg you to please spare me the tedious conversion from the clumsy UCD XML file to handy CSV. BTW the former could be cleaned up. As already said, if this discussion has a positive outcome, a request in due form may follow. But I have no time to work out any more papers if there is no point. Regards, Marcel
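[A streaming XML-to-CSV conversion of the kind discussed in this thread can be sketched in a few lines of Python. The element and attribute names ("char", "cp", "na", "gc", "sc") and the namespace follow UAX #42; the file paths and the choice of columns are assumptions for illustration.]

```python
# Hedged sketch: stream ucd.nounihan.flat.xml with
# ElementTree.iterparse so the whole file never sits in memory,
# and emit a few chosen attributes as semicolon-delimited CSV.
import csv
import xml.etree.ElementTree as ET

# Namespace of the UCD in XML, per UAX #42.
NS = '{http://www.unicode.org/ns/2003/ucd/1.0}'

def ucd_to_csv(xml_path, csv_path, attrs=('cp', 'na', 'gc', 'sc')):
    with open(csv_path, 'w', newline='') as out:
        writer = csv.writer(out, delimiter=';')
        writer.writerow(attrs)                      # header line
        for _, elem in ET.iterparse(xml_path):      # 'end' events only
            # Range records use first-cp/last-cp instead of cp;
            # this sketch keeps only single-code-point records.
            if elem.tag == NS + 'char' and 'cp' in elem.attrib:
                writer.writerow(elem.get(a, '') for a in attrs)
                elem.clear()                        # free memory as we go
```

Called as, e.g., ucd_to_csv('ucd.nounihan.flat.xml', 'ucd.csv'), this produces a header line followed by one record per code point, which a spreadsheet can filter and sort.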