From kenwhistler at att.net Fri May 1 09:17:31 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 01 May 2015 07:17:31 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To:
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net>
Message-ID: <55438AFB.6020000@att.net>

Koji,

Personally, I don't have a horse in this race, because I am not responsible for any linebreaking implementation -- so a change for halfwidth katakana wouldn't matter one way or the other to me.

Secondly, there is no formal stability guarantee constraining Line_Break property values (other than the generic guarantee that the property itself or existing aliases cannot be *removed* from the standard). Nor is there any stability guarantee regarding the rest of the algorithm definition in UAX #14. So in principle, the UTC could rewrite it completely. But I doubt that that would be in anybody's interest at this point. ;-)

But as I see it, the way this should work is for the major stakeholders who *do* have implemented linebreaking algorithms depending on UAX #14 working in released products (and that would include people speaking for various browsers and for Apple products in general, I think) to be the ones either pushing for a change, because it would make their behavior more correct and acceptable for Japanese, or pushing back *against* a change, because they depend on UAX #14 stability and would prefer tweaking the behavior in their implementations instead. So I'd like to see a formal proposal for a change (specified *exactly* as to the set of characters affected) brought to the UTC, where implementers and users of ICU could make the case for or against.

The other thing that I think would need to happen here is that any proposal should also provide suggested wording for UAX #14 which would explain why halfwidth katakana specifically need to break with the general principles that were used 15 years ago to assign LB classes based on East_Asian_Width considerations, and instead need to match the LB classes of their fullwidth katakana counterparts. That should be made explicit in the text of UAX #14, so somebody else doesn't "discover" another inconsistency between sets of values and try to change things back later on -- not knowing the rationale for the values.

Because a well-formed proposal for a change like this involves both a justification for a property value change *and* a corresponding fix to annex text, I think this is too late in the cycle to be taken as just beta feedback for the Version 8.0 release, unfortunately. Because of the potential hit on existing implementations (and test cases), this needs full review, and should instead be pushed as an early proposal for the Version 9.0 release cycle.

--Ken

On 5/1/2015 5:33 AM, Koji Ishii wrote:
> I support Makoto for the change. Nobody should appreciate that behavior, whether worked around locally (Firefox, IE) or left unnoticed (Chrome). Rather than implementing yet another workaround in Chrome, I wish it could finally be fixed after 15 years.
>
> [...]
From kojiishi at gmail.com Fri May 1 07:33:49 2015
From: kojiishi at gmail.com (Koji Ishii)
Date: Fri, 1 May 2015 21:33:49 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55403267.9060202@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net>
Message-ID:

I support Makoto for the change. Nobody should appreciate that behavior, whether worked around locally (Firefox, IE) or left unnoticed (Chrome). Rather than implementing yet another workaround in Chrome, I wish it could finally be fixed after 15 years.

If this were an issue where 5 people say break and 5 say not to -- or, considering the long life of the bug, 9 say break and 1 says not to -- I understand that Ken's answer might make more sense. However, I'm quite sure that this is a 10-0 issue. Everyone using UAX #14 has to choose between tailoring, leaving it unnoticed, or won't-fix. I think that kind of thing is better fixed.

Half-width CJK should follow the same line breaking classes as their wide counterparts. From that point of view, half-width Hangul being AL is actually correct. (Note that this is not the same as the full-width forms, which oftentimes have different classes than their narrow counterparts.)

Half-width punctuation marks already have correct classes, so they're fine. The symbols in U+FFE8-FFEE are AL, which also looks incorrect, but I do not find these code points in any CJK legacy encoding. Where did they come from? The logical thing would be to assign the same classes as their wide counterparts, but I can't be sure without knowing where they came from.

Ken, does this change cause problems in terms of the stability policy?

/koji

> On Apr 29, 2015, at 10:22, Ken Whistler wrote:
>
> Taking this thread back to the original question...
>
> The Line_Break property values for halfwidth katakana (lb=AL) and regular katakana (lb=ID) have been stable since they were first defined for Unicode 3.0 -- 15 years ago.
>
> Regardless of whether lb=AL is the optimal assignment for the halfwidth katakana, it seems likely to me that trying to *change* that Line_Break assignment, just for halfwidth katakana, at this late date, would be more destabilizing for existing implementations than helpful.
>
> The citations below show *different* behavior between browsers for linebreaking around halfwidth katakana. That suggests that Firefox and IE11 have already provided tailoring to better match expectations. The correct avenue forward, it seems to me, would be to pursue bugs against browsers that do not show expected behavior, to see if improvements there are feasible, rather than to modify the base Line_Break property values that everybody has to tailor *from*.
>
> Note that this is not *just* a Japanese problem, nor a matter of not matching JIS X 4051. UAX #14 is *not* a direct implementation of JIS X 4051 rules, although it is certainly informed by them and has many Line_Break values introduced to get default behavior closer to the Japanese rules for linebreaking. And the compatibility halfwidth characters in the standard also include halfwidth jamo and symbols, so any changes would also need to be considered in the context of consistency for those and for *Korean* rules, as well as for Japanese.
>
> --Ken
>
> On 4/27/2015 10:57 PM, Makoto Kato wrote:
>> Hi, Suzuki-san. Thank you for the reply.
>>
>>> At present, I have no objection to adding halfwidth katakana to the ideographic class in UAX#14, but I'm unfamiliar with the (negative) impact caused by the lack of halfwidth katakana in it. Could you tell me if you know anything?
>>
>> Since half-width katakana isn't ID, lines don't break around it the way they do for full-width katakana.
>>
>> Firefox and IE11 define half-width katakana as ID, so its line breaking is the same as for full-width katakana. Chrome doesn't define it as ID, so half-width katakana doesn't allow a line break per character.
>>
>> Although I have read JIS X 4051, it doesn't say that half-width katakana and full-width katakana behave differently.
>>
>>> I guess the inclusion or exclusion in other classes, like AI, AL, CJ, JL, JV, JT, SA, might be quite important to realize appropriate line breaking, but the inclusion or exclusion in the ID class does not seem to be important. If the inclusion in the ID class is important, more characters (e.g. Bopomofo) should be considered for full coverage. What do you think?
>>
>> My question is why the half-width katakana characters aren't in the same class as the full-width katakana characters. In the current spec, half-width katakana is defined as AL, so moving it to ID changes the break rule strongly (non-breaking -> break before or after).
>>
>> -- Makoto
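A quick way to see the property split the thread keeps referring to is to query the data Python's standard library exposes. This sketch is illustrative only (it is not part of the original exchange); the stdlib does not expose Line_Break, but it does expose the East_Asian_Width and General_Category values that, per Ken's later message, the LB classes were originally derived from:

    import unicodedata

    # Halfwidth katakana KA (U+FF76) vs. fullwidth katakana KA (U+30AB).
    # The thread states UAX #14 assigns lb=AL to the halfwidth form and
    # lb=ID to the fullwidth form; here we can at least see the H vs. W
    # East_Asian_Width split that the original assignments followed.
    for ch in ("\uFF76", "\u30AB"):
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"ea={unicodedata.east_asian_width(ch)}, "  # 'H' vs. 'W'
            f"gc={unicodedata.category(ch)}"            # 'Lo' for both
        )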
From asmus-inc at ix.netcom.com Fri May 1 09:47:38 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 01 May 2015 07:47:38 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55438AFB.6020000@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net>
Message-ID: <5543920A.7060906@ix.netcom.com>

On 5/1/2015 7:17 AM, Ken Whistler wrote:
> Koji,
>
> [...]
>
> But as I see it, the way this should work is for the major stakeholders who *do* have implemented linebreaking algorithms depending on UAX #14 working in released products (and that would include people speaking for various browsers and for Apple products in general, I think) to be the ones either pushing for a change, because it would make their behavior more correct and acceptable for Japanese, or pushing back *against* a change, because they depend on UAX #14 stability and would prefer tweaking the behavior in their implementations instead. So I'd like to see a formal proposal for a change (specified *exactly* as to the set of characters affected) brought to the UTC, where implementers and users of ICU could make the case for or against.

I would go further and suggest that the UTC make no change until it has positively heard from a representative sample of users/implementers.

This kind of seemingly innocuous change does affect implementations, but implementers are usually not expecting to have the ground shift under them after a decade or more of stable property assignments. Silence on their part may just as likely be the result of failing to appreciate the possibility of an adverse outcome as of actual acquiescence.

To the degree that the CSS working group relies on UAX#14 as a default in some/any situations, it would be imperative to hear from them as well, before taking any action.

In principle, this should be the stated procedure of the UTC when making any change in long-standing property assignments -- particularly for widely deployed scripts.

That said, with proper buy-in from stakeholders, I see no objection to making a change.

A./

> [...]

From mpsuzuki at hiroshima-u.ac.jp Fri May 1 10:25:24 2015
From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya)
Date: Sat, 02 May 2015 00:25:24 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55438AFB.6020000@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net>
Message-ID: <55439AE4.4020109@hiroshima-u.ac.jp>

Dear Ken,

Ken Whistler wrote:
> The other thing that I think would need to happen here is that any proposal should also provide suggested wording for UAX #14 which would explain why halfwidth katakana specifically need to break with the general principles that were used 15 years ago to assign LB classes based on East_Asian_Width considerations, and instead need to match the LB classes of their fullwidth katakana counterparts. [...]

Excuse me, is there any record of the discussion, 15 years ago, of how the UAX #14 class for halfwidth katakana was decided? If there is, I would like to see a sample text (in halfwidth katakana) and the expected layout result for it.

You commented that the UAX #14 class should not be changed and that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL -- without understanding it, we cannot determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".

If the UTC can supply the "expected layout result for halfwidth katakana" (used to define the class in the current UAX #14), it would be helpful for developers evaluating the proposed tailoring algorithm.

Regards,
mpsuzuki

From kenwhistler at att.net Fri May 1 11:48:11 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 01 May 2015 09:48:11 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55439AE4.4020109@hiroshima-u.ac.jp>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp>
Message-ID: <5543AE4B.5020904@att.net>

Suzuki-san,

On 5/1/2015 8:25 AM, suzuki toshiya wrote:
> Excuse me, is there any record of the discussion, 15 years ago, of how the UAX #14 class for halfwidth katakana was decided? If there is, I would like to see a sample text (in halfwidth katakana) and the expected layout result for it.
The *founding* document for the UTC discussion of the initial Line_Break property values 15 years ago was:

http://www.unicode.org/L2/L1999/99179.pdf

and the corresponding table draft (before approval and conversion into the final format that was published with UTR #14 -- later /UAX/ #14) was:

http://www.unicode.org/L2/L1999/99180.pdf

There is nothing different or surprising in terms of values there. The halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in that earliest draft, as of 1999.

What is new information, perhaps, is the explicit correlation that can be found in those documents with the East_Asian_Width properties, and the explanation in L2/99-179 that the EAW property values were explicitly used to make distinctions for the initial LB values.

There is no sample text or expected layout results from that time period, because that was not the basis for the original UTC decisions on any of this. Initial LB values were generated based on existing General_Category and EAW values, using general principles. They were not generated by examining and specifying in detail the line breaking behavior for every single script in the standard, and then working back from those detailed specifications to attempt to create a universal specification that would replicate all of that detailed behavior. Such an approach would have been nearly impossible, given the state of all the data, and might have taken a decade to complete.

That said, Japanese line breaking was no doubt considered as part of the overall background, because the initial design for UTR #14 was informed by experience in implementing line breaking algorithms at Microsoft in the 90's.

> You commented that the UAX #14 class should not be changed and that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL -- without understanding it, we cannot determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".

See above, in L2/99-179. *That* was the justification. It had nothing to do with specific environment, locale, or surrounding text.

> If the UTC can supply the "expected layout result for halfwidth katakana" (used to define the class in the current UAX #14), it would be helpful for developers evaluating the proposed tailoring algorithm.

UAX #14 was never intended to be a detailed, script-by-script specification of line layout results. It is a default, generic, universal algorithm for line breaking that does a decent, generic job of line breaking in generic contexts without tailoring or specific knowledge of the language, locale, or typographical conventions in use.

UAX #14 is not a replacement for a full specification of kinsoku rules for Japanese, in particular. Nor is it intended as any kind of replacement for JIS X 4051.

Please understand this: UAX #14 does *NOT* tell anyone how Japanese text *should* line break. Instead, it is Japanese typographers, users and standardizers who tell implementers of line break algorithms for Japanese what the expectations for Japanese text should be, in what contexts. It is then the job of the UTC and of the platform and application vendors to negotiate the details of which part of that expected behavior makes sense to cover by tweaking the default line-breaking algorithm and the Line_Break property values for Unicode characters, which part makes sense to cover by adjusting commonly accessible and agreed-upon tailoring behavior (or public standards like CSS), and finally which part should instead be addressed by value-added, proprietary implementations of high-end publishing software.

Regards,

--Ken
From asmus-inc at ix.netcom.com Fri May 1 14:12:59 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 01 May 2015 12:12:59 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <5543AE4B.5020904@att.net>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net>
Message-ID: <5543D03B.80603@ix.netcom.com>

Thank you, Ken, for your dedicated archeological efforts.

I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products, some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.

However, recently, the W3C has spent considerable effort looking into different layout-related algorithms and specifications. If, in that context, a consensus approach is developed that points to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow the UTC to consider tinkering with such a long-standing set of property assignments.

This would be true especially if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.

So, if implementers of the legacy behavior are amenable to achieving this by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.

We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers. Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as the W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.

A./

On 5/1/2015 9:48 AM, Ken Whistler wrote:
> Suzuki-san,
>
> [...]
From kojiishi at gmail.com Sun May 3 11:47:49 2015
From: kojiishi at gmail.com (Koji Ishii)
Date: Mon, 4 May 2015 01:47:49 +0900
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <5543D03B.80603@ix.netcom.com>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com>
Message-ID:

Thank you so much, Ken and Asmus, for the detailed guidance and history. This helps me a lot.

In terms of time frame, I don't insist on a specific one; Unicode 9 is fine if that works well for all.

I'm not sure how much history and postmortem has to be baked into the section of UAX#14 -- hopefully not much, because I'm not familiar with how it was defined beyond what Ken and Asmus kindly provided in this thread. But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, the East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted -- but writing it up as a document doesn't look like an easy task.

I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. I'm on the Chrome team now, and the only problem for me in fixing this in Chrome is to justify why Chrome should tailor rather than fixing UAX#14 (and the bug priority...)

Either Makoto or I can bring it up to the CSS WG and get back to you.

/koji

On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) wrote:
> Thank you, Ken, for your dedicated archeological efforts.
>
> [...]
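Koji's observation about which Halfwidth and Fullwidth Forms carry which class can be checked directly against LineBreak.txt in the UCD. A sketch, not from the thread: it assumes the file (https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt) has been downloaded locally, and the classes it prints naturally depend on the version of the data file.

    import re

    # Print the Line_Break assignments inside the Halfwidth and Fullwidth
    # Forms block (U+FF00..U+FFEF), e.g. to verify the AL values reported
    # above for U+FF66..U+FF9F and U+FFE8..U+FFEE.
    line_re = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

    with open("LineBreak.txt", encoding="utf-8") as f:
        for line in f:
            m = line_re.match(line)
            if not m:
                continue  # comment or blank line
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)
            if end < 0xFF00 or start > 0xFFEF:
                continue  # outside the block
            print(f"{m.group(1)}..{m.group(2) or m.group(1)}  lb={m.group(3)}")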
From asmus-inc at ix.netcom.com Sun May 3 14:53:19 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 03 May 2015 12:53:19 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To:
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com>
Message-ID: <55467CAF.4080401@ix.netcom.com>

On 5/3/2015 9:47 AM, Koji Ishii wrote:
> Thank you so much, Ken and Asmus, for the detailed guidance and history. This helps me a lot.
>
> [...] But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, the East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted -- but writing it up as a document doesn't look like an easy task.

Koji,

kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).

Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable. The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable. I think you'd want to focus on a survey of modern practice in implementations (and if you have data on some of them going back to the '90s, all the better).

From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between the LB and EAW properties, but that kind of self-consistency check just makes sure that all characters in some group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within). It seems entirely plausible that this process misfired for the characters in question -- the more likely so, given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any divergence from other MS products even more likely to be an oversight.

I do want to help the UTC establish a precedent of getting changes like this endorsed by a representative sample of implementers and key external standards (where applicable; in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios). Hence my insistence that you go out and drum up support.

But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.

A./

> I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. [...]
>
> Either Makoto or I can bring it up to the CSS WG and get back to you.
>
> /koji
>
> [...]

From richard.wordingham at ntlworld.com Mon May 4 08:47:33 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 14:47:33 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <55314383.5070507@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com>
Message-ID: <20150504144733.53247fcb@JRWUBU2>

On Fri, 17 Apr 2015 10:31:47 -0700
"Asmus Freytag (t)" wrote:

> But permit me to ask one question up front. What would be served by making such a sweeping change at this juncture, after 25 years of established practice?

I suspect the idea is to have a way of unobtrusively supplying the Bidi_Mirrored value in a character pick-list, namely by using the words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. As I dimly recall the use of ']a, b[' to denote an open interval, the proposed solution is not complete, but the complete solution is not obvious to me. I for one don't want to have to choose a non-English locale to type right-to-left text.

Richard.
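The property Richard is alluding to is machine-readable, so a picker does not have to smuggle it into the name. A stdlib sketch (again not part of the thread) that reads Bidi_Mirrored and Bidi_Class for the paired punctuation being discussed:

    import unicodedata

    # U+0028/U+0029 are *named* LEFT/RIGHT PARENTHESIS, but what governs
    # their appearance in right-to-left text is the Bidi_Mirrored property:
    # a mirrored character is drawn with the mirrored glyph whenever its
    # resolved directionality is right-to-left, whatever the name says.
    for ch in "()[]":
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"mirrored={bool(unicodedata.mirrored(ch))}, "  # True for all four
            f"bidi_class={unicodedata.bidirectional(ch)}"   # 'ON' (Other Neutral)
        )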
From richard.wordingham at ntlworld.com Mon May 4 10:07:52 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 16:07:52 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net>
Message-ID: <20150504160752.710e1c72@JRWUBU2>

On Mon, 4 May 2015 16:07:31 +0200 (CEST)
marcel.schneider20 at laposte.net wrote:

> The information about OPENING and CLOSING is one part of the Formal Alias issue. The goal is to make the true names better known and to put within reach of people reading English, that is a huge majority, the full bandwidth of Unicode information in real time. Today, IMHO, the information about (and the availability of) formal aliases seems to be out of reach for many software users who need it when searching for information about characters. It therefore seems consistent to make it better available. The same would apply to informative aliases.
>
> Unicode clearly states in NamesList.txt that "this file should not be parsed for machine-readable information".
>
> By the way, all the informative aliases Unicode added for the information of users, implementers and developers are lost, because they seem to be nowhere else in the UCD.

The UCD file you want is ucdxml/ucd.all.grouped.xml or its flat equivalent. On the Unicode site, they exist as zip files, ucdxml/ucd.all.grouped.zip and ucdxml/ucd.all.flat.zip.

Richard.
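Those ucdxml files carry the aliases, and every other property, in attribute form per UAX #42, so the complaint that the aliases are "nowhere else in the UCD" has a machine-readable answer. A rough sketch, assuming ucd.all.flat.xml has been downloaded and unzipped locally, and using the UAX #42 namespace and attribute names (na for the name; lb, ea, Bidi_M; name-alias children for aliases) as I recall them -- verify against the current schema before relying on this:

    import xml.etree.ElementTree as ET

    NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

    # Pull one character's record out of ucd.all.flat.xml (a large file;
    # parsing it all at once takes a while and a fair amount of memory).
    tree = ET.parse("ucd.all.flat.xml")
    for char in tree.iter(f"{NS}char"):
        if char.get("cp") == "FF76":  # HALFWIDTH KATAKANA LETTER KA
            print("name:   ", char.get("na"))
            print("lb:     ", char.get("lb"))
            print("ea:     ", char.get("ea"))
            print("Bidi_M: ", char.get("Bidi_M"))
            for alias in char.findall(f"{NS}name-alias"):
                print("alias:  ", alias.get("alias"), f"({alias.get('type')})")
            break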
> I for one don't want to have to choose a non-English > locale to type right-to-left text. Non-sequitur? A./ From asmus-inc at ix.netcom.com Mon May 4 10:34:26 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 04 May 2015 08:34:26 -0700 Subject: NamesList, =?UTF-8?B?Q29kZcKgQ2hhcnRzLCBJU08vSUVDwqAxMDY0Ng==?= In-Reply-To: <20150504160752.710e1c72@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <278590126.2931716.1430748451728.JavaMail.zimbra@laposte.net> <20150504160752.710e1c72@JRWUBU2> Message-ID: <55479182.2060202@ix.netcom.com> Richard, as I wrote in my previous message, not knowing the first thing about character properties, some people immediately propose to carry all that information in the character name... A./ On 5/4/2015 8:07 AM, Richard Wordingham wrote: > On Mon, 4 May 2015 16:07:31 +0200 (CEST) > marcel.schneider20 at laposte.net wrote: > >> The information about OPENING and CLOSING is one part of the >> Formal Alias issue. The goal is to make the true names better known >> and to allow people reading English, that is a huge majority, to get >> at reach the full bandwith of Unicode information in real time. >> Today, IMHO, the information about (and the availability of) >> formal aliases seems to be out of reach for much software users who >> are confronted with when searching for information about characters. >> It therefore seems to be consistent to make it better available. >> The same would apply to informative aliases. >> >> Unicode clearly states in NamesList.txt, that ?this file should not >> be parsed for machine-readable information?. >> By the way, all the informative aliases Unicode added for >> the information of users, implementers and developers, are lost >> because they seem to be nowhere else in the UCD. > The UCD file you want is ucdxml/ucd.all.grouped.xml or its flat > equivalent. On the Unicode site, they exists as zip files, > ucdxml/ucd.all.grouped.zip and ucdxml/ucd.all.flat.zip. > > Richard. > > From richard.wordingham at ntlworld.com Mon May 4 11:42:26 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 4 May 2015 17:42:26 +0100 Subject: NamesList, =?ISO-8859-1?B?Q29kZaBDaGFydHMsIElTTy9JRUOgMTA2?= =?ISO-8859-1?B?NDY=?= In-Reply-To: <5547911E.4080507@ix.netcom.com> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> Message-ID: <20150504174226.64433e65@JRWUBU2> On Mon, 04 May 2015 08:32:46 -0700 "Asmus Freytag (t)" wrote: > On 5/4/2015 6:47 AM, Richard Wordingham wrote: > > I suspect the idea is to have a way of unobtrusively supplying the > > Bidi_Mirrored value in a character pick-list, namely the use of the > > words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. > > Reading this discussion, I sometimes wonder whether people have ever > heard of character properties? I believe most ordinary computer users have not heard of them. Most people do not knowingly have the UCD to hand, or even UnicodeData.txt. > No way to pack all the information into the name, and even character > properties aren't covering all of them. Unfortunately, when choosing a character from a character picker, the most help one is likely to get is the character name. The name is actually quite useful when the glyph is not as one expects or the distinguishing features are not readily visible. 
Sometimes, however, the names are distinctly unhelpful. Perhaps 'DEVANAGARI DANDA' should have a correcting alias 'DANDA' (or 'INDIAN DANDA'?) to reassure people that it is also the Bengali/Tamil etc. danda. > > I for one don't want to have to choose a non-English > > locale to type right-to-left text. > Non-sequitur? No. The clear issue raised was of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface. Richard. From eliz at gnu.org Mon May 4 11:59:26 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 04 May 2015 19:59:26 +0300 Subject: NamesList, =?iso-8859-1?Q?Code=A0Charts=2C_ISO=2FIEC=A010646?= In-Reply-To: <20150504174226.64433e65@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> Message-ID: <83r3qws2hd.fsf@gnu.org> > Date: Mon, 4 May 2015 17:42:26 +0100 > From: Richard Wordingham > > > > I for one don't want to have to choose a non-English > > > locale to type right-to-left text. > > Non-sequitur? > > No. The clear issue raised was of knowing whether a character's glyph > would change with the bidi context. One solution that immediately > comes to mind is to display the character in a pick list according to > the user's locale. User's locale has nothing to do with bidi context, so this would be simply wrong. From asmus-inc at ix.netcom.com Mon May 4 12:22:16 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 04 May 2015 10:22:16 -0700 Subject: NamesList, =?UTF-8?B?Q29kZcKgQ2hhcnRzLCBJU08vSUVDwqAxMDY0Ng==?= In-Reply-To: <20150504174226.64433e65@JRWUBU2> References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> Message-ID: <5547AAC8.6000806@ix.netcom.com> On 5/4/2015 9:42 AM, Richard Wordingham wrote: > On Mon, 04 May 2015 08:32:46 -0700 > "Asmus Freytag (t)" wrote: > >> On 5/4/2015 6:47 AM, Richard Wordingham wrote: >>> I suspect the idea is to have a way of unobtrusively supplying the >>> Bidi_Mirrored value in a character pick-list, namely the use of the >>> words 'OPENING' and 'CLOSING' rather than 'LEFT' and 'RIGHT'. >> Reading this discussion, I sometimes wonder whether people have ever >> heard of character properties? > I believe most ordinary computer users have not heard of them. Most > people do not knowingly have the UCD to hand, or even UnicodeData.txt. But people writing character pickers really should mine these. > >> No way to pack all the information into the name, and even character >> properties aren't covering all of them. > Unfortunately, when choosing a character from a character picker, the > most help one is likely to get is the character name. The name is > actually quite useful when the glyph is not as one expects or the > distinguishing features are not readily visible. > > Sometimes, however, the names are distinctly unhelpful. Perhaps > 'DEVANAGARI DANDA' should have a correcting alias 'DANDA' (or 'INDIAN > DANDA'?) to reassure people that it is also the Bengali/Tamil etc. > danda. That's because the creator of your character picker didn't add any value. 
>>> I for one don't want to have to choose a non-English
>>> locale to type right-to-left text.
>> Non-sequitur?
> No. The clear issue raised was that of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface.

I still don't follow. If I edit text, then the mirroring happens in real time. If it doesn't come out as expected, I can change the character (or use markup).

A./

> Richard.

From verdy_p at wanadoo.fr Mon May 4 12:32:37 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 19:32:37 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504174226.64433e65@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2>
Message-ID:

2015-05-04 18:42 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> > No way to pack all the information into the name, and even character properties aren't covering all of them.
>
> Unfortunately, when choosing a character from a character picker, the most help one is likely to get is the character name. The name is actually quite useful when the glyph is not as one expects or the distinguishing features are not readily visible.

Character pickers are applications, and not in the scope of the standard itself. It's up to the developers of these applications to provide the necessary localisations according to the expectations of their users for a particular language, script, and/or country/region, or even dialectal variant.

You cannot have a single normative character name (in fact not really a name, but a technical identifier) that will match all users' expectations in all cultures. So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).

There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.

From richard.wordingham at ntlworld.com Mon May 4 12:39:56 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 18:39:56 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <83r3qws2hd.fsf@gnu.org>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <83r3qws2hd.fsf@gnu.org>
Message-ID: <20150504183956.191ecac6@JRWUBU2>

On Mon, 04 May 2015 19:59:26 +0300 Eli Zaretskii wrote:

> > Date: Mon, 4 May 2015 17:42:26 +0100
> > From: Richard Wordingham
> >
> > The clear issue raised was that of knowing whether a character's glyph would change with the bidi context. One solution that immediately comes to mind is to display the character in a pick list according to the user's locale.
>
> User's locale has nothing to do with bidi context, so this would be simply wrong.

If the paragraph embedding level is determined by an overriding profile but there is nothing explicit, should not the locale determine the directionality? If so, the locale will often work for indicating whether a character should be displayed in its left-to-right form or its right-to-left form.

Are you, for example, suggesting that a code chart in Arabic should display U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS using the same glyphs as one in the English language?

Richard.

From asmus-inc at ix.netcom.com Mon May 4 12:49:12 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 04 May 2015 10:49:12 -0700
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To:
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2>
Message-ID: <5547B118.10506@ix.netcom.com>

On 5/4/2015 10:32 AM, Philippe Verdy wrote:

> Character pickers are applications, and not in the scope of the standard itself. It's up to the developers of these applications to provide the necessary...

... additions that make their product usable, including any...

> ...localisations according to the expectations of their users for a particular language, script, and/or country/region, or even dialectal variant.
>
> You cannot have a single normative character name (in fact not really a name, but a technical identifier) that will match all users' expectations in all cultures.

Right.

> So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).
>
> There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.

I'm sure there are - there may even be work on a character picker, but do you have any links?

A./

From asmus-inc at ix.netcom.com Mon May 4 13:02:59 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 04 May 2015 11:02:59 -0700
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504183956.191ecac6@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <83r3qws2hd.fsf@gnu.org> <20150504183956.191ecac6@JRWUBU2>
Message-ID: <5547B453.1080806@ix.netcom.com>

On 5/4/2015 10:39 AM, Richard Wordingham wrote:
>> User's locale has nothing to do with bidi context, so this would be simply wrong.
> If the paragraph embedding level is determined by an overriding profile but there is nothing explicit, should not the locale determine the directionality? If so, the locale will often work for indicating whether a character should be displayed in its left-to-right form or its right-to-left form.
>
> Are you, for example, suggesting that a code chart in Arabic should display U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS using the same glyphs as one in the English language?

The disconnect is that a "character picker", to continue your example, could opt to show the shape that would be chosen based on the direction context of the text input location (caret position). That has nothing to do with the "language" of the names or the "locale" of the user.

A./

From verdy_p at wanadoo.fr Mon May 4 13:12:38 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 20:12:38 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <5547B118.10506@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547B118.10506@ix.netcom.com>
Message-ID:

2015-05-04 19:49 GMT+02:00 Asmus Freytag (t):

> On 5/4/2015 10:32 AM, Philippe Verdy wrote:
> > So Unicode and ISO/IEC 10646 have chosen to use and publish a single stable identifier throughout the standardization process; even if it is bad, it will be kept. These names are not even designed to be suitable for all English users (just consider how CJK sinograms are named; those names are not suitable for anyone...).
> >
> > There are open projects (outside Unicode, and even outside CLDR itself) to provide common character names in various locales.
>
> I'm sure there are - there may even be work on a character picker, but do you have any links?

That list is wide open; some projects will start, others will end. Frequently they will change the names shown in previous versions... But you may just start by looking in Wikipedia, which frequently has articles in lots of languages and provides external links. All editions also list various aliases.

Even during the standardisation process there were multiple names discussed, but for tracking discussions, and to allow plain-text searches to find the related discussions from before the character was finally encoded, the technical identifier coming from a formal proposal was kept. Sometimes there were competing proposals for some characters, but once one of these formal proposals has passed an early stage of balloting, the name is stable and should not change (unless an alias was already listed in the accepted proposal and it has been found to be more frequently used in other early discussions). A limited number of proposed names are considered, and proper localisation is definitely not a goal at this early stage: it would have been impossible to produce the standard and encode so many characters if it had been necessary to provide accurate names matching exactly the most frequent uses (or some rarer uses, or future uses that will arise once the character is encoded).

For lists of character pickers, we have a choice among various kinds of applications: accessories for desktop OSes, word-processor tools, web sites, wikis, articles in online forums and blogs, books and facsimiles (PDF, DjVu, photos...), spreadsheets, input method editors, and custom keyboard layouts for onscreen input (or input on touch devices...). The choice is unlimited and expands every day. Even without developing applications, users are inventive and will name the characters as they want in their informal discussions, mails, chats, SMS, tweets...

The Unicode names list is just a basic set of properties, and its names are just technical identifiers that are part of these properties; translation (or even translatability, even in English) is definitely not a goal. Another way to say it: « You don't like these "names"? Great! In fact none of us really like them. Develop your own list of names, publish it, and try convincing others to use your list! »

From richard.wordingham at ntlworld.com Mon May 4 13:51:13 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 4 May 2015 19:51:13 +0100
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <5547AAC8.6000806@ix.netcom.com>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547AAC8.6000806@ix.netcom.com>
Message-ID: <20150504195113.6b2844bf@JRWUBU2>

On Mon, 04 May 2015 10:22:16 -0700 "Asmus Freytag (t)" wrote:

> On 5/4/2015 9:42 AM, Richard Wordingham wrote:
> > On Mon, 04 May 2015 08:32:46 -0700 "Asmus Freytag (t)" wrote:
> >> Reading this discussion, I sometimes wonder whether people have ever heard of character properties?
> > I believe most ordinary computer users have not heard of them. Most people do not knowingly have the UCD to hand, or even UnicodeData.txt.
> But people writing character pickers really should mine these.

I agree. However, some don't even give the character name, a lack that can be really annoying with some diacritics. My workaround is to look up the code point in UnicodeData.txt.

> > One solution that immediately comes to mind is to display the character in a pick list according to the user's locale. Unfortunately, that will not always work. In these days of Unicode, locales are primarily useful for determining the user interface.
> I still don't follow. If I edit text, then the mirroring happens in real time. If it doesn't come out as expected, I can change the character (or use markup).

In a perfect world, perhaps 'in real time', but not immediately. Assuming that the paragraph embedding is not set to right-to-left, if I type <beh, U+0028, jeem>, then on typing jeem the glyph for U+0028 changes from concave on the right to concave on the left, and moves from the right to the left of beh. The idea of displaying the glyphs according to the context of the insertion point does have much merit, but it is not so straightforward if the character picker is a separate application and so lacks the information. Of course, the context is not as simple as left-to-right v. right-to-left, especially for brackets.

Richard.
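Richard's workaround above - and Asmus's point that picker authors should mine the character properties - amounts to only a few lines of code once a property library is available. A minimal sketch, assuming ICU4J is on the classpath (the class name PickerInfo is mine, purely illustrative):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class PickerInfo {
    public static void main(String[] args) {
        int cp = 0x0028; // U+0028 LEFT PARENTHESIS
        // The formal name, as a picker tooltip might show it.
        System.out.println(UCharacter.getName(cp));
        // Bidi_Mirrored tells the picker that the glyph depends on bidi
        // context - exactly the OPENING/CLOSING question discussed above.
        System.out.println("Bidi_Mirrored: "
                + UCharacter.hasBinaryProperty(cp, UProperty.BIDI_MIRRORED));
    }
}

This only reads the properties; whether and how to show a mirrored glyph in the pick list remains the application's decision, as the thread discusses.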
From verdy_p at wanadoo.fr Mon May 4 14:09:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 4 May 2015 21:09:47 +0200
Subject: NamesList, Code Charts, ISO/IEC 10646
In-Reply-To: <20150504195113.6b2844bf@JRWUBU2>
References: <1672337588.11213982.1429290504770.JavaMail.zimbra@laposte.net> <55314383.5070507@ix.netcom.com> <20150504144733.53247fcb@JRWUBU2> <5547911E.4080507@ix.netcom.com> <20150504174226.64433e65@JRWUBU2> <5547AAC8.6000806@ix.netcom.com> <20150504195113.6b2844bf@JRWUBU2>
Message-ID:

2015-05-04 20:51 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Mon, 04 May 2015 10:22:16 -0700 "Asmus Freytag (t)" wrote:
> > But people writing character pickers really should mine these.
>
> I agree. However, some don't even give the character name, a lack that can be really annoying with some diacritics. My workaround is to look up the code point in UnicodeData.txt.

Ideally, a perfect character picker displaying names should allow users to personalize these names, and possibly even save them online to a cloud with their user preferences, possibly with a sharing option allowing users to feed a per-locale database, with votes/ratings, so that this database will progressively be able to return the names that have the best agreement. Such a tool should of course use a local cache when it queries names from the shared database, and should also offer an option to update the cache entirely from a snapshot (just as we perform regular software updates). Such systems do exist for various applications in other domains, e.g. for rating web sites or their security/risk per domain name, or for mail blacklists. The same approach could be used for a localized database of character names. The database would also be able to list known aliases (by sorting them in rating order and extracting the top 10).

With that system it would be even easier to perform plain-text searches for character names using more user-friendly descriptions, capable of finding related characters, or characters suitable for some usage: the character picker would then list all the characters found by name, or by part of their name.

From pedberg at apple.com Mon May 4 16:19:21 2015
From: pedberg at apple.com (Peter Edberg)
Date: Mon, 04 May 2015 14:19:21 -0700
Subject: Why doesn't Ideographic (ID) in UAX#14 have half-width katakana?
In-Reply-To: <55467CAF.4080401@ix.netcom.com>
References: <553EEE6D.2020004@ga2.so-net.ne.jp> <553EFB2E.3010808@hiroshima-u.ac.jp> <55403267.9060202@att.net> <55438AFB.6020000@att.net> <55439AE4.4020109@hiroshima-u.ac.jp> <5543AE4B.5020904@att.net> <5543D03B.80603@ix.netcom.com> <55467CAF.4080401@ix.netcom.com>
Message-ID: <4A7F11C4-2F5B-4156-844C-C21E543FCFC7@apple.com>

I have been checking with various groups at Apple. The consensus here is that we would like to see the linebreak value for halfwidth katakana changed to ID.

- Peter E

> On May 3, 2015, at 12:53 PM, Asmus Freytag (t) wrote:
>
> On 5/3/2015 9:47 AM, Koji Ishii wrote:
>> Thank you so much, Ken and Asmus, for the detailed guides and histories. This helps me a lot.
>>
>> In terms of time frame, I don't insist on a specific one; Unicode 9 is fine if that works well for all.
>>
>> I'm not sure how much history and postmortem have to be baked into that section of UAX#14 - hopefully not much, because I'm not familiar with how it was defined beyond what Ken and Asmus kindly provided in this thread. But from that information, I feel more strongly than before that this was simply an unfortunate oversight. In the document Ken quoted, F and W are distinguished, but H and N are not. In the '90s, East Asian versions of Office and RichEdit were on my radar, and all of them handled halfwidth katakana as ID for line-breaking purposes. That's quite understandable given the number of code points to work on, given the priority of halfwidth katakana, and given the difference between "what line breaking should be" and UAX#14 as Ken noted, but writing it up as a document doesn't look like an easy task.
>
> Koji,
>
> kana are special in that they are not shared among languages. From that perspective, there's nothing wrong with having a "general purpose" algorithm support the rules of the target language (unless that would add undue complexity, which isn't a consideration here).
>
> Based on the data presented informally here in postings, I find your conclusion (oversight) quite believable. The task would therefore be to present the same data in a more organized fashion as part of a formal proposal. Should be doable.
>
> I think you'd want to focus on a survey of modern practice in implementations (and if you have data on some of them going back to the '90s, so much the better).
>
> From the historical analysis it's clear that there was a desire to create assignments that didn't introduce random inconsistencies between LB and EAW properties, but that kind of self-consistency check just makes sure that all characters of some group defined by the intersection of property subsets are treated the same (unless there's an overriding reason to differentiate within the group). It seems entirely plausible that this process misfired for the characters in question - the more likely so, given that the earliest drafts of the tables were based on an implementation also being created by MS around the same time. That makes any difference from other MS products even more likely to be an oversight.
>
> I do want to help the UTC establish a precedent of getting changes like that endorsed by a representative sample of implementers and key external standards (where applicable; in this case that would be CSS), to avoid the chance of creating undue disruption (and to increase the chance that the resulting modified algorithm is actually usable off-the-shelf, for example for "default" or "unknown language" type scenarios).
>
> Hence my insistence that you go out and drum up support. But it looks like this should be relatively easy, as there seems to be no strong case for maintaining the status quo, other than that it is the status quo.
>
> A./
>
>> I agree that implementers and the CSS WG should be involved, but given that IE and FF have already tailored, and all MS products as well, I guess it should not be too hard. I'm on the Chrome team now, and the only problem for me in fixing it in Chrome is to justify why Chrome wants to tailor rather than fixing UAX#14 (and the bug priority...)
>>
>> Either Makoto or I can bring it up to the CSS WG and get back to you.
>>
>> /koji
>>
>> On Sat, May 2, 2015 at 4:12 AM, Asmus Freytag (t) wrote:
>>
>> Thank you, Ken, for your dedicated archeological efforts.
>>
>> I would like to emphasize that, at the time, UAX#14 reflected observed behavior, in particular (but not exclusively) for MS products, some of which (at the time) used an LB algorithm that effectively matched an untailored UAX#14.
>>
>> However, recently, the W3C has spent considerable effort looking into different layout-related algorithms and specifications. If, in that context, a consensus approach is developed that would point to a better "default" behavior for untailored UAX#14-style line breaking, I would regard that as a critical mass of support to allow the UTC to consider tinkering with such a long-standing set of property assignments.
>>
>> This would be true especially if it can be demonstrated that (other than matching legacy behavior) there's no context that would benefit from the existing classification. I note that this was something several posters implied.
>>
>> So, if implementers of the legacy behavior are amenable to achieving it by tailoring, and if the change augments the number of situations where untailored UAX#14-style line breaking can be used, that would be a win that might offset the cost of a disruptive change.
>>
>> We've heard arguments why the proposed change is technically superior for Japanese. We now need to find out whether there are contexts where a change would adversely affect users/implementers. Following that, we would look for endorsements of the proposal from implementers or other standards organizations such as the W3C (and, if at all possible, agreement from those implementers who use the untailored algorithm now). With these three preconditions in place, I would support an effort of the UTC to revisit this question.
>>
>> A./
>>
>> On 5/1/2015 9:48 AM, Ken Whistler wrote:
>>> Suzuki-san,
>>>
>>> On 5/1/2015 8:25 AM, suzuki toshiya wrote:
>>>> Excuse me, is there any record of the discussion of the UAX#14 class for halfwidth katakana from 15 years ago? If there is, I want to see a sample text (of halfwidth katakana) and the expected layout result for it.
>>>
>>> The *founding* document for the UTC discussion of the initial Line_Break property values 15 years ago was:
>>>
>>> http://www.unicode.org/L2/L1999/99179.pdf
>>>
>>> and the corresponding table draft (before approval and conversion into the final format that was published with UTR #14 -- later UAX #14) was:
>>>
>>> http://www.unicode.org/L2/L1999/99180.pdf
>>>
>>> There is nothing different or surprising in terms of values there. The halfwidth katakana were lb=AL and the fullwidth katakana were lb=ID in that earliest draft, as of 1999.
>>>
>>> What is new information, perhaps, is the explicit correlation that can be found in those documents with the East_Asian_Width properties, and the explanation in L2/99-179 that the EAW property values were explicitly used to make distinctions for the initial LB values.
>>>
>>> There are no sample texts or expected layout results from that time period, because that was not the basis for the original UTC decisions on any of this. Initial LB values were generated from existing General_Category and EAW values, using general principles. They were not generated by examining and specifying in detail the line breaking behavior for every single script in the standard, and then working back from those detailed specifications to attempt to create a universal specification that would replicate all of that detailed behavior. Such an approach would have been nearly impossible, given the state of all the data, and might have taken a decade to complete.
>>>
>>> That said, Japanese line breaking was no doubt considered as part of the overall background, because the initial design for UTR #14 was informed by experience in implementing line breaking algorithms at Microsoft in the '90s.
>>>
>>>> You commented that the UAX#14 class should not be changed, but that tailoring the line breaking behaviour would solve the problem (as Firefox and IE11 did). However, some developers may wonder: "there might be a reason why the UTC put halfwidth katakana in AL - without understanding it, we could not determine whether the proposed tailoring should be enabled always, or only for a specific environment (e.g. locale, surrounding text)".
>>>
>>> See above, in L2/99-179. *That* was the justification. It had nothing to do with specific environment, locale, or surrounding text.
>>>
>>>> If the UTC can supply the "expected layout result for halfwidth katakana (used to define the class in the current UAX#14)", it would be helpful for developers evaluating the proposed tailoring algorithm.
>>>
>>> UAX #14 was never intended to be a detailed, script-by-script specification of line layout results. It is a default, generic, universal algorithm for line breaking that does a decent, generic job of line breaking in generic contexts without tailoring or specific knowledge of the language, locale, or typographical conventions in use.
>>>
>>> UAX #14 is not a replacement for a full specification of kinsoku rules for Japanese, in particular. Nor is it intended as any kind of replacement for JIS X 4051.
>>>
>>> Please understand this: UAX #14 does *NOT* tell anyone how Japanese text *should* line break. Instead, it is Japanese typographers, users and standardizers who tell implementers of line break algorithms for Japanese what the expectations for Japanese text should be, and in what contexts. It is then the job of the UTC and of the platform and application vendors to negotiate the details: which part of that expected behavior makes sense to cover by tweaking the default line-breaking algorithm and the Line_Break property values for Unicode characters, which part makes sense to cover by adjusting commonly accessible and agreed-upon tailoring behavior (or public standards like CSS), and finally which part should instead be addressed by value-added, proprietary implementations of high-end publishing software.
>>>
>>> Regards,
>>>
>>> --Ken
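For readers who want to inspect the property values under discussion, the Line_Break classes are directly queryable. A minimal sketch, assuming ICU4J (the class name LbClass is mine; the output reflects whatever Unicode version the ICU on the classpath implements, which matters given the change proposed in this thread):

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class LbClass {
    // Returns the short Line_Break class alias (e.g. "AL", "ID") for a code point.
    static String lb(int cp) {
        int v = UCharacter.getIntPropertyValue(cp, UProperty.LINE_BREAK);
        return UCharacter.getPropertyValueName(UProperty.LINE_BREAK, v,
                UProperty.NameChoice.SHORT);
    }

    public static void main(String[] args) {
        // Per the thread, the fullwidth letter is ID while the halfwidth
        // form was assigned AL - the inconsistency being reported.
        System.out.println("U+30AB KATAKANA LETTER KA:           " + lb(0x30AB));
        System.out.println("U+FF76 HALFWIDTH KATAKANA LETTER KA: " + lb(0xFF76));
    }
}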
From costello at mitre.org Thu May 7 07:46:03 2015
From: costello at mitre.org (Costello, Roger L.)
Date: Thu, 7 May 2015 12:46:03 +0000
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
Message-ID:

Hi Folks,

The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits)

However, not every sequence of four hex digits corresponds to a Unicode character.

Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character?

/Roger

From doug at ewellic.org Thu May 7 12:49:17 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 07 May 2015 10:49:17 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
Message-ID: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>

"Costello, Roger L." wrote:

> Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character?

A tool like this would need to scan the Unicode Character Database, for some given version, to determine which code points have been allocated to a coded character in that version and which have not.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From mark at macchiato.com Thu May 7 13:33:54 2015
From: mark at macchiato.com (Mark Davis)
Date: Thu, 7 May 2015 11:33:54 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
Message-ID:

The simplest approach would be to use ICU in a little program that scans the file. For example, you could write a little Java program that would scan the file, turn any sequence of (\uXXXX)+ into a String, and then test that string with:

static final UnicodeSet OK = new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
...
// inside the scanning function
boolean isOk = OK.containsAll(slashUString);

It is key that it grabs the entire sequence of \uXXXX in a row; otherwise it will get the wrong answer.

Mark

« Il meglio è l'inimico del bene »

From senn at maya.com Thu May 7 14:23:49 2015
From: senn at maya.com (Jeff Senn)
Date: Thu, 7 May 2015 15:23:49 -0400
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net>
Message-ID: <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>

While this may not change the OP's need for such a tool, I read the JSON specification as allowing all code points 0x0000-0xFFFF regardless of whether they map to "valid" Unicode characters.

The allowed use of quoted UTF-16 surrogate pairs for characters with code points > 0xFFFF (without also specifying that unpaired surrogates are invalid) is troubling on the margin, and complicates such a validation.

Another complication is that a "JSON document" might itself be non-ASCII (UTF-8, -16 or -32) and have Unicode characters as literals within quoted strings... Not to mention the ambiguous case of a surrogate pair where half is literal and the other half quoted...
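Fleshing Mark's fragment out into something runnable makes the pairing point concrete. A sketch, assuming ICU4J; the sample JSON literal and the class name ScanEscapes are mine, and a real tool would also need to address Jeff's caveat about literal (non-escaped) halves of surrogate pairs:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.ibm.icu.text.UnicodeSet;

public class ScanEscapes {
    // A maximal run of \uXXXX escapes, so escaped surrogate pairs stay together.
    private static final Pattern RUN = Pattern.compile("(?:\\\\u[0-9A-Fa-f]{4})+");
    private static final UnicodeSet OK =
            new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();

    public static void main(String[] args) {
        String json = "{\"a\":\"\\uD83D\\uDE00\",\"b\":\"\\uFFFF\",\"c\":\"\\uDEAD\"}";
        Matcher m = RUN.matcher(json);
        while (m.find()) {
            StringBuilder sb = new StringBuilder();
            for (String esc : m.group().split("(?=\\\\u)")) {
                if (esc.isEmpty()) continue;
                sb.append((char) Integer.parseInt(esc.substring(2), 16));
            }
            // containsAll iterates by code point: a properly paired escape
            // becomes one supplementary code point, while a lone surrogate or
            // an unassigned code point makes the test fail.
            System.out.println(m.group() + " -> "
                    + (OK.containsAll(sb.toString()) ? "ok" : "suspect"));
        }
    }
}

With this sample input the program flags \uFFFF (unassigned - a noncharacter) and \uDEAD (a lone surrogate) but accepts the \uD83D\uDE00 pair, which decodes to the assigned code point U+1F600.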
From daniel.buenzli at erratique.ch Thu May 7 14:35:00 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 21:35:00 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

On Thursday, 7 May 2015 at 14:46, Costello, Roger L. wrote:

> The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits)
>
> However, not every sequence of four hex digits corresponds to a Unicode character.

If we refer to the wording of RFC 7159, they are using imprecise terminology. They mean "any code point in U+0000 to U+FFFF" (since you need escaped surrogate pairs to be able to escape scalar values that are not in the BMP). You can understand their definition of a "character that may be escaped" from this sentence of section 7 [1]:

"Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point."

However, if you are concerned about wrong surrogate sequences or lone surrogate characters (about which the standard sadly has nothing to say [2]), I have written a best-effort JSON parser [3] that reports them and allows you to continue by replacing the offending escape sequences with U+FFFD. There's a test command-line tool named jsontrip in the distribution that allows you, among other things, to report these errors. For example:

> echo '["\uDEAD"]' | jsontrip
-:1.2-1.8: illegal escape, U+DEAD lone low surrogate

Best,

Daniel

[1] https://tools.ietf.org/html/rfc7159#section-7
[2] https://tools.ietf.org/html/rfc7159#section-8.2
[3] http://erratique.ch/software/jsonm

From daniel.buenzli at erratique.ch Thu May 7 14:59:11 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 21:59:11 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>
References: <20150507104917.665a7a7059d7ee80bb4d670165c8327d.9376fc2991.wbe@email03.secureserver.net> <48A335D4-6350-47E2-AB9D-DB1CBA19D9CA@maya.com>
Message-ID: <5CAE0752A9EF48178D1EFAD08B817590@erratique.ch>

On Thursday, 7 May 2015 at 21:23, Jeff Senn wrote:

> Not to mention the ambiguous case of a surrogate pair where half is literal and the other half quoted...

I don't think this is an issue. It's not ambiguous: the standard says that JSON text shall be encoded in UTF-8, UTF-16 or UTF-32, so what you get in this case is a (UTF-16) character stream decoding error.
Best,

Daniel

From markus.icu at gmail.com Thu May 7 14:59:54 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 7 May 2015 12:59:54 -0700
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters. Some code stores binary data (a sequence of arbitrary 16-bit unsigned integers) in a "string", just because it is easy and fairly efficient to transport.

You should "validate" *text* only when you are certain that it is indeed text. And when you do validate, you might want to be narrower than "assigned character"; for example, you might require Unicode identifiers or XML NMTOKENs or whatever. Also remember that "assigned" and "identifier" and the like depend on the version of Unicode your library currently implements.

markus

From daniel.buenzli at erratique.ch Thu May 7 15:29:27 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 7 May 2015 22:29:27 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

On Thursday, 7 May 2015 at 21:59, Markus Scherer wrote:

> I assume that the JSON spec deliberately allows anything that Java and JavaScript allow. In particular, there is no requirement for a Java String or JavaScript string to contain "text", or well-formed UTF-16, or only assigned characters. Some code stores binary data (a sequence of arbitrary 16-bit unsigned integers) in a "string", just because it is easy and fairly efficient to transport.
>
> You should "validate" *text* only when you are certain that it is indeed text.

Section 8.2 [1] of the spec specifically says that only strings that represent sequences of Unicode scalar values (they say "characters") are interoperable, and that strings that do not represent such sequences, like "\uDEAD", can lead to unpredictable behaviour.

If you want to transmit binary data reliably in JSON, you must apply some form of binary-to-Unicode-scalar-value encoding (as in most text-based interchange formats).

Best,

Daniel

[1] https://tools.ietf.org/html/rfc7159#section-8.2

From verdy_p at wanadoo.fr Thu May 7 19:16:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 02:16:25 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID:

It would be more exact to say that JSON strings, just like strings in JavaScript, Java and many other programming languages, are just binary streams of 16-bit code units. The transport syntax of JSON does not even require that the textual syntax itself be encoded in UTF-16; in most cases it will be transported as UTF-8. So before processing a "text/json" content type, you first have to determine an appropriate character encoding for decoding the syntax (in HTTP you would use a MIME header to specify the charset effectively used, but the "text/json" MIME type defaults to UTF-8). The JSON processor will then decode this text and remap it to an internal UTF-16 encoding (for characters that are not escaped), and the "\uXXXX" escapes will be decoded to plain 16-bit code units.

The result will be a stream of 16-bit code units, which can then be output externally and encoded or stored in any convenient encoding that preserves this stream, EVEN if it is not valid UTF-16. If you need UTF-16 validation, that is not the job of JSON itself (or of Java or JavaScript or the like) but depends on the application using the JSON data: some applications will reject the stream as invalid because they expect their input to be a valid UTF (not necessarily UTF-16 or UTF-8), or they may restrict the supported character set even further (e.g. to just ASCII), or support other encodings such as the GSM encoding for SMS, or just use the lowest 8 bits of each code unit.

JSON by itself is neutral; its syntax just assumes that any binary stream of 16-bit code units is encodable and transportable. It could even be used to transport executable binary code or bitmap image data (such as JPEG or PNG), provided that there's a way to represent the effective binary length (when it is not an exact multiple of 16 bits) with additional data transmitted in the JSON-encoded data. (However, the most common way to carry such binary data in JSON is Base64, for example with the "data:" URL scheme; this scheme is commonly used in CSS, which can be safely embedded in JSON strings.)

I don't think this is a bad thing about JSON: JSON strings are NOT equivalent to text (and not all text is valid Unicode text when it uses specific encodings whose character entities don't have a one-to-one mapping to the UCS - for example, private-use characters that require an external agreement if we want to map them to the PUA of the UCS, or characters that an encoding maps to noncharacters of the UCS), even if there's an "assumed" encoding for the characters that are not reserved by the JSON syntax and not represented as escape sequences (an assumption that also rests on an external agreement about the encoding used in the transport).
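Philippe's Base64 aside is the standard answer for carrying binary payloads in JSON. A minimal sketch using the JDK's own codec (the sample bytes and the "data" field name are mine, purely illustrative):

import java.util.Base64;

public class BinaryInJson {
    public static void main(String[] args) {
        byte[] payload = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};
        // Base64 maps arbitrary bytes onto ASCII, which is always a sequence
        // of Unicode scalar values, so the resulting JSON is interoperable.
        String b64 = Base64.getEncoder().encodeToString(payload);
        System.out.println("{\"data\":\"" + b64 + "\"}"); // {"data":"3q2+7w=="}
        byte[] roundTripped = Base64.getDecoder().decode(b64); // original bytes
    }
}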
From daniel.buenzli at erratique.ch Thu May 7 20:22:01 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Fri, 8 May 2015 03:22:01 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References:
Message-ID: <406345450D52417C9DEE234A6C0662A2@erratique.ch>

On Friday, 8 May 2015 at 02:16, Philippe Verdy wrote:
> It would be more exact to say that JSON strings, just like strings in JavaScript, Java and many other programming languages, are just binary streams of 16-bit code units.

I suggest you have a careful read of RFC 7159, as it specifically implies that this is not the model it supports (albeit using broken, or let's say ambiguous/imprecise, Unicode terminology).

> The JSON processor will then decode this text and remap it to an internal UTF-16 encoding (for characters that are not escaped), and the "\uXXXX" escapes will be decoded to plain 16-bit code units. The result will be a stream of 16-bit code units, which can then be output externally and encoded or stored in any convenient encoding that preserves this stream, EVEN if it is not valid UTF-16.

I don't know where you get this from, but you won't find any mention of it in the standard. We are dealing with text, Unicode scalar values, not encodings. At the risk of repeating myself, read section 8.2 of RFC 7159.

Best,

Daniel

From verdy_p at wanadoo.fr Thu May 7 22:08:21 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 05:08:21 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

The RFC is just informative, not normative, and the effective usage and implementations simply support JSON as plain 16-bit streams, even if the transport syntax requires encoding it as plain text (using some UTF, not necessarily UTF-8, even though that is the default).

Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uF800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either; when receiving the JSON stream and using it in JavaScript, you'll see no missing code units, no replaced code units, and no exception either.

From verdy_p at wanadoo.fr Thu May 7 22:12:19 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 05:12:19 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

2015-05-08 5:08 GMT+02:00 Philippe Verdy:

> Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uF800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either; when receiving the JSON stream and using it in JavaScript, you'll see no missing code units, no replaced code units, and no exception either.

Typo: replace F800 with D800, of course.

From petercon at microsoft.com Fri May 8 00:14:41 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 8 May 2015 05:14:41 +0000
Subject: Script / font support in Windows 10
Message-ID:

This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10:

https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099

Peter

From marc at keyman.com Fri May 8 00:27:27 2015
From: marc at keyman.com (Marc Durdin)
Date: Fri, 8 May 2015 05:27:27 +0000
Subject: Script / font support in Windows 10
In-Reply-To:
References:
Message-ID: <1CEDD746887FFF4B834688E7AF5FDA5A73B83158@federation.tavultesoft.local>

That page doesn't appear to be visible outside Microsoft. The public link is https://msdn.microsoft.com/en-us/bb688099 I think.

Marc

From petercon at microsoft.com Fri May 8 00:29:18 2015
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 8 May 2015 05:29:18 +0000
Subject: Script / font support in Windows 10
In-Reply-To:
References:
Message-ID:

Oops... my bad: maybe it isn't on the live servers yet. It will be soon. I'll update with the public link when it is.

From costello at mitre.org Fri May 8 04:27:03 2015
From: costello at mitre.org (Costello, Roger L.)
Date: Fri, 8 May 2015 09:27:03 +0000
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

Philippe Verdy wrote:

> implementations just support JSON as plain 16-bit streams
> Try it yourself: you can perfectly well send JSON text containing '\uFFFF' (a noncharacter) or '\uD800' (an unpaired surrogate), and I've not seen any JSON implementation complain about either

Okay, I gave it a try. I created this string, which contains binary data (a sequence of arbitrary unsigned integers):

"
" When I validated that string against this JSON Schema: { "type" : "string" } using this online validator: https://json-schema-validator.herokuapp.com/ I got an error: Invalid JSON: parse error, line 1 I am pretty sure that Daniel is correct, JSON cannot contain arbitrary bit streams. ? The RFC is just informative not normative Interesting! What does that mean? JSON vendors are free to ignore the JSON RFC and do as they see fit? /Roger From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Thursday, May 07, 2015 11:08 PM To: Daniel B?nzli Cc: Unicode at unicode.org; Costello, Roger L.; Markus Scherer Subject: Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? The RFC is jsut informative not normative, and thez effective usage and implementations just support JSON as plain 16-bit streams, even if the transport syntax requires encoding it in plain-text (using some UTF, not necessarily UTF-8 even if this is the default). Try by yourself, you can perfectly send JSON text containing '\uFFFF' (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON implementation complaining about one or the other, when receiving the JSON stream and using it in Javascript, you'll see no missing code unit or replaced code units and no exception as well. 2015-05-08 3:22 GMT+02:00 Daniel B?nzli >: Le vendredi, 8 mai 2015 ? 02:16, Philippe Verdy a ?crit : > It would be more exact to say that JSON strings, just like strings in Javascript and Java or many programming languages are just binary streams of 16-bit code units. I suggest you have a careful read at RFC 7159 as it specifically implies that this is not the model it supports (albeit using broken or let's say ambiguous/imprecise Unicode terminology). > Then the JSON processor will decode this text and will remap it to an internal UTF-16 encoding (for characters that are not escaped) and the "\uXXXX" will be decoded as plain 16-bit code units. The result will be a stream of 16-bit code units, which can then externally be outpout and encoded or stored in any convenient encoding that preserves this stream, EVEN if this is not valid UTF-16. I don't know where you get this from but you won't find any mention of this in the standard. We are dealing with text, Unicode scalar values, not encodings. At the risk of repeating myself, read section 8.2 of RFC 7159. Best, Daniel -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Fri May 8 06:04:08 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 8 May 2015 13:04:08 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> Message-ID: Le vendredi, 8 mai 2015 ? 05:08, Philippe Verdy a ?crit : > The RFC is jsut informative not normative, RFC 7159 is not informational, it is a proposed standard. > Try by yourself, you can perfectly send JSON text containing '\uFFFF' (non-character) or '\uF800' (unpaired surrogate) and I've not seen any JSON implementation complaining about one or the other, Well now you have (mine). The RFC is very clear that we are dealing with *text-based* data not *binary* data. 
Maybe programming languages that represent their Unicode strings as possibly invalid UTF-16 sequences will happily ingest this, but as section 8.2 mentions, that may not be the case everywhere: software receiving these values "might return different values for the length of a string value or even suffer fatal runtime exceptions".

Best,

Daniel

From verdy_p at wanadoo.fr Fri May 8 06:48:38 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 13:48:38 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

JSON came initially from JavaScript, and it is used extensively with JavaScript. My tests with their JSON parser show that any string that is valid for JavaScript is also valid in JSON (no exception raised, no replaced characters, no deleted characters, even if there are unpaired surrogates or noncharacters like '\uFFFF'). The RFC is deviating from the currently running implementations.

From verdy_p at wanadoo.fr Fri May 8 06:57:20 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 8 May 2015 13:57:20 +0200
Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
In-Reply-To:
References: <406345450D52417C9DEE234A6C0662A2@erratique.ch>
Message-ID:

2015-05-08 11:27 GMT+02:00 Costello, Roger L.:

> Okay, I gave it a try. I created this string, which contains binary data (a sequence of arbitrary unsigned integers):
>
> "
> ------------------------------
> ??}g??
> "

I did not say that these data did not have to be properly escaped. With escaping (\uXXXX), it works with arbitrary sequences of 16-bit code units.
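The disagreement above is easy to observe from Java, whose String type happily holds a lone surrogate but whose UTF-8 encoder will not pass it through. A sketch - behavior as I understand the JDK's defaults; other stacks, such as the Go library mentioned below, substitute U+FFFD instead:

import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        // What a permissive parser might hand you for the escape "\uD800".
        String s = "\uD800";
        System.out.println(s.length()); // 1: the code unit is retained in memory
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // A lone surrogate cannot be expressed in well-formed UTF-8, so the
        // encoder substitutes its replacement byte '?' (0x3F).
        System.out.println(Integer.toHexString(utf8[0] & 0xFF)); // 3f
    }
}

This is exactly the interoperability point from section 8.2: the value survives inside one runtime, but changes as soon as it has to cross an encoding boundary.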
Taking a random one mentioned on that page leads me to http://golang.org/pkg/encoding/json/ in which they say that they replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very surprising since apparently go's strings as text are UTF-8 encoded so when you need to produce your results as UTF-8 then you don't have a lot of solutions... error and/or U+FFFD. In any case deviating or not, that's for good since it would be insane to impose JavaScript's string as a data structure for an interchange format that intents to be universal and *textual*. Best, Daniel From petercon at microsoft.com Fri May 8 09:15:55 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 8 May 2015 14:15:55 +0000 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: I think this is the right public link: https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx From: Peter Constable Sent: Thursday, May 7, 2015 10:29 PM To: Peter Constable; unicode at unicode.org Subject: RE: Script / font support in Windows 10 Oops... my bad: maybe it isn't on live servers yet. It will be soon. I'll update with the public link when it is. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable Sent: Thursday, May 7, 2015 10:15 PM To: unicode at unicode.org Subject: Script / font support in Windows 10 This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10: https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099 Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri May 8 09:41:49 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 08 May 2015 07:41:49 -0700 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode =?UTF-8?Q?character=3F?= Message-ID: <20150508074149.665a7a7059d7ee80bb4d670165c8327d.c8e098d352.wbe@email03.secureserver.net> I interpreted Roger Costello's original question literally, that he wanted to find instances of '\uXXXX' that do not represent an ASSIGNED Unicode character. Apologies if this discussion is really about something else. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From mark at macchiato.com Fri May 8 11:04:00 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 8 May 2015 09:04:00 -0700 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: Thanks! Mark *? Il meglio ? l?inimico del bene ?* On Fri, May 8, 2015 at 7:15 AM, Peter Constable wrote: > I think this is the right public link: > > > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx > > > > > > *From:* Peter Constable > *Sent:* Thursday, May 7, 2015 10:29 PM > *To:* Peter Constable; unicode at unicode.org > *Subject:* RE: Script / font support in Windows 10 > > > > Oops? my bad: maybe it isn?t on live servers yet. It will be soon. I?ll > update with the public link when it is. > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org > ] *On Behalf Of *Peter Constable > *Sent:* Thursday, May 7, 2015 10:15 PM > *To:* unicode at unicode.org > *Subject:* Script / font support in Windows 10 > > > > This page on MSDN that provides an overview of Windows support for > different scripts has now been updated for Windows 10: > > > > https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099 > > > > > > > > Peter > -------------- next part -------------- An HTML attachment was scrubbed... 
From richard.wordingham at ntlworld.com Fri May 8 11:49:47 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 17:49:47 +0100 Subject: Script / font support in Windows 10 In-Reply-To: References: Message-ID: <20150508174947.2fca36c4@JRWUBU2> On Fri, 8 May 2015 14:15:55 +0000 Peter Constable wrote: > I think this is the right public link: > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx Does this confirm the intention of Microsoft that at some stage the Universal Shaping Engine (USE) in Windows 10 will support the Tai Tham script? In February we discovered that the USE didn't support syllable-final SAKOT+consonant - the commonest and eponymous use of U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character in the Tai Tham script. For example, we can't write the name of the city of 'Chiang Rai' in the Tai Tham script using the USE. Richard. From Andrew.Glass at microsoft.com Fri May 8 12:16:01 2015 From: Andrew.Glass at microsoft.com (Andrew Glass (WINDOWS)) Date: Fri, 8 May 2015 17:16:01 +0000 Subject: Script / font support in Windows 10 In-Reply-To: <20150508174947.2fca36c4@JRWUBU2> References: <20150508174947.2fca36c4@JRWUBU2> Message-ID: Hi Richard, I agree that there is some work to be done to ensure correct display of Tai Tham. That work may involve changes to USE in a future update. We will have a panel on Universal Shaping at the upcoming IUC conference. That will be a good opportunity for a discussion between implementers and font developers. If you are able to attend, that would be great. If not, we can certainly go through the proposed changes you have sent. Cheers, Andrew -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Friday, May 8, 2015 9:50 AM To: unicode at unicode.org Subject: Re: Script / font support in Windows 10 On Fri, 8 May 2015 14:15:55 +0000 Peter Constable wrote: > I think this is the right public link: > > https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx Does this confirm the intention of Microsoft that at some stage the Universal Shaping Engine (USE) in Windows 10 will support the Tai Tham script? In February we discovered that the USE didn't support syllable-final SAKOT+consonant - the commonest and eponymous use of U+1A60 TAI THAM SIGN SAKOT, which may well be the commonest character in the Tai Tham script. For example, we can't write the name of the city of 'Chiang Rai' in the Tai Tham script using the USE. Richard. From richard.wordingham at ntlworld.com Fri May 8 15:27:18 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 21:27:18 +0100 Subject: Script / font support in Windows 10 In-Reply-To: References: <20150508174947.2fca36c4@JRWUBU2> Message-ID: <20150508212718.2f6a48b6@JRWUBU2> On Fri, 8 May 2015 17:16:01 +0000 "Andrew Glass (WINDOWS)" wrote: > I agree that there is some work to be done to ensure correct display > of Tai Tham. That work may involve changes to USE in a future update. That's as I understood it, which is why I was surprised by the degree of commitment in the overview. I did wonder if the overview had been written long ago, so its author was unaware of there being issues with USE and Tai Tham. For example, I got the impression that you had contemplated cloning USE and modifying that clone for Tai Tham, so as to keep the USE simpler. (In the meantime, it may make sense to use the USE for Tai Tham, and let the font clean up the inappropriate dotted circles.
I currently do that for applications that use old versions of HarfBuzz.) Also, I hadn't expected you to commit to a timetable. Richard. From richard.wordingham at ntlworld.com Fri May 8 15:47:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 8 May 2015 21:47:46 +0100 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> Message-ID: <20150508214746.7570e528@JRWUBU2> On Fri, 8 May 2015 05:08:21 +0200 Philippe Verdy wrote: > Try by yourself, you can perfectly send JSON text containing '\uFFFF' > (non-character) or '\uF800' (unpaired surrogate) and I've not seen > any JSON implementation complaining about one or the other, when > receiving the JSON stream and using it in Javascript, you'll see no > missing code unit or replaced code units and no exception as well. Unicode Consortium standards and recommendations allow non-characters to be sent; as far as I can make out, they are just not to be thought of as unstandardised graphic characters. Richard. From doug at ewellic.org Fri May 8 17:37:57 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 08 May 2015 15:37:57 -0700 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) Message-ID: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> Richard Wordingham wrote: >> Try by yourself, you can perfectly send JSON text containing '\uFFFF' >> (non-character) or '\uF800' (unpaired surrogate) and I've not seen >> any JSON implementation complaining about one or the other, when >> receiving the JSON stream and using it in Javascript, you'll see no >> missing code unit or replaced code units and no exception as well. > > Unicode Consortium standards and recommendations allow non-characters > to be sent; as far as I can make out, they are just not to be thought > of as unstandardised graphic characters. As I understand it, from a purely Unicode standpoint, there are differences here between noncharacters and unpaired surrogates. Noncharacters are Unicode scalar values, while unpaired surrogates are not. This means noncharacters may appear in a well-formed UTF-8, -16, or -32 string, while unpaired surrogates may not. They may both be part of a "Unicode string" which does not claim to be in any given encoding form. Authoritative corrections are welcome to help solidify my understanding. I don't wish to get involved in debates over JSON. I've read RFC 7159 and I know what it says. -- Doug Ewell | http://ewellic.org | Thornton, CO From daniel.buenzli at erratique.ch Fri May 8 19:26:59 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 02:26:59 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> Message-ID: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > Noncharacters are Unicode scalar values, Noncharacters are Unicode scalar values, by definitions D14 and D76. > while unpaired surrogates are not. No surrogate code point is a Unicode scalar value, by D71, D73 and D76. > This means noncharacters may appear in a well-formed UTF-8, -16, or > -32 string, I take "appear" to mean "be encoded".
Yes, any Unicode encoding form allows all scalar values to be interchanged, by D79. (However, noncharacters are not designed to be openly interchanged; see "Restricted interchange" on p. 31 of 7.0.0.) > while unpaired surrogates may not. No surrogate code point, *paired or not*, can be encoded in UTF-{8,16,32}, by D92, D91 and D90. All these encoding forms, by definition, assign only Unicode scalar values to code unit sequences (see also the already mentioned p. 31, which clarifies this). However, in UTF-16, code unit sequences may contain surrogate pairs (which taken together represent a Unicode scalar value). > They may both be part of a "Unicode string" which does not claim to be in any given encoding > form. I'm not sure what you mean by that, so I'll let someone else answer. Best, Daniel From verdy_p at wanadoo.fr Fri May 8 19:33:20 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 02:33:20 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <821476CFD30C4A6C95CA6319394C723C@erratique.ch> References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: 2015-05-08 14:32 GMT+02:00 Daniel Bünzli : > Le vendredi, 8 mai 2015 à 13:48, Philippe Verdy a écrit : > > JSON came initially from Javascript, and it is used extensively with > Javascript. > > But not *only*, for a long time now. > > > The RFC is deviating from the currently running implementations. > > Well, did you test them all? There's quite a big list here: > http://www.json.org. Taking a random one mentioned on that page leads me > to http://golang.org/pkg/encoding/json/ in which they say that they > replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very > surprising, since apparently Go's strings as text are UTF-8 encoded, so when > you need to produce your results as UTF-8 you don't have a lot of > solutions... error and/or U+FFFD. I've already said that JSON is UTF-8 encoded by default, but this does not mean that JSON invalidates the escape sequence '\uD800' isolated in a string. For this reason JSON strings are not restricted by the textual encoding of their syntactic representation. So no error is returned, there is no replacement by U+FFFD, and even unpaired surrogates are possible, provided that they are escaped. Basically, JSON strings remain equivalent to Javascript strings, where '\uD800' is also a perfectly valid "string". I make the difference between a "string" and plain text. And if the RFC had not been so confusing by mixing terms (notably the term "code point"), it might have become a standard. For now it is just a tentative attempt to standardize it, but it does not work with existing implementations, which have treated JSON since the beginning as a data serialization format based on Javascript syntax: only the items that are not pure data are removed, such as functions/methods and more complex objects like Javascript regexp literals (functionally equivalent to an object constructor) and object references, keeping only strings, numbers, and two structures: ordered arrays and unordered associative arrays (also called dictionaries, which also subsume ordered arrays treated as associative arrays with number keys, thus reducing everything to a single effective structure, even if ordered arrays also have a simpler syntactic sugar to represent them more compactly).
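For instance, here is a quick check one can run in a Javascript console (a sketch: this is behavior observed in common engines, not something the RFC guarantees for every JSON implementation):

// Sketch: Javascript's built-in JSON.parse accepts escaped lone
// surrogates and noncharacters without error or substitution.
var s = JSON.parse('"\\uD800 \\uFFFF"');   // lone surrogate + noncharacter
console.log(s.length);                     // 3 -- three 16-bit code units
console.log(s.charCodeAt(0).toString(16)); // "d800" -- kept as-is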
If you mean that the JSON string "\uD800" is invalid, it is no longer a data serialization for Javascript, or for other languages also using JSON as a possible syntax for serializing data into plain text. JSON was created because XML (the alternative) was too verbose and had restrictions in its "text" elements. It seems that the RFC just wants to apply to JSON the same restrictions as found in XML, but this deviates JSON from its objective, and I'm convinced that such restrictions are not enforced at all in many JSON implementations, which do not attempt to validate whether the value of the represented string is valid plain text. JSON only transforms strings into a valid plain-text representation, using an encoding syntax with separators and escape sequences, nothing else. If the RFC wants to add such restrictions, it is mixing two layers: the syntactic (plain-text) layer and the lower layer for the internally represented values, which are just a stream of code units. And the only difference in that case is the behavior for isolated/unpaired surrogates (not restricted in Javascript or many languages defining "strings", but restricted in plain text; JSON is there to offer the serialization scheme allowing strings to be safely converted to plain text). From daniel.buenzli at erratique.ch Fri May 8 20:27:20 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 03:27:20 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit : > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli : > > Well, did you test them all? There's quite a big list here: http://www.json.org. Taking a random one mentioned on that page leads me to http://golang.org/pkg/encoding/json/ in which they say that they replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very surprising, since apparently Go's strings as text are UTF-8 encoded, so when you need to produce your results as UTF-8 you don't have a lot of solutions... error and/or U+FFFD. > > I've already said that JSON is UTF-8 encoded by default, but this does not mean that JSON invalidates the escape sequence '\uD800' isolated in a string. You didn't get what I said. When a parser has just parsed a JSON string and wants to give it back to the programmer using the language's native strings, and these strings happen to be UTF-8 encoded in that language, then in the presence of such lone surrogates you are stuck and need to do something, as you cannot encode them in the UTF-8 string. (I understand that in *your* interpretation this should not happen, since I should define a special data type to represent these JSON strings so that they behave like JavaScript strings; that would indeed be very practical, none of my language's native string tools could be used on it...) Anyways, we are largely OT at this point. Best, Daniel From richard.wordingham at ntlworld.com Fri May 8 22:13:52 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 04:13:52 +0100 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
In-Reply-To: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Message-ID: <20150509041352.60c24989@JRWUBU2> On Sat, 9 May 2015 02:26:59 +0200 Daniel Bünzli wrote: > Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > > Noncharacters are Unicode scalar values, > (However, noncharacters are not designed to be openly interchanged; see > "Restricted interchange" on p. 31 of 7.0.0.) That didn't stop their being openly interchanged. > > They may both be part of a "Unicode string" which does not claim to > > be in any given encoding form. > I'm not sure what you mean by that, so I'll let someone else answer. There are a number of phrases whose declared meanings cannot be deduced from the individual words. A UTF-8, UTF-16 or UTF-32 string defines a sequence of scalar values. However, a Unicode 8-bit, 16-bit or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit values that may occur in a UTF-8, UTF-16 or UTF-32 string respectively. This definition has some odd consequences: A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a multi-word encoding. An arbitrary string of unsigned 32-bit values is not in general a Unicode 32-bit string. All strings of unsigned 16-bit values are Unicode 16-bit strings. Not all (Unicode) 16-bit strings are UTF-16 strings. Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and not all Unicode 8-bit strings are UTF-8 strings. I can't think of a practical use for the specific concepts of Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are essentially the same as 16-bit strings, and Unicode 32-bit strings are UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in pedantry; there are more useful categories of 8-bit strings that are not UTF-8 strings. Richard. From richard.wordingham at ntlworld.com Fri May 8 22:42:13 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 04:42:13 +0100 Subject: Surrogates and noncharacters In-Reply-To: <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> Message-ID: <20150509044213.28b48ac8@JRWUBU2> On Sat, 9 May 2015 02:26:59 +0200 Daniel Bünzli wrote: > Le samedi, 9 mai 2015 à 00:37, Doug Ewell a écrit : > > This means noncharacters may appear in a well-formed UTF-8, -16, or > > -32 string, > I take "appear" to mean "be encoded". Yes, any Unicode encoding > form allows all scalar values to be interchanged, by D79. > (However, noncharacters are not designed to be openly interchanged; see > "Restricted interchange" on p. 31 of 7.0.0.) That is irrelevant, for JSON is not restricted to open interchange. Richard. From verdy_p at wanadoo.fr Fri May 8 23:13:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 06:13:33 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: <20150509041352.60c24989@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 5:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > I can't think of a practical use for the specific concepts of Unicode > 8-bit, 16-bit and 32-bit strings.
Unicode 16-bit strings are > essentially the same as 16-bit strings, and Unicode 32-bit strings are > UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in > pedantry; there are more useful categories of 8-bit strings that are > not UTF-8 strings. > And here you're wrong: a 16-bit string is just a sequence of arbitrary 16-bit code units, but a Unicode string (whatever the size of its code units) adds restrictions for validity (the only restriction being in fact that surrogates, when present in 16-bit strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are forbidden). So the concept of "Unicode string" is in fact the same as valid Unicode text: it is a subset of possible strings, restricted by validation rules: - for 8-bit strings (UTF-8) there are other constraints (not all bytes are acceptable, some pairs of bytes are also restricted, and trailing bytes cannot occur alone); - for 16-bit strings (UTF-16), the only constraint is on isolated/unpaired surrogates; - for 32-bit strings (UTF-32), the only constraint is on the two allowed ranges of encoded code points (U+0000..U+D7FF and U+E000..U+10FFFF). For being "plain text" there are additional restrictions: non-characters are also excluded, and only a small subset of controls (basically tabs and newlines) is allowed (the other controls, including U+0000, are restricted to private protocols and not designed for plain text... except specifically in a few legacy 8-bit "charsets" like VISCII or ISO 2022 or Videotext, which need these controls to represent characters as sequences, possibly with contextual encoding). From verdy_p at wanadoo.fr Fri May 8 23:24:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 06:24:36 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: 2015-05-09 3:27 GMT+02:00 Daniel Bünzli : > Le samedi, 9 mai 2015 à 02:33, Philippe Verdy a écrit : > > 2015-05-08 14:32 GMT+02:00 Daniel Bünzli (mailto:daniel.buenzli at erratique.ch)>: > > > Well, did you test them all? There's quite a big list here > http://www.json.org. Taking a random one mentioned on that page leads me > to http://golang.org/pkg/encoding/json/ in which they say that they > replace invalid UTF-16 surrogate pairs by U+FFFD. This is really not very > surprising, since apparently Go's strings as text are UTF-8 encoded, so when > you need to produce your results as UTF-8 you don't have a lot of > solutions... error and/or U+FFFD. > > > I've already said that JSON is UTF-8 encoded by default, but this does > not mean that JSON invalidates the escape sequence '\uD800' isolated in a > string. > You didn't get what I said. When a parser has just parsed a JSON string > and wants to give it back to the programmer using the language's native > strings, and these strings happen to be UTF-8 encoded in that language, > then in the presence of such lone surrogates you are stuck and need to do > something, as you cannot encode them in the UTF-8 string. You are not stuck! You can still regenerate a valid JSON output encoded in UTF-8: it will once again use escape sequences (which are also needed if your text contains the quotation marks used to delimit JSON strings in its syntax).
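Concretely (a sketch assuming an engine with the newer "well-formed JSON.stringify" behavior standardized in ES2019; older engines may emit the raw unpaired code unit instead):

// A lone surrogate parsed from JSON can be re-serialized as an
// escape sequence, so the serialized output remains valid UTF-8.
var lone = JSON.parse('"\\uDEAD"'); // one unpaired 16-bit code unit
console.log(JSON.stringify(lone));  // '"\udead"' -- escaped again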
Unlike UTF-8, JSON has never been designed to restrict its strings so that their represented values are only plain text; it is only a serialization of "strings" to valid plain text using a custom syntax. There's absolutely no need to restrict string values to the same validation rules and the same subset as the set of acceptable plain text: this is not the same layer. One is the string level (in fact not bound to any character encoding and not restricted to text), the other is the plain text, and JSON is the adapter/converter between these two representations. Do not mix these two distinct layers. (This is also the case when someone confuses an XML document with its DOM: not the same layer.) From markus.icu at gmail.com Fri May 8 23:37:40 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 8 May 2015 21:37:40 -0700 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy wrote: > 2015-05-09 5:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > >> I can't think of a practical use for the specific concepts of Unicode >> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >> essentially the same as 16-bit strings, and Unicode 32-bit strings are >> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >> pedantry; there are more useful categories of 8-bit strings that are >> not UTF-8 strings. >> > > And here you're wrong: a 16-bit string is just a sequence of arbitrary > 16-bit code units, but a Unicode string (whatever the size of its code > units) adds restrictions for validity (the only restriction being in fact > that surrogates, when present in 16-bit strings, i.e. UTF-16, must be > paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are > forbidden). > No, Richard had it right. See for example definition D82 "Unicode 16-bit string" in the standard. (Section 3.9 Unicode Encoding Forms, http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) I agree that the definitions for Unicode 8-bit and 32-bit strings are not particularly useful. For being "plain text" there are additional restrictions: non-characters > are also excluded, and only a small subset of controls (basically tabs and > newlines) is allowed (the other controls, including U+0000, are restricted > to private protocols and not designed for plain text... except > specifically in a few legacy 8-bit "charsets" like VISCII or ISO > 2022 or Videotext, which need these controls to represent characters > as sequences, possibly with contextual encoding). > Where did you find that definition of "plain text"? Unicode just defines "plain text" by contrast with "rich text", which is text with markup or other such structure. There is no limitation of code points associated with that term. http://unicode.org/glossary/#plain_text markus From verdy_p at wanadoo.fr Sat May 9 00:55:17 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 07:55:17 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...)
In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 6:37 GMT+02:00 Markus Scherer : > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy wrote: > >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < >> richard.wordingham at ntlworld.com>: >> >>> I can't think of a practical use for the specific concepts of Unicode >>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >>> essentially the same as 16-bit strings, and Unicode 32-bit strings are >>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >>> pedantry; there are more useful categories of 8-bit strings that are >>> not UTF-8 strings. >>> >> >> And here you're wrong: a 16-bit string is just a sequence of arbitrary >> 16-bit code units, but a Unicode string (whatever the size of its code >> units) adds restrictions for validity (the only restriction being in fact >> that surrogates, when present in 16-bit strings, i.e. UTF-16, must be >> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are >> forbidden). >> > > No, Richard had it right. See for example definition D82 "Unicode 16-bit > string" in the standard. (Section 3.9 Unicode Encoding Forms, > http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) > I was right: D82 refers to "UTF-16", which implies the restriction of validity, i.e. NO isolated/unpaired surrogates (but no exclusion of non-characters). I was right; you and Richard were wrong. From verdy_p at wanadoo.fr Sat May 9 00:56:52 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 07:56:52 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: Note: I used "16-bit string" in my sentence, NOT "Unicode 16-bit string", which I used in the latter part of my sentence (but also including 8-bit and 32-bit for the same restrictions in "Unicode strings")... So no contradiction. 2015-05-09 7:55 GMT+02:00 Philippe Verdy : > > > 2015-05-09 6:37 GMT+02:00 Markus Scherer : > >> On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy >> wrote: > >>> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < >>> richard.wordingham at ntlworld.com>: >>> >>>> I can't think of a practical use for the specific concepts of Unicode >>>> 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are >>>> essentially the same as 16-bit strings, and Unicode 32-bit strings are >>>> UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in >>>> pedantry; there are more useful categories of 8-bit strings that are >>>> not UTF-8 strings. >>>> >>> >>> And here you're wrong: a 16-bit string is just a sequence of arbitrary >>> 16-bit code units, but a Unicode string (whatever the size of its code >>> units) adds restrictions for validity (the only restriction being in fact >>> that surrogates, when present in 16-bit strings, i.e. UTF-16, must be >>> paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are >>> forbidden). >>> >> >> No, Richard had it right. See for example definition D82 "Unicode 16-bit >> string" in the standard.
(Section 3.9 Unicode Encoding Forms, >> http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) >> > > I was right: D82 refers to "UTF-16", which implies the restriction of > validity, i.e. NO isolated/unpaired surrogates (but no exclusion of > non-characters). > > I was right; you and Richard were wrong. From verdy_p at wanadoo.fr Sat May 9 01:00:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 08:00:34 +0200 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: 2015-05-09 6:37 GMT+02:00 Markus Scherer : > Where did you find that definition of "plain text"? > I have not said that Unicode defines what plain text is. It is defined in the RFC describing the MIME type and giving it the name "plain text". > Unicode just defines "plain text" by contrast with "rich text", which is > text with markup or other such structure. There is no limitation of code > points associated with that term. > http://unicode.org/glossary/#plain_text > This is not a definition, or just a mere definition of "Unicode plain text" (i.e. more restrictive than "plain text"). Please don't add restricting/qualifying words ("Unicode") that I did not use in my sentence **on purpose**. Plain text was defined long before Unicode wrote its informative glossary. From richard.wordingham at ntlworld.com Sat May 9 04:59:57 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 10:59:57 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: <20150509105957.66267e13@JRWUBU2> On Sat, 9 May 2015 07:55:17 +0200 Philippe Verdy wrote: > 2015-05-09 6:37 GMT+02:00 Markus Scherer : > > > On Fri, May 8, 2015 at 9:13 PM, Philippe Verdy > > wrote: > >> 2015-05-09 5:13 GMT+02:00 Richard Wordingham < > >> richard.wordingham at ntlworld.com>: WARNING: This post belongs in pedants' corner, or possibly a pantomime. > >>> I can't think of a practical use for the specific concepts of > >>> Unicode 8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings > >>> are essentially the same as 16-bit strings, and Unicode 32-bit > >>> strings are UTF-32 strings. 'Unicode 8-bit string' strikes me > >>> as an exercise in pedantry; there are more useful categories of > >>> 8-bit strings that are not UTF-8 strings. > >> And here you're wrong: a 16-bit string is just a sequence of > >> arbitrary 16-bit code units, but a Unicode string (whatever the > >> size of its code units) adds restrictions for validity (the only > >> restriction being in fact that surrogates, when present in 16-bit > >> strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and > >> 8-bit (UTF-8) strings, surrogates are forbidden). You are thinking of a Unicode string as a sequence of codepoints. Now that may be a linguistically natural interpretation of 'Unicode string', but 'Unicode string' has a different interpretation, given in D80. A 'Unicode string' (D80) is a sequence of code-units occurring in some Unicode encoding form.
By this definition, every permutation of the code-units in a Unicode string is itself a Unicode string. UTF-16 is unique in that every code-unit corresponds to a codepoint. (We could extend the Unicode codespace (D9, D10) by adding integers for the bytes of multibyte UTF-8 encodings, but I see no benefit.) A Unicode 8-bit string may have no interpretation as a sequence of codepoints. For example, the 8-bit string <C2 A0> is a Unicode 8-bit string denoting a sequence of one Unicode scalar value, namely U+00A0. <A0 C2> is therefore also a Unicode 8-bit string, but it has no defined or obvious interpretation as a codepoint; it is *not* a UTF-8 string. The string <E0 80 80> is also a Unicode 8-bit string, but is not a UTF-8 string because the sequence is not the shortest representation of U+0000. The 8-bit string <C0 80> is *not* a Unicode 8-bit string, for the byte C0 does not occur in well-formed UTF-8; one does not even need to note that it is not the shortest representation of U+0000. > > No, Richard had it right. See for example definition D82 "Unicode > > 16-bit string" in the standard. (Section 3.9 Unicode Encoding Forms, > > http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf) > I was right: D82 refers to "UTF-16", which implies the restriction of > validity, i.e. NO isolated/unpaired surrogates (but no exclusion of > non-characters). No, D82 merely requires that each 16-bit value be a valid UTF-16 code unit. Unicode strings, and Unicode 16-bit strings in particular, need not be well-formed. For x = 8, 16, 32, a 'UTF-x string', equivalently a 'valid UTF-x string', is one that is well-formed in UTF-x. > I was right; you and Richard were wrong. I stand by my explanation. I wrote it with TUS open at the definitions by my side. Richard. From daniel.buenzli at erratique.ch Sat May 9 06:09:11 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 13:09:11 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509044213.28b48ac8@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509044213.28b48ac8@JRWUBU2> Message-ID: <7BB6B573C8F448B9BF024AC65B86AC18@erratique.ch> Le samedi, 9 mai 2015 à 05:42, Richard Wordingham a écrit : > > (However, noncharacters are not designed to be openly interchanged; see > > "Restricted interchange" on p. 31 of 7.0.0.) > > That is irrelevant, for JSON is not restricted to open interchange. Irrelevant to what? I never said such a thing. Of course you can have noncharacters in JSON strings. I was just mentioning that it is not *advised* by the standard to interchange noncharacters. In practice you can always have them. Best, Daniel From daniel.buenzli at erratique.ch Sat May 9 07:16:28 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sat, 9 May 2015 14:16:28 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> Message-ID: <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Le samedi, 9 mai 2015 à 06:24, Philippe Verdy a écrit : > You are not stuck! You can still regenerate a valid JSON output encoded in UTF-8: it will once again use escape sequences (which are also needed if your text contains the quotation marks used to delimit JSON strings in its syntax).
That's a possible resolution, but a very bad one: I can then no longer, in my program, distinguish between the JSON strings "\uDEAD" and "\\uDEAD". This leads exactly to the interoperability problems mentioned in section 8.2 of RFC 7159. You say passing escapes to the programmer is needed if your text contains quotation marks; this is nonsense. A good and sane JSON codec will never let the programmer deal with escapes directly; it is its responsibility to let the programmer deal only with the JSON *data*, not the details of the encoding of the data. As such, it will automatically unescape on decoding, to give you the data represented by the encoding, and automatically escape (if needed) the data you give it on encoding. > Unlike UTF-8, JSON has never been designed to restrict its strings so that their represented values are only plain text; it is only a serialization of "strings" to valid plain text using a custom syntax. You say a lot of things about what JSON is supposed to be or has been designed for. It would be nice to substantiate your claims by pointing at relevant standards. If JSON as in RFC 4627 really wanted to transmit sequences of bytes, I think it would have been *much more* explicit. The introductions of both RFC 4627 (remember, written by the *inventor* of JSON) and RFC 7159 (which obsoletes 4627) say "A string is a sequence of zero or more Unicode characters", and, as we already mentioned, we both agree this is very imprecise. There are two interpretations: * This is a sequence of Unicode scalar values, i.e. text (mine) * This is a sequence of Unicode code points, i.e. a JavaScript string (yours) Now, given this imprecision, the fact is that you cannot ignore that some stupid people that are very wrong, like me, will take the first interpretation. Since this interpretation is less liberal, you will have to cope with it and acknowledge the fact that lone escaped surrogates may not be interpreted correctly in the wild. This leads to the clarification and the interoperability warnings of section 8.2 in RFC 7159. If you read these two paragraphs carefully, you may infer that their "Unicode character" is more likely to be "Unicode scalar value". These paragraphs were not present in RFC 4627, so the latter was really ambiguous; I would, however, say RFC 7159 is not. If you don't agree with that, we are still left with the above two possible interpretations, and if you care about interoperability you should know which interpretation to take. Best, Daniel From verdy_p at wanadoo.fr Sat May 9 07:51:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 14:51:18 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Message-ID:
> > You say passing escapes to the programmer is needed if your text contains > quotation marks, this is nonsense. A good and sane JSON codec will never > let the programmer deal with escapes directly, it is its responsability to > allow the programmer to only deal with the JSON *data* not the details of > the encoding of the data. Yes, this is part of the codec, the data itself is not modified and does not have to handle the syntax (for quotation marks or escapes). > As such it will automatically unescape on decoding to give you the data > represented by the encoding and automatically escape (if needed) the data > you give it on encoding. > > > Unlike UTF-8, JSON has never been designed to restrict its strings to > have its represented values to be only plain-text, it is a only a > serialization of "strings" to valid plain-text using a custom syntax. > You say a lot of things about what JSON is supposed to be/has been > designed for. It would be nice to substantiate your claims by pointing at > relevant standards. If JSON as in RFC 4627 really wanted to transmit > sequences of bytes I think it would have been *much more* explicit. > No instead it speaks (incorrectly) about code points and mixes the concept with code units. Code units are just code units nothing else, they are not "characters", and certainly not in the meaning of "Unicode abstract characters" and not even "code points" or "scalar values" (and I did not speak about sequences of "bytes", which is the result of the UTF-8 encoding if this is the charset used for the transport of the plain-text JSON syntax) -------------- next part -------------- An HTML attachment was scrubbed... URL: From costello at mitre.org Sat May 9 08:04:10 2015 From: costello at mitre.org (Costello, Roger L.) Date: Sat, 9 May 2015 13:04:10 +0000 Subject: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> Message-ID: Hi Folks, Just want you to know, this discussion is EXCELLENT. I am learning a lot. Thank you! /Roger -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 9 08:07:12 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 15:07:12 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: <406345450D52417C9DEE234A6C0662A2@erratique.ch> <821476CFD30C4A6C95CA6319394C723C@erratique.ch> <9B5FDBFA5A3B4C32B9C1A0070A36C663@erratique.ch> Message-ID: 2015-05-09 14:51 GMT+02:00 Philippe Verdy : > You say a lot of things about what JSON is supposed to be/has been >> designed for. It would be nice to substantiate your claims by pointing at >> relevant standards. If JSON as in RFC 4627 really wanted to transmit >> sequences of bytes I think it would have been *much more* explicit. >> > > No instead it speaks (incorrectly) about code points and mixes the concept > with code units. > In fact it mixes/confuses three separate concepts, i.e. three layers distinct (that the Unicode standard distinguishes clearly): -1. the internal dataset (values of "strings" as expected by programmers and transmitted via the CODEC of the JSON parser/encoder), using code units in a fixed size (16-bit) -2. 
2. the plain-text syntax of JSON (which is independent of the actual character encoding, but can be formalized as a stream of Unicode code points); 3. the serialization of this plain text in a stream of bytes (using some UTF encoding scheme, or other legacy 8-bit charsets). The initial implementation of JSON, in Javascript, still used today, just performs the adaptation of the internal dataset (16-bit streams) to plain text (layers 1 and 2 above). Then Javascript itself specifies no serialization of its source: this is part of the MIME standard for the transport (using the MIME "charset" attribute of the media type) when using protocols like HTTP or HTTPS, or some external metadata, or a static definition which is system-dependent (for example in local file systems, if they do not store the metadata as a file attribute; a case for which the "BOM" and similar signatures were created, or for which there is specific syntax in some languages like XML or HTML for specifying the charset at the beginning of the file, or by using some "charset guesser"). Here also, Javascript programmers do not have to worry about layers 2 and 3 above; they just have to handle 16-bit streams (same remark for PHP, Java and many other programming languages): they work at layer 1, where there's a single encoding, a single size of code unit for everything, and no restriction on the values of code units. The same thing applies when working with the DOM API in XML, HTML, SVG... From verdy_p at wanadoo.fr Sat May 9 08:11:51 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 15:11:51 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509105957.66267e13@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> Message-ID: 2015-05-09 11:59 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > No, D82 merely requires that each 16-bit value be a valid UTF-16 code > unit. Unicode strings, and Unicode 16-bit strings in particular, need > not be well-formed. For x = 8, 16, 32, a 'UTF-x string', equivalently a > 'valid UTF-x string', is one that is well-formed in UTF-x. > > > I was right; you and Richard were wrong. > > I stand by my explanation. I wrote it with TUS open at the definitions > by my side. > Except that you are explaining something else. You are speaking about "Unicode strings" which are bound to a given UTF; I was speaking ONLY about "16-bit strings", which were NOT bound to Unicode (and did not have to be). So TUS is completely not relevant here. I have NOT written "Unicode 16-bit strings", only "16-bit strings", and I clearly opposed the two DISTINCT concepts in the SAME sentence, so that no confusion was possible.
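The validity question the two sides are arguing about is mechanical to check. Here is a sketch in JavaScript, whose strings are exactly sequences of arbitrary 16-bit code units (isWellFormedUTF16 is a hypothetical helper name; newer engines ship the equivalent built-in String.prototype.isWellFormed):

// Distinguishes an arbitrary 16-bit string from a well-formed
// UTF-16 string: well-formed means no unpaired surrogates.
function isWellFormedUTF16(s) {
  for (var i = 0; i < s.length; i++) {
    var u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {                   // high surrogate
      var next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return false; // unpaired
      i++;                                              // skip the low half
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return false;                                     // stray low surrogate
    }
  }
  return true;
}
console.log(isWellFormedUTF16("T\uD800\uDCC1")); // true  -- <0054, D800, DCC1>
console.log(isWellFormedUTF16("\uDCC1T\uD800")); // false -- a permutation of it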
From richard.wordingham at ntlworld.com Sat May 9 09:26:34 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 15:26:34 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> Message-ID: <20150509152634.47f815f0@JRWUBU2> On Sat, 9 May 2015 15:11:51 +0200 Philippe Verdy wrote: > Except that you are explaining something else. You are speaking about > "Unicode strings" which are bound to a given UTF; I was speaking ONLY > about "16-bit strings", which were NOT bound to Unicode (and did not > have to be). So TUS is completely not relevant here. I have NOT written > "Unicode 16-bit strings", only "16-bit strings", and I clearly opposed > the two DISTINCT concepts in the SAME sentence, so that no confusion > was possible. The long sentence of yours I am responding to is: "And here you're wrong: a 16-bit string is just a sequence of arbitrary 16-bit code units, but a Unicode string (whatever the size of its code units) adds restrictions for validity (the only restriction being in fact that surrogates, when present in 16-bit strings, i.e. UTF-16, must be paired, and in 32-bit (UTF-32) and 8-bit (UTF-8) strings, surrogates are forbidden)." The point I made is that every string of 16-bit values is (valid as) a Unicode string. Do you accept that? If not, please exhibit a counter-example. In particular, I claim that all 6 permutations of <0054, D800, DCC1> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and <0054, D800, DCC1>, are UTF-16 strings. Richard. From verdy_p at wanadoo.fr Sat May 9 09:54:30 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 16:54:30 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509152634.47f815f0@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> Message-ID: 2015-05-09 16:26 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > In particular, I claim that all 6 permutations of <0054, D800, DCC1> > are Unicode strings, but that only two, namely <D800, DCC1, 0054> and > <0054, D800, DCC1>, are UTF-16 strings. > Again, you use "Unicode strings" for your 6 permutations, but in your example they have nothing that makes them "Unicode strings", given that you allow arbitrary code units in arbitrary order, including unpaired ones. The 6 permutations are just "16-bit strings" (adding "Unicode" for these 6 permutations gives absolutely no value if you keep your definition, but visibly it cannot fit with the term used in the RFC trying to normalize JSON, with similar confusions!). TUS does not define a "Unicode string" the way you do here. TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16 (which implies conformance and only accepts the latter two strings, which TUS names "Unicode 16-bit strings", not "UTF-16 strings"...). TUS goes further by then distinguishing its encoding schemes (taking into account their serialization to 8-bit streams, and also considering the byte order, for defining the 3 supported UTF-16 encoding schemes: with or without BOM): then a "UTF-16 string" becomes "UTF-16 encoded text" (UTF-16, UTF-16BE or UTF-16LE).
Note also that I used the term "stream" instead of "string" only to avoid restricting the length (but JSON does not support encoding streams of arbitrary length: all of them must have a start, an end, and a defined bounded length), while streams don't necessarily have any defined length property, independently of the way we measure length: in bytes, code units, code points, combining sequences or grapheme clusters... From richard.wordingham at ntlworld.com Sat May 9 10:51:21 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 9 May 2015 16:51:21 +0100 Subject: Surrogates and noncharacters In-Reply-To: References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> Message-ID: <20150509165121.502d9906@JRWUBU2> On Sat, 9 May 2015 16:54:30 +0200 Philippe Verdy wrote: > 2015-05-09 16:26 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > In particular, I claim that all 6 permutations of <0054, D800, DCC1> > > are Unicode strings, but that only two, namely <D800, DCC1, 0054> > > and <0054, D800, DCC1>, are UTF-16 strings. > > Again, you use "Unicode strings" for your 6 permutations, but in your > example they have nothing that makes them "Unicode strings", given > that you allow arbitrary code units in arbitrary order, including > unpaired ones. The 6 permutations are just "16-bit strings" (adding > "Unicode" for these 6 permutations gives absolutely no value if you > keep your definition, but visibly it cannot fit with the term used in > the RFC trying to normalize JSON, with similar confusions!). > TUS does not define a "Unicode string" the way you do here. D80 _Unicode string:_ A code unit sequence containing code units of a particular Unicode encoding form. RW: Note that by this definition, a permutation of a Unicode string is a Unicode string. D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16 code units. D85 _Well-formed:_ A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it _does_ follow the specification of that Unicode encoding form. D89 _In a Unicode encoding form:_ A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form. • A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be _in UTF-8_. Such a Unicode string is referred to as a _valid UTF-8 string_, or a _UTF-8 string_ for short. • A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be _in UTF-16_. Such a Unicode string is referred to as a _valid UTF-16 string_, or a _UTF-16 string_ for short. • A Unicode string consisting of a well-formed UTF-32 code unit sequence is said to be _in UTF-32_. Such a Unicode string is referred to as a _valid UTF-32 string_, or a _UTF-32 string_ for short. > TUS just defines "Unicode 16-bit strings" with a direct reference to > UTF-16 (which implies conformance and only accepts the latter two > strings, which TUS names "Unicode 16-bit strings", not "UTF-16 > strings"...) Look at D82 again. It refers to UTF-16 code units and does not otherwise reference UTF-16. If you still do not believe me, consider D89.
Can you think of an example of a Unicode string consisting of UTF-8 code units, UTF-16 code units or UTF-32 code units that is not a UTF-8 string, not a UTF-16 string, and not a UTF-32 string? If you can't, the use of "well-formed" is curiously redundant in D89. Richard. From unicode at lindenbergsoftware.com Sat May 9 01:26:56 2015 From: unicode at lindenbergsoftware.com (Norbert Lindenberg) Date: Fri, 8 May 2015 23:26:56 -0700 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: References: Message-ID: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode code points in the Basic Multilingual Plane, but also a 12-character sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800 ≤ YYYY < 0xDC00 ≤ ZZZZ ≤ 0xDFFF) for supplementary Unicode code points. A tool checking for escape sequences that don't correspond to any Unicode character must be aware of this, because neither \uYYYY nor \uZZZZ by itself would correspond to any Unicode character, but their combination may well do so. Norbert [1] https://tools.ietf.org/html/rfc7158#section-7 > On May 7, 2015, at 5:46 , Costello, Roger L. wrote: > > Hi Folks, > > The JSON specification says that a character may be escaped using this notation: \uXXXX (XXXX are four hex digits) > > However, not every four hex digits corresponds to a Unicode character. > > Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? > > /Roger > From verdy_p at wanadoo.fr Sat May 9 13:44:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 9 May 2015 20:44:32 +0200 Subject: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character? In-Reply-To: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> References: <42E0DD15-9F43-4014-9720-45BD5210FD12@lindenbergsoftware.com> Message-ID:
Then this internal stream of 16-bit code units will be exposed to the output using the encoding expected by the JSON client or programming environement. In summary, the refernece to Unicode in the RFCs for JSON is not really necesssary, all it needs to say is that the JSON parsers must be able to accept a file containing any plain-text valid in its transport encoding scheme, and that it will be able to decode from it the stream of 16bit code units and generate a valid output in the encoding expected by the client (when the client is Javascript or Java, the internal encoding will be the same as the exposed encoding ; this won't be true in Lua, or PHP or many C/C++ programs that often prefer using 8-bit strings; Some languages are hybrids and support two kinds of strings: 8-bit strings and 16-bit strings, rarely 32-bit strings) 2015-05-09 8:26 GMT+02:00 Norbert Lindenberg : > RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode > code points in the Basic Multilingual Plane, but also a 12-character > sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800 > ? YYYY < 0xDC00 ? ZZZZ ? 0xDFFF) for supplementary Unicode code points. A > tool checking for escape sequences that don?t correspond to any Unicode > character must be aware of this, because neither \uYYYY nor \uZZZZ by > itself would correspond to any Unicode character, but their combination may > well do so. > > Norbert > > [1] https://tools.ietf.org/html/rfc7158#section-7 > > > > On May 7, 2015, at 5:46 , Costello, Roger L. wrote: > > > > Hi Folks, > > > > The JSON specification says that a character may be escaped using this > notation: \uXXXX (XXXX are four hex digits) > > > > However, not every four hex digits corresponds to a Unicode character. > > > > Are there tools to scan a JSON document to detect the presence of > \uXXXX, where XXXX does not correspond to any Unicode character? > > > > /Roger > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun May 10 00:42:14 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 10 May 2015 07:42:14 +0200 Subject: Surrogates and noncharacters In-Reply-To: <20150509165121.502d9906@JRWUBU2> References: <20150508153756.665a7a7059d7ee80bb4d670165c8327d.7ad5c8ff2c.wbe@email03.secureserver.net> <1C0DAB1968D04D5786DB0E31EC46611B@erratique.ch> <20150509041352.60c24989@JRWUBU2> <20150509105957.66267e13@JRWUBU2> <20150509152634.47f815f0@JRWUBU2> <20150509165121.502d9906@JRWUBU2> Message-ID: OK, but D80 and D82 have no purpose, except adding the term "Unicode" redundantly to these expressions. - D80 defines "Unicode string" but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units. This is the basic definition applicable to all languages I've seen (even if they add additional properties or methods in OOP). It is the same as a C/C++ string (if we ignore the additonal convention of using null as a terminator, soething that is not required in the language, but only a convention of its oldest standard libraries; newer libraries encode length separately) - D82 defines "Unicode 16-bit string" but in fact it just defines a generic "16-bit string" as an arbitrary stream of 16-bit code units. This is basically the same as Javascript and Java strings (where they are objects not requiring the null-byte termination but storing the length as an internal property). 
These two rules are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8).

Curiously, D80 and D82 also restrict themselves to bounded strings (with a defined length), instead of streams (with undetermined length, no start index, no absolute position, no terminator, but just a special distinct value returned for EOF, or a method to query the current termination state of the stream, which may be time-dependent).

However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"? After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...

If there are confusions in other documents, it's now probably because of the completely unproductive D80 and D82 definitions of specific terms (which are probably not definitions of terms at all, but just fix the needed local context in order to define D89). The two rules D80 and D82 have absolutely no use in TUS outside D89. So D80 and D82 are probably excessive definitions; D89 would be enough (TUS should not have to dictate other lower-level behavior to programming environments or protocols).

2015-05-09 17:51 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

> On Sat, 9 May 2015 16:54:30 +0200
> Philippe Verdy wrote:
>
> > 2015-05-09 16:26 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
> >
> > > In particular, I claim that all 6 permutations of <0054, D800, DCC1> are Unicode strings, but that only two, namely <D800, DCC1, 0054> and <0054, D800, DCC1>, are UTF-16 strings.
> >
> > Again you use "Unicode strings" for your 6 permutations, but in your example they have nothing that makes them "Unicode strings", given you allow arbitrary code units in arbitrary order, including unpaired ones. The 6 permutations are just "16-bit strings" (adding "Unicode" for these 6 permutations gives absolutely no value if you keep your definition, but visibly it cannot fit with the term used in the RFC trying to normalize JSON, with similar confusions!).
>
> > TUS does not define "Unicode string" the way you do here.
>
> D80 _Unicode string:_ A code unit sequence containing code units of a particular Unicode encoding form
>
> RW: Note that by this definition, a permutation of a Unicode string is a Unicode string.
>
> D82 _Unicode 16-bit string:_ A Unicode string containing only UTF-16 code units.
>
> D85 _Well-formed:_ A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it _does_ follow the specification of that Unicode encoding form
>
> D89 _In a Unicode encoding form:_ A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form.
> • A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be _in UTF-8_. Such a Unicode string is referred to as a _valid UTF-8 string_, or a _UTF-8 string_ for short.
> • A Unicode string consisting of a well-formed UTF-16 code unit sequence is said to be _in UTF-16_. Such a Unicode string is referred to as a _valid UTF-16 string_, or a _UTF-16 string_ for short.
> • A Unicode string consisting of a well-formed UTF-32 code unit sequence is said to be _in UTF-32_.
> Such a Unicode string is referred to as a _valid UTF-32 string_, or a _UTF-32 string_ for short.
>
> > TUS just defines "Unicode 16-bit strings" with a direct reference to UTF-16 (which implies conformance and only accepts the latter two strings, which TUS names "Unicode 16-bit strings", not "UTF-16 strings"...)
>
> Look at D82 again. It refers to UTF-16 code units and does not otherwise reference UTF-16.
>
> If you still do not believe me, consider D89. Can you think of an example of a Unicode string consisting of UTF-8 code units, UTF-16 code units or UTF-32 code units that is not a UTF-8 string, not a UTF-16 string, and not a UTF-32 string? If you can't, the use of "well-formed" is curiously redundant in D89.
>
> Richard.

From richard.wordingham at ntlworld.com Sun May 10 05:23:41 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 10 May 2015 11:23:41 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150510112341.4ea1ea4e@JRWUBU2>

On Sun, 10 May 2015 07:42:14 +0200 Philippe Verdy wrote:

I am replying out of order for greater coherence of my reply.

> However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"? After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...

The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string. By D77 paragraph 1, "Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange", it is therefore not a code unit. The effect of D77, D80 and D83 is that <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.

> - D80 defines "Unicode string", but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units.

No - see argument above.

> These two rules [D80 and D82 - RW] are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8)

Do you still maintain this reading of D77? D77 is not as clear as it should be.

> D80 and D82 have no purpose, except adding the term "Unicode" redundantly to these expressions.

I have the cynical suspicion that these definitions were added to preserve the interface definitions of routines processing UCS-2 strings when the transition to UTF-16 occurred. They can also have the (intentional?) side-effect of making more work for UTF-8 and UTF-32 processing, because arbitrary 8-bit strings and 32-bit strings are not Unicode strings.

Richard.
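Richard's distinction can be stated as a small C++ sketch (a hypothetical illustration of his reading of D77/D80/D83, not wording from TUS): any array of 32-bit values is a 32-bit string, but only one whose elements are all Unicode scalar values is a Unicode 32-bit string, which for UTF-32 coincides with being a UTF-32 string.

    #include <cstdint>
    #include <vector>

    // A Unicode scalar value: any code point except the surrogates (D76).
    // These are exactly the values a UTF-32 code unit may take.
    bool is_scalar_value(std::uint32_t v) {
        return v < 0x110000u && !(v >= 0xD800u && v <= 0xDFFFu);
    }

    // <0xFFFFFFFF> is a 32-bit string, but this test rejects it, so it
    // is neither a Unicode 32-bit string nor a UTF-32 string.
    bool is_unicode_32bit_string(const std::vector<std::uint32_t>& s) {
        for (std::uint32_t v : s)
            if (!is_scalar_value(v)) return false;
        return true;
    }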
From haberg-1 at telia.com Sun May 10 13:35:41 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Sun, 10 May 2015 20:35:41 +0200
Subject: Surrogates and noncharacters
Message-ID: <881564BF-0C35-4947-8CAD-04CFAEB0AC6C@telia.com>

> On 10 May 2015, at 12:23, Richard Wordingham wrote:
>
> > However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"?
>
> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string.

Even though the values with the highest bit set are not part of original UTF-32, it can easily be extended, as can original UTF-8, which may be simpler to implement.

From verdy_p at wanadoo.fr Sun May 10 14:19:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 10 May 2015 21:19:52 +0200
Subject: Surrogates and noncharacters
Message-ID:

The way I read D77 (code unit), it is not bound to any Unicode encoding form; "the minimal bit combination that can represent a unit of encoded text for processing or interchange" can be of any bit length and can even use a non-binary representation (not bit-based: it could be ternary, or floating point, or base ten, with the remaining bit patterns possibly used for other functions such as clock synchronization/calibration or polarization balancing, leaving only some patterns distinguishable, but not necessarily an exact power of two...).

I don't see why a 32-bit code unit or an 8-bit code unit has to be bound to UTF-32 or UTF-8 in D77; a code unit is just a code unit; it does not have to be assigned any Unicode scalar value or exist in a specific pattern valid for UTF-32 or UTF-8 (in addition, these two UTFs are not the only ones supported; look at SCSU for example, or GB18030, which are also conforming UTFs). A code unit is just one element within an enumerable and finite set of elements that is transmissible to some interface and interchangeable.

It's up to each UTF to define how it can use them: the UTFs are usable on these sets provided that the sets are large enough to contain at least the number of distinct code units required for the UTF to be supported (which means that the actual bit count of the transported code units does not matter; this is out of scope of TUS, which just requires sets with sufficient cardinality).

For these reasons I absolutely do not see why you argue that 0xFFFFFFFF cannot be a valid 32-bit code unit, and then why <0xFFFFFFFF> can't be a valid 32-bit string (or "Unicode 32-bit string", as TUS renames it in D80-D83 in a way that is really unproductive, and in fact confusing).
As well, nothing prohibits supporting the UTF-32 encoding form over a 21-bit stream, using another "encoding scheme" (which could not also be named UTF-32 or UTF-32BE or UTF-32LE, but could be named "UTF-32-21"): the result will be a 21-bit string, but the 21-bit code unit 0x1FFFFF will still be valid.

From richard.wordingham at ntlworld.com Sun May 10 15:44:29 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 10 May 2015 21:44:29 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150510214429.5d1ad31f@JRWUBU2>

On Sun, 10 May 2015 21:19:52 +0200 Philippe Verdy wrote:

> The way I read D77 (code unit), it is not bound to any Unicode encoding form;

Agreed.

> "the minimal bit combination that can represent a unit of encoded text for processing or interchange" can be of any bit length and can even use a non-binary representation (not bit-based: it could be ternary, or floating point, or base ten, with the remaining bit patterns possibly used for other functions such as clock synchronization/calibration or polarization balancing, leaving only some patterns distinguishable, but not necessarily an exact power of two...)

I don't object to that reading, but I'm not sure it's correct.
> I don't see why a 32-bit code unit or an 8-bit code unit has to be bound to UTF-32 or UTF-8 in D77; a code unit is just a code unit; it does not have to be assigned any Unicode scalar value or exist in a specific pattern valid for UTF-32 or UTF-8 (in addition, these two UTFs are not the only ones supported; look at SCSU for example, or GB18030, which are also conforming UTFs):

D77 is definitely not bound to Unicode encoding forms - it gives Shift-JIS as an example of an encoding that has code units.

> A code unit is just one element within an enumerable and finite set of elements that is transmissible to some interface and interchangeable.
>
> It's up to each UTF to define how it can use them: the UTFs are usable on these sets provided that the sets are large enough to contain at least the number of distinct code units required for the UTF to be supported (which means that the actual bit count of the transported code units does not matter; this is out of scope of TUS, which just requires sets with sufficient cardinality):

The critical matter is the number of array elements needed for each scalar value, and the pattern of which elements of the scalar values have the 'same' values.

> For these reasons I absolutely do not see why you argue that 0xFFFFFFFF cannot be a valid 32-bit code unit

Fair point so far. I agree it can be a 32-bit code unit in some character encoding. However, it is not a UTF-32 code unit.

> and then why <0xFFFFFFFF> can't be a valid 32-bit string

I agree that it is a 32-bit string. I don't know what you mean by the word 'valid' in this context.

> (or "Unicode 32-bit string", as TUS renames it in D80-D83 in a way that is really unproductive, and in fact confusing).

I hope you now see that it cannot be a Unicode 32-bit string, for 0xFFFFFFFF is not a UTF-32 code unit. This is a key point in the difference between:

a) x-bit string,
b) Unicode x-bit string, and
c) UTF-x string

For x=8, these are three different things. For x=16 or x=32, these are two different things, but they do not split the same way.

D80-D83 do not directly rename 8-bit strings, 16-bit strings or 32-bit strings as Unicode 8-bit strings, Unicode 16-bit strings or Unicode 32-bit strings. That all 16-bit strings are Unicode 16-bit strings is a consequence of the definition of UTF-16. Similarly, not all 8-bit strings being Unicode 8-bit strings, and not all 32-bit strings being Unicode 32-bit strings, are consequences of the definitions of UTF-8 and UTF-32 respectively.

I agree that the concept of Unicode 8-bit strings is not useful. The separate concept of Unicode 32-bit strings is also not useful, for I contend that all Unicode 32-bit strings are in fact UTF-32 strings. The latter result is an immediate consequence of UTF-32 not being a multi-code-unit encoding.

> As well, nothing prohibits supporting the UTF-32 encoding form over a 21-bit stream, using another "encoding scheme" (which could not also be named UTF-32 or UTF-32BE or UTF-32LE, but could be named "UTF-32-21"): the result will be a 21-bit string, but the 21-bit code unit 0x1FFFFF will still be valid.
>
> 2015-05-10 12:23 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
>
> > On Sun, 10 May 2015 07:42:14 +0200
> > Philippe Verdy wrote:
> >
> > I am replying out of order for greater coherence of my reply.
> >
> > > However I wonder what would be the effect of D80 in UTF-32: is
> > > <0xFFFFFFFF> a valid "32-bit string"?
> > > After all it too contains a single 32-bit code unit (for at least one Unicode encoding form), even if it has no "scalar value" and then does not have to satisfy D89 (for UTF-32)...
> >
> > The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string. By D77 paragraph 1, "Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange", it is therefore not a code unit.

Correction: "is therefore not a UTF-32 code unit."

> > The effect of D77, D80 and D83 is that <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
> >
> > > - D80 defines "Unicode string", but in fact it just defines a generic "string" as an arbitrary stream of fixed-size code units.
> >
> > No - see argument above.
> >
> > > These two rules are not productive at all, except for saying that all values of fixed-size code units are acceptable (including for example 0xFF in 8-bit strings, which is invalid in UTF-8)

I ask again: Do you still maintain this reading of D77? D77 is not as clear as it should be.

Richard.

From ishida at w3.org Mon May 11 03:25:38 2015
From: ishida at w3.org (Richard Ishida)
Date: Mon, 11 May 2015 09:25:38 +0100
Subject: Notes on Mongolian variant forms
Message-ID: <55506782.2090001@w3.org>

fyi, i have been developing a page

Notes on Mongolian variant forms
http://r12a.github.io/scripts/mongolian/variants

the page compares variant glyph shapes proposed in three documents, and shows what shapes fonts actually produce.

i have been documenting changes at http://lists.w3.org/Archives/Public/public-i18n-mongolian/ - if you want to discuss the page, you are free to join and contribute to that list.

introduction to the page:
======================================
There is some confusion about which shapes should be produced by fonts for Mongolian characters. Most letters have at least one isolated, initial, medial and final shape, but other shapes are produced by contextual factors, such as vowel harmony.

Unicode has a list of standardised variant shapes, dating from 27 November 2013, but that list is not complete and contains what are currently viewed by some as errors. The original list of standardised variants was based on ????? by Professor Quejingzhabu in 2000. A new proposal was published on 20 January 2014, which attempts to resolve the current issues.

The other factor in this is what the actual fonts do. Sometimes they follow the Unicode standardised variants list, other times they diverge from it. Occasionally a majority of implementations appear to diverge in the same way, suggesting that the standardised list should be adapted to reality.

In this document I map the changes between the various proposals, and compare them to various font implementations.

From petercon at microsoft.com Mon May 11 11:45:09 2015
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 11 May 2015 16:45:09 +0000
Subject: Script / font support in Windows 10
Message-ID:

When the update with Windows 10 info was posted, earlier sections for Windows 2000 / XP / XP SP2 were inadvertently deleted. Those have been restored.
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable
Sent: Friday, May 8, 2015 7:16 AM
To: unicode at unicode.org
Subject: RE: Script / font support in Windows 10

I think this is the right public link: https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx

From: Peter Constable
Sent: Thursday, May 7, 2015 10:29 PM
To: Peter Constable; unicode at unicode.org
Subject: RE: Script / font support in Windows 10

Oops... my bad: maybe it isn't on live servers yet. It will be soon. I'll update with the public link when it is.

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Peter Constable
Sent: Thursday, May 7, 2015 10:15 PM
To: unicode at unicode.org
Subject: Script / font support in Windows 10

This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10: https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099

Peter

From doug at ewellic.org Mon May 11 12:44:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 11 May 2015 10:44:19 -0700
Subject: Surrogates and noncharacters
Message-ID: <20150511104419.665a7a7059d7ee80bb4d670165c8327d.4f55ecb0f1.wbe@email03.secureserver.net>

Hans Aberg wrote:

>>> However I wonder what would be the effect of D80 in UTF-32: is <0xFFFFFFFF> a valid "32-bit string"?
>>
>> The value 0xFFFFFFFF cannot appear in a UTF-32 string. Therefore it cannot represent a unit of encoded text in a UTF-32 string.
>
> Even though the values with the highest bit set are not part of original UTF-32, it can easily be extended, as can original UTF-8, which may be simpler to implement.

"Original UTF-8," regardless of where defined, only ever encoded scalar values up to 0x7FFFFFFF. See, for example, RFC 2279.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From haberg-1 at telia.com Mon May 11 13:05:23 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 11 May 2015 20:05:23 +0200
Subject: Surrogates and noncharacters
Message-ID: <3EE00C84-398E-4A21-B18E-A27D8CB49F21@telia.com>

> On 11 May 2015, at 19:44, Doug Ewell wrote:
>
> "Original UTF-8," regardless of where defined, only ever encoded scalar values up to 0x7FFFFFFF. See, for example, RFC 2279.

The intended meaning is that also original UTF-8 can be extended to full 32-bit by using 6-byte sequences with the leading-byte bit pattern 111111xx.
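A sketch of the extension Hans is describing (assuming the RFC 2279 style of 1- to 6-byte sequences, with the 6-byte lead byte widened from 1111110x to 111111xx so that 2 + 5*6 = 32 payload bits cover the full 32-bit range; this is not standard UTF-8, which RFC 3629 restricts to 4 bytes and U+10FFFF):

    #include <cstdint>
    #include <string>

    // Encode any 32-bit value, RFC 2279 style, with the 6-byte lead byte
    // widened to 111111xx as suggested above. Not standard UTF-8.
    std::string encode_extended_utf8(std::uint32_t v) {
        std::string out;
        if (v < 0x80u) { out += char(v); return out; }
        int cont;                                  // continuation bytes
        if      (v < 0x800u)     cont = 1;         // lead 110xxxxx
        else if (v < 0x10000u)   cont = 2;         // lead 1110xxxx
        else if (v < 0x200000u)  cont = 3;         // lead 11110xxx
        else if (v < 0x4000000u) cont = 4;         // lead 111110xx
        else                     cont = 5;         // lead 111111xx (extended)
        std::uint8_t lead = std::uint8_t(0xFFu << (7 - cont));
        out += char(lead | std::uint8_t(v >> (6 * cont)));
        for (int i = cont - 1; i >= 0; --i)
            out += char(0x80u | ((v >> (6 * i)) & 0x3Fu));
        return out;
    }

For example, encode_extended_utf8(0xFFFFFFFF) yields the six bytes FF BF BF BF BF BF, which no standard UTF-8 decoder would accept.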
From verdy_p at wanadoo.fr Mon May 11 14:25:29 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 11 May 2015 21:25:29 +0200
Subject: Surrogates and noncharacters
Message-ID:

Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.

The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (if ever they were in fact converted to larger integers such as 64-bit, which would exhibit differences in APIs returning individual code units).

It's true that in 32-bit integers (signed or unsigned) you cannot differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++ standard libraries for representing the EOF condition returned by functions or macros like getchar()). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers (instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition).

But today, where programming environments are going 64-bit by default, the APIs that return an integer when reading individual code positions will return them as 64-bit integers, even if the inner storage uses 32-bit code units: 0xFFFFFFFF will then be returned as a positive integer and not as the -1 used for EOF. This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still only running 32-bit or 16-bit code; for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit problems of confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).

Well, there are still a lot of devices running 32-bit code (notably in guest VMs, and in small devices), written in C/C++ with the old standard C library but without OOP features (such as exceptions, or methods on buffering objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.

Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs. And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when all that is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for which code units only need to be defined in a large enough set of distinct values: the exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value" or to the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).

Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of the "code units" used at the lower level...
except for the standard encoding schemes for 8-bit interchanges, where byte order matters... but still not the lower-level bit order and the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with some other control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

From richard.wordingham at ntlworld.com Mon May 11 15:43:21 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 11 May 2015 21:43:21 +0100
Subject: Surrogates and noncharacters
Message-ID: <20150511214321.55a94551@JRWUBU2>

On Mon, 11 May 2015 21:25:29 +0200 Philippe Verdy wrote:

> Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs.

Who says they are? I'm just saying that the concepts of Unicode x-bit strings are.

Richard.

From haberg-1 at telia.com Mon May 11 16:53:02 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Mon, 11 May 2015 23:53:02 +0200
Subject: Surrogates and noncharacters
Message-ID: <182274CE-06AD-487C-8E85-9FFEEA54AD94@telia.com>

> On 11 May 2015, at 21:25, Philippe Verdy wrote:
>
> Yes, but this does not mean that 0xFFFFFFFF cannot be used as a (32-bit) code unit in "32-bit strings", even if it is not a valid code point with a valid scalar value in any legacy or standard version of UTF-32.

The reason I did it was to avoid having a check to throw an exception. It merely means that the check for valid Unicode code points, in such a context, must be elsewhere.

> The limitation to 0x7FFFFFFF was certainly just there to avoid signed/unsigned differences in 32-bit integers (if ever they were in fact converted to larger integers such as 64-bit, which would exhibit differences in APIs returning individual code units).

Indeed, so I use uint32_t combined with uint32_t, because char can be signed at the will of the C/C++ compiler implementer.

> It's true that in 32-bit integers (signed or unsigned) you cannot differentiate 0xFFFFFFFF from -1 (which is generally the value chosen in C/C++ standard libraries for representing the EOF condition returned by functions or macros like getchar()). But EOF conditions do not need to be differentiated when you are scanning positions in a buffer of 32-bit integers (instead you compare the relative index in the buffer with the buffer length, or the buffer object includes a separate method to test this condition).
It is a good point - perhaps that was the reason not to allow the highest bit set. But it is not a problem in C++, should it get UTF-32 streams, as they can throw an exception.

> But today, where programming environments are going 64-bit by default, the APIs that return an integer when reading individual code positions will return them as 64-bit integers, even if the inner storage uses 32-bit code units: 0xFFFFFFFF will then be returned as a positive integer and not as the -1 used for EOF.

Right, the C/C++ language specifications say that size_t and friends must be able to hold any size, and similarly for differences. So this forces signed and unsigned 64-bit integral types on a 64-bit platform.

> This was not yet true when the legacy UTF-32 encoding was created, when a majority of environments were still only running 32-bit or 16-bit code; for the 16-bit code, the 0xFFFF code unit, for the U+FFFF code point, had to be assigned to a noncharacter to limit problems of confusion with the EOF condition in C/C++ or similar APIs in other languages (when they cannot throw an exception instead of returning a distinct EOF value).

Right, it might be a non-issue today.

> Well, there are still a lot of devices running 32-bit code (notably in guest VMs, and in small devices), written in C/C++ with the old standard C library but without OOP features (such as exceptions, or methods on buffering objects). In Java, the "int" datatype (which is 32-bit and signed) has not been extended to 64-bit, even on platforms where 64-bit integers are the internal datatype used by the JVM in its natively compiled binary code.

Legacy is a problem.

> Once again, "code units" and "x-bit strings" are not bound to any Unicode or ISO/IEC 10646 or legacy RFC constraints related to the current standard UTFs or legacy (obsoleted) UTFs.
>
> And I still don't see any productive need for "Unicode x-bit strings" in TUS D80-D83, when all that is needed for conformance is NOT the whole range of valid code units, but only the allowed range of scalar values (for which code units only need to be defined in a large enough set of distinct values:
>
> The exact cardinality of this set does not matter, and there can always exist additional valid "code units" not bound to any valid "scalar value" or to the minimal set of distinct "Unicode code units" needed to support the standard Unicode encoding forms).
>
> Even the Unicode scalar values, or the implied values of "Unicode code units", do not have to be aligned with the effective native values of the "code units" used at the lower level... except for the standard encoding schemes for 8-bit interchanges, where byte order matters... but still not the lower-level bit order and the native hardware representation of individually addressable bytes, which may sometimes be larger than 8 bits, with some other control bits or framing bits, and sometimes even with variable bit sizes depending on their relative position in transport frames!

It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32. One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.
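On the EOF point above, one way around overloading -1 is to return the code unit out of band, so the full 32-bit range stays available as data; a minimal C++ sketch of that idea (a hypothetical interface, just to illustrate the alternative to an in-band sentinel):

    #include <cstdint>
    #include <optional>
    #include <vector>

    // End of input is signalled by an empty optional, not by a sentinel
    // value, so 0xFFFFFFFF remains usable as an ordinary code unit.
    std::optional<std::uint32_t> next_unit(const std::vector<std::uint32_t>& buf,
                                           std::size_t& pos) {
        if (pos >= buf.size()) return std::nullopt;  // EOF, out of band
        return buf[pos++];
    }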
From verdy_p at wanadoo.fr Tue May 12 08:45:52 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 12 May 2015 15:45:52 +0200
Subject: Surrogates and noncharacters
Message-ID:

2015-05-11 23:53 GMT+02:00 Hans Aberg:

> It is perfectly fine considering the Unicode code points as abstract integers, with UTF-32 and UTF-8 encodings that translate them into byte sequences in a computer. The code points that conflict with UTF-16 might have been merely declared not in use until UTF-16 had fallen out of use, replaced by UTF-8 and UTF-32.

The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in MIME) is already very advanced. But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, Javascript, even C++ on Windows platforms (where it is the 8-bit interface, based on legacy "code pages" and with poor support for the UTF-8 encoding scheme as a Windows "code page", that is now being phased out), C#, J#... UTF-8 will also remain for long the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).

In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their "character" or "byte" datatype is also used for binary I/O and for supporting the conversion of various binary structures, including executable code, and also because even this datatype is not necessarily 8-bit but may be larger, and not even an even multiple of 8 bits).

> One is going to check that the code points are valid Unicode values somewhere, so it is hard to see the point of restricting UTF-8 to align it with UTF-16.

What I meant when I started discussing in this thread was just to obsolete the unnecessary definitions of "x-bit strings" in TUS. The standard does not need these definitions, and if we want it to be really open to various architectures, languages and protocols, all that is needed is the definition of the "code units" specific to each standard UTF (for an encoding form, or for an encoding scheme when splitting code units into smaller code units and ordering them, by determining only this order and the minimum set of distinct values that these code units must support: we should not speak about "bits", just about "sets" of distinct elements with a sufficient cardinality).

So let's just speak about "UTF-8 code units", "UTF-16 code units", "UTF-32 code units" (not just "code units", and not even "Unicode code units", which is also nonsense given the existence of standardized compression schemes that also define their own "XXX code units"). If the expression "16-bit code units" has been used, it is purely for internal use as a shortcut for the complete name, and these shortcuts are not part of the external entities to standardize (they are not precise enough and cannot be used safely out of their local context): consider these definitions just as "private" ones (in the same sense as in OOP), boxed as internals of TUS seen as a black box. It's not the focus of TUS to discuss what "strings" are: that is just the matter of each integration platform that wants to use TUS.
In summary, the definitions in TUS should be split in two parts: those that are "public" and needed by external references (in other standards), and those that are private (many of them do not even have to be within the generic section of the standard; they should be listed in the appropriate sections needing them locally, also clearly separating the "public" and "private" interfaces). In all cases, the public interfaces must define precise and unambiguous terms, bound to the standard or section of the standard defining them, even if later within that section a shortcut is used as a convenience (to make the text easier to read). We need "scopes" for these definitions (and shorter aliases must be made private).

From haberg-1 at telia.com Tue May 12 08:56:04 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Tue, 12 May 2015 15:56:04 +0200
Subject: Surrogates and noncharacters
Message-ID: <085B08E1-55BB-4B31-AE58-8B7601DCE857@telia.com>

> On 12 May 2015, at 15:45, Philippe Verdy wrote:
>
> The deprecation of UTF-16 and UTF-32 as encoding *schemes* ("charsets" in MIME) is already very advanced.

UTF-32 is usable for internal use in programs.

> But they will certainly not disappear as encoding *forms* for internal use in binary APIs and in several very popular programming languages: Java, Javascript, even C++ on Windows platforms (where it is the 8-bit interface, based on legacy "code pages" and with poor support for the UTF-8 encoding scheme as a Windows "code page", that is now being phased out), C#, J#...

That is legacy, which may remain for long. For example, C/C++ trigraphs are only being removed now, having long been just a bother for compiler implementers. Java is very old, designed around 32-bit programming with limits on function code size, a limitation of pre-PowerPC CPUs that went out of use in the early 1990s.

> UTF-8 will also remain for long the preferred internal encoding for Python and PHP (even if Python also introduced a 16-bit native datatype).
>
> In all cases, programming languages are not based on any Unicode encoding form but on more or less opaque streams of code units, using datatypes that are not constrained by Unicode (because their "character" or "byte" datatype is also used for binary I/O and for supporting the conversion of various binary structures, including executable code, and also because even this datatype is not necessarily 8-bit but may be larger, and not even an even multiple of 8 bits).

Indeed, that is why UTF-8 was invented for use in Unix-like environments.
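As a concrete reminder of why the per-UTF qualification of "code unit" matters, here is the same scalar value written as code units of each of the three standard encoding forms (values computed by hand for this example; a real program would use a conversion library):

    #include <cstdint>

    // U+1F600 as code units of each encoding form. "Code unit" only
    // means something relative to a particular UTF.
    const std::uint8_t  u8 [] = { 0xF0, 0x9F, 0x98, 0x80 };  // UTF-8: four units
    const std::uint16_t u16[] = { 0xD83D, 0xDE00 };          // UTF-16: surrogate pair
    const std::uint32_t u32[] = { 0x0001F600 };              // UTF-32: one unit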
From verdy_p at wanadoo.fr Tue May 12 09:50:02 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 12 May 2015 16:50:02 +0200
Subject: Surrogates and noncharacters
Message-ID:

2015-05-12 15:56 GMT+02:00 Hans Aberg:

> Indeed, that is why UTF-8 was invented for use in Unix-like environments.

Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks). UTF-8 is the default choice for all Internet protocols because all these protocols are based on these units.

This last remark is true except at the lower levels, on the link interfaces and on physical links, where the unit is the bit, or sometimes even smaller units with fractions of bits, grouped into frames that transport not only data bits but also specific items needed by the physical constraints: maintaining the mean polarity, restricting the frequency bandwidth, reducing noise in lateral bands, synchronizing clocks for data sampling, reducing power usage, allowing adaptation of bandwidth by insertion of new parallel streams in the same shared band, allowing the framing format to change when the signal-to-noise ratio degrades (by using some additional signals normally not used by the normal data stream), adapting to the degradation of the transport medium, or adapting to some emergency situations (or sometimes to local legal requirements) that require reducing usage to leave space for priority traffic (e.g. air regulation or military use)...

Each time the transport medium has to be shared with third parties (this is the case for infrastructure networks, or for the radio frequencies in the public airspace, which may also be shared internationally), or if the medium is known to have a slowly degrading quality (e.g. SSD storage), the transport and storage protocols never use the whole available bandwidth, and they reserve some regulatory space for specific signalling that may be needed to let current usages adapt: the physical format of data streams can change at any time, and what was initially encoded one way will then be encoded another way. (Such things also occur extremely locally, for example on data buses within computers, between the various electronic chips on the same motherboard, or whatever could be plugged into it as optional extensions! Electronic devices are full of bus adapters that have to manage the priority between concurrent, unpredictable traffics, under changing environmental conditions such as the current state of power sources.)

Programmers, however, only see the result in the upper-layer data frames, where they manage bits; from these they can create streams of bytes, which are usable for transport protocols and interchange over a larger network or computing system.
But for the worldwide network (the Internet), everything is based on 8-bit bytes, which are the minimal units of information in all related protocols (and also the maximal units: larger units are not portable, not interoperable over the global network), including for negotiating options in these protocols. UTF-8 is then THE universal encoding that will interoperate everywhere on the Internet, even if locally (in connected hosts) other encodings may be used (which *may* be processed more efficiently) after a simple conversion (this does not necessarily require changing the size of the code units used in local protocols and interfaces; for example, there could be some re-encoding, or data compression or expansion).

From haberg-1 at telia.com Tue May 12 10:58:00 2015
From: haberg-1 at telia.com (Hans Aberg)
Date: Tue, 12 May 2015 17:58:00 +0200
Subject: Surrogates and noncharacters
Message-ID: <26383F58-189A-4167-9530-1CE33EE9536F@telia.com>

> On 12 May 2015, at 16:50, Philippe Verdy wrote:
>
>> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
>
> Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).

There is some history here:
https://en.wikipedia.org/wiki/UTF-8#History
At the same time, The Internet was also about to emerge as a worldwide network, but Internet was still very limited and full of restrictions, accessible only from a few (very costly) gateways in other countries, and not even with the IP protocol but with many specific protocols (may be you remember the time of CompuServe, billed only in US dollars and only via international payments and costly bank processing fees; you also had to call an international phone number before a few national phone numbers appeared, cooperated by CompuServe and some national or regional services At that time, the Telcos were not even interested to participate and all wanted to develop their own national or regional networks with their own protocols and "national" standards; real competition in telecommunications only started just before Y2K, with the deregulation in North America and some parts of Europe, in fact just in the EEA, before progressively going worldwide when the initial competitors started to restructure/split/merge and aligning their too many technical standards with the need of a common interoperable one that would worlk in all their new local branches). In fact the worldwide Internet would not have become THE global network without the reorganisation of older dereregulated national telcos and the end of their monopoles. The development of "the" Internet, and the development of the UCS, were then completely made in parallel. Both were appearing to replace former national standards in the same domains previously operated by the former monopoles in telecommunications (and that also needed computing and data standards, not just networking standards). In the early time of Internet, the IP protocol was still not really adapted as the universal internetworking protocol (other competitors were also proposed by private companies, notably Token-Ring by IBM, and the X21-X25 family promoted essentially by European telcos (which prefered realtime protocols with warrantied/reserved bandwidth, and commutation by packets instead of by frames of variable sizes). Even today, there are some remaining parts of the X* network family, but only for short-distance private links: e.g. with ATM (in xDSL technologies), or for local buses within electronic devices (under the 1 meter limit), or within some critical missions (realtime constraints used for networking equipements in aircrafts, that have their own standard, wit ha few of them developped recently as adaptation of Internet technologies over channels in a realtime network, generally not structured in a "mesh" but with a "star" topology and dedicated bandwidths). If you want to look for remaining text encoding standards that are still not based on the UCS, look into aircraft technologies, and military equipements (there's also the GSM family of protocols, which continues to keep many legacy proprietary standards, with poor adaptation to Internet technologies and the UCS...) The situation is starting to change now in aircraft/military technology too (first Airbus in Europe, now also adopted by its major US competitors) and mobile networks (4G), with the full integration of the the IEEE Ethernet standard, that allows a more natural and straightforward integration of IP protocols and the UCS standards with it (even if compatibility is kept by reserving a space for former protocols, something that the IEEE Ethernet standard has already facilitated for the Internet we know now, both in worldwide communications, and in private LANs)... 
From sdaoden at yandex.com Tue May 12 13:46:26 2015
From: sdaoden at yandex.com (Steffen Nurpmeso)
Date: Tue, 12 May 2015 20:46:26 +0200
Subject: Surrogates and noncharacters
Message-ID: <20150512184626.1YIf9x0Co6o=%sdaoden@yandex.com>

Hans Aberg wrote:

|> On 12 May 2015, at 16:50, Philippe Verdy wrote:
|>> Indeed, that is why UTF-8 was invented for use in Unix-like environments.
|>
|> Not the main reason: communication protocols and data storage are also based on 8-bit code units (even if storage groups them into much larger blocks).
|
|There is some history here:
| https://en.wikipedia.org/wiki/UTF-8#History

"What happened was this":

http://doc.cat-v.org/bell_labs/utf-8_history

--steffen

From mark at macchiato.com Tue May 12 16:05:29 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 12 May 2015 14:05:29 -0700
Subject: FYI: The world's languages, in 7 maps and charts
Message-ID:

http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/

From public at khwilliamson.com Tue May 12 17:19:57 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 12 May 2015 16:19:57 -0600
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID: <55527C8D.2070406@khwilliamson.com>

On 05/12/2015 03:05 PM, Mark Davis ☕️
wrote:
> http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/
> //////

And a critique:

http://languagelog.ldc.upenn.edu/nll/?p=18844

From dzo at bisharat.net Tue May 12 17:47:27 2015
From: dzo at bisharat.net (dzo at bisharat.net)
Date: Tue, 12 May 2015 22:47:27 +0000
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry>

And a tangent, picking up on a complaint that Swahili wasn't represented on one of the 7 WaPost graphics:

http://niamey.blogspot.com/2015/05/how-many-people-speak-what-in-africa.html

Two other recent posts on this blog ("Beyond Niamey") critique the Africa part of a set of graphics/maps of "Second Most Spoken Languages Worldwide" (on the Olivet Nazarene University site) - another thought-provoking effort that could inform better if redone.

Don Osborn

Sent via BlackBerry by AT&T

From jonathan.rosenne at gmail.com Wed May 13 04:24:45 2015
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Wed, 13 May 2015 12:24:45 +0300
Subject: RE: FYI: The world's languages, in 7 maps and charts
Message-ID: <000401d08d5e$a811de90$f8359bb0$@gmail.com>

I have two comments:

- If Hindi and Urdu are counted together, why not Italian and Portuguese?

- According to a lecture some time ago by an Israeli professor (I forgot his name), there are 80 languages actively used in Israel, including Hebrew, Arabic, English (both varieties), Russian, Ukrainian, Yiddish, Ladino, Tagalog, most European languages, and various African and East Asian languages used by the large number of refugees from Africa and foreign workers from East Asia.

Best Regards,

Jonathan Rosenne
054-4246522

From verdy_p at wanadoo.fr Wed May 13 05:37:44 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 13 May 2015 12:37:44 +0200
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID:

Italian and Portuguese are difficult to understand between each other (especially in speech: Italians speak really fast). On the opposite, exchange between standard French and Iberian Portuguese is really easy, with a short time of adaptation, either for native French speakers coming to Portugal for the first time or native Portuguese speakers coming to France. Also there is not much difficulty between French Guiana and Brazil for the two regional variants of the two "standard" languages.
Native Portuguese and native French speakers use approximately the same syntactic structure, similar phonology, similar rhythms, and there is a large common lexicon (also with imports from almost the same set of modern foreign languages or historical languages); if this still does not work, reading remains easy, and besides minor differences in grammatical endings, the lexical roots are the same for most words. Many words in Portuguese are borrowed directly from French with very minor changes, and the creation of new words also uses a similar system of prefixes and suffixes, which are nearly identical.

This is not true of modern Italian, which has accumulated many phonetic transformations since Latin, and which has mixed in very different sets of regional minority languages, and where the transformation of meanings (creation of new lemmas of the same term, creation of irregular words composed by fusion, and many mutations) was much deeper than in French and Portuguese (which were more conservative).

But if we speak about Hindi and Urdu, for a long time they were considered the same language in speech (the writing systems of Urdu were separated only for religious reasons, but religious texts could not be read by the vast majority of people in India). They really split into two languages only when education and literacy progressed a lot, starting in the middle of the 20th century, after the independence of India and then the separation of Pakistan. So the practical difficult differences are only in the written script, but as Urdu is also spoken in India, it is still also written with the Devanagari script (in which case it becomes relatively easy to read for native Hindi readers). Arabic-Devanagari transliterators are still heavily used for Urdu in India. And if Urdu native speakers don't want Hindi, they choose to communicate in English (as a de facto interchange language understood by both communities in India, but also by many Urdu speakers in Pakistan).

For many things, Urdu and Hindi are in a situation quite similar to Serbian Cyrillic vs. Croatian (and the Serbian Latin transliteration is often named "Serbocroatian" and can also be used as an interchange language). Bosnian (or more recently Montenegrin) is in the middle, extremely similar to Serbian Latin. For now the separation is not really justified, except for political rather than cultural reasons: the attempt to separate them is made by artificially introducing neologisms that many people don't know or use correctly, or by inventing new orthographic rules that few people know or follow exactly. Mass media cannot really help, because they are overwhelmed by media in other major languages, or because media in all these newly introduced languages are spread over the same regions; local media are not powerful enough to have a decisive audience that could rapidly influence the evolution toward separate languages, and even where they exist, they often ignore the new artificial rules. In that region, many people belonging to distinct communities have to interchange content every day; the time when Serbocroatian was still a single language is not very old; and even if the Cyrillic script is preferred in Serbia, it is still not the only standard: most people also use the Latin script easily for the same language, and transliterators do a good job, with only very minor differences remaining from the standard orthography of Serbian in each script.
2015-05-13 11:24 GMT+02:00 Jonathan Rosenne : > I have two comments: > > - if Hindi and Urdu are counted together, why not Italian and Portuguese? > [...]

From richard.wordingham at ntlworld.com Wed May 13 19:31:29 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 01:31:29 +0100 Subject: Regular Expressions and Canonical Equivalence Message-ID: <20150514013129.0b68eb41@JRWUBU2>

What is the current state of play on regular expression engines that acknowledge canonical equivalence? By acknowledge, I mean that they will deem a string to have a match for a pattern if any string canonically equivalent to the string does. I believe this corresponds to the intent of requirement RL2.1 that was in UTS #18 Unicode Regular Expressions until the towel was thrown in and the paragraph survived but the requirement vanished.

I have been putting my own together, but my efforts have bogged down over how to select the match and subexpression matches to report. The relevant theory is not that of regular languages of strings, but of regular languages of 'traces'. I currently leave the results undefined if an algebraic Kleene star is not a regular expression, e.g. (\u0323\u0301)*. It is particularly relevant to using regular expressions for text rendering, e.g. for something like an imitation of Microsoft's Universal Shaping Engine.

I note that ICU is having another attempt at supporting canonical equivalence - http://bugs.icu-project.org/trac/ticket/9111 'Support UREGEX_CANON_EQ'. At least, they are if the User Guide (http://userguide.icu-project.org/strings/regexp) is to be believed. Perhaps not, though, if the old comments in the ticket are taken seriously.
For example, I believe that one should be able to find the Lanna script subscript nga in the word ?????? /k??/ 'half', or the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and tone mark. One will not find them if one simply applies the string theory of regular expressions to NFD equivalents, as the initial bug report in the ticket suggests doing. A later comment in the ticket suggests that the alphabet for the string theory should be 'the combining sequences'. (I hope there is no theoretical problem from there being an infinite number of them.) The Vietnamese search would work if the alphabet in the string theory were *Vietnamese* collation elements.

In the text rendering domain, HarfBuzz makes regular expressions work with conversion to NFD by permuting the canonical combining classes on a script-by-script basis. This requires care.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 02:59:59 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 08:59:59 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514013129.0b68eb41@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> Message-ID: <20150514085959.433e49af@JRWUBU2>

On Thu, 14 May 2015 01:31:29 +0100 Richard Wordingham wrote:

> I believe this corresponds to the intent of requirement RL2.1 that was in UTS #18 Unicode Regular Expressions until the towel was thrown in and the paragraph survived but the requirement vanished.

I apologise if I am telling those interested what they already know. I couldn't find it written down in terms of NFD strings.

I believe the core of the problem is that Thompson's construction algorithm has to be significantly elaborated for concatenation. When running the non-deterministic finite state machine for the regular expression st, if the string is amnb with ccc(m) != ccc(n), one has to consider the possibility that subsequence an matches expression s and subsequence mb matches expression t. To handle a run of decomposed characters with non-zero canonical combining class, one method adds states of the form (x,y,n) where x is a state for expression s, y is a state for expression t, and n is the non-zero canonical combining class of the last character received.

The additional problem with the (algebraic) Kleene star is that for s* one has to simultaneously consider s, ss, sss and so on, which makes the state machine non-finite. This is probably just a formal problem; once one adds capture groups to the FSM, the memory requirement depends on the size of the string being examined. A solution is to effectively add a loop to the parse structure of the regular expression and add checks to the matching function to avoid unnecessary recursion.

An elegant formal solution to the Kleene star problem interprets (\u0323\u0302)* as (\u0323|\u0302)*. However, that is counter-intuitive, and simply rejecting such expressions would probably be better. Going non-finite is probably better. My *finite* state machine bodge for these cases is to simply match s+ to something uncharacterised between s|ss and s+.

Richard.
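To make the kind of match under discussion concrete, here is a minimal sketch in Python, using only the standard unicodedata module. The function name and the skipping rule are invented for illustration; this is not any engine's actual algorithm. It finds a canonical-equivalent occurrence of ô <U+006F, U+0302> inside buộc, whose NFD form interleaves the dot below between the 'o' and the circumflex:

```python
import unicodedata

def find_canonical(haystack, needle):
    """Return the indices (in NFD(haystack)) of a possibly discontiguous
    match of NFD(needle), or None.  A mark whose non-zero combining class
    differs from that of the needle character being sought may be stepped
    over, because a canonically equivalent reordering can move it aside;
    a starter (class 0) or a mark of the same class blocks the search."""
    h = unicodedata.normalize('NFD', haystack)
    n = unicodedata.normalize('NFD', needle)
    ccc = unicodedata.combining
    for start in range(len(h)):
        positions, i = [], start
        for ch in n:
            while i < len(h) and h[i] != ch:
                if ccc(ch) == 0 or ccc(h[i]) == 0 or ccc(h[i]) == ccc(ch):
                    break        # blocked: no equivalent ordering helps
                i += 1           # different non-zero class: step over it
            if i < len(h) and h[i] == ch:
                positions.append(i)
                i += 1
            else:
                positions = None
                break
        if positions is not None:
            return positions
    return None

# NFD('buộc') is <b, u, o, U+0323, U+0302, c>; the circumflex is found
# at index 4 even though the dot below (class 220) sits in between.
print(find_canonical('bu\u1ED9c', '\u00F4'))   # [2, 4]
```

Note that the returned match is discontiguous: index 3 (the dot below) lies inside the matched span but is not part of the match, which is exactly the submatch-reporting problem raised in the replies below.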
From verdy_p at wanadoo.fr Thu May 14 05:58:29 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 14 May 2015 12:58:29 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514085959.433e49af@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID:

2015-05-14 9:59 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> An elegant formal solution to the Kleene star problem interprets > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > counter-intuitive

Yes, it is problematic: (ab)* is not the same as (a|b)*, as the first expression requires matching pairs of letters "ab" in that order, while the second matches random strings of "a" and "b" (so the second matches *more* input samples).

Even if you consider canonical equivalences (where the relative order of "ab" does not matter, for example because the two have distinct non-zero combining classes), this does not mean that "a" alone will match in the first expression "(ab)*", even though it MUST match in "(a|b)*".

So the solution is elegant just for simplifying the first level of analysis of "(ab)*" by using "(a|b)*" instead. But then you need to perform a second pass on the match to make sure it contains only complete sequences "ab" in that order (or any other order if they are all combining characters with distinct non-zero combining classes) and no unpaired "a" or "b".

Such a two-pass transform should only be made when subregexps within a "(...)*" contain only alternatives (converted to NFD) such that each of them contains ONLY combining characters with distinct non-zero combining classes. If one of the alternatives "ab" contains any character with combining class 0, or if they have blockers with identical non-zero combining classes, we cannot use this transform.

But this two-pass transform is still elegant: the alternatives where we can use it, and that require a second pass, have a bounded length (it is impossible for them to be longer than 255 code points, given that there cannot be more than 255 *distinct* non-zero combining classes). The current UCD uses a much lower number of non-zero combining classes, so this limit is even lower: the substrings where this transform is possible will be extremely short, and a second pass on them will be extremely fast (using very small string buffers that can stay in memory).

For your example "(\u0323\u0302)*", the characters in the alternatives (COMBINING DOT BELOW and COMBINING CIRCUMFLEX ACCENT), once converted to NFD (which is the same here), use at most two distinct non-zero combining classes and no blocker; so it is safe to transform it to (\u0323|\u0302)* for a first-pass match that will then only check candidate matches in the second pass. Or, more efficiently, a second finite state automaton (FSA) can run in parallel with its own state: in your example this second FSA has just 2 states, the initial state 0 (which is also the final/accept state) and state 1 after matching one character of the pair. When you reach the point where matching (\u0323|\u0302)* with the first level of analysis would terminate, you just need to check the state of the second FSA to see whether it is also in the initial/final/accept state 0 (otherwise this is not a valid accept state for the untransformed (\u0323\u0302)* regexp).
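A minimal sketch of this two-pass idea in Python, assuming nothing beyond the standard re module (the function name is invented for illustration). The first pass applies the relaxed alternation; for the second pass, since classes 220 and 230 commute freely, pairing reduces to comparing counts. Note that the second pass is counting, which is exactly what a finite automaton cannot do - the objection raised in the replies below:

```python
import re

# Pass 1: the relaxed alternation (\u0323|\u0302)* as a character class.
RELAXED = re.compile('[\u0323\u0302]*')

def matches_pair_star(s: str) -> bool:
    """Two-pass check for (\\u0323\\u0302)* under canonical equivalence.
    Pass 1: the relaxed pattern must cover the whole string.
    Pass 2: because classes 220 and 230 commute, some canonically
    equivalent ordering groups the marks into pairs iff the counts of
    the two marks are equal -- a counting check, not a finite automaton."""
    if not RELAXED.fullmatch(s):
        return False
    return s.count('\u0323') == s.count('\u0302')

print(matches_pair_star('\u0323\u0302' * 3))      # True: three pairs
print(matches_pair_star('\u0323\u0323\u0302'))    # False: unpaired mark
```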
However, the most difficult part for regexps supporting canonical equivalence is what to do about returning submatches: they are not necessarily contiguous in the input stream. You can still return a matching substring, but if you use it for performing search/replace operations, it becomes difficult to know where to place the replacement, when the replacement string (even if it was converted first to NFD) may also contain combining characters. It is even worse if the replacement contains blockers that will be inserted in the middle of the non-replaced text (and where can we safely place the remaining characters that sit in the middle of the match but are not part of the match itself?).

One solution is not to exclude these characters in the middle of a match, and to return them too. It is up to the replacement function to check for their existence: the regexp engine can just provide, in addition to the returned matched substring, an index of the characters that are in fact not part of the actual match but present in the middle, instead of just the substring for the match. Or it can return just the exact matching substring, but also an index array containing the relative positions of its characters in the actual input string (in standard matches those indexes would be the sequence of integers 0 to N-1, where N is the length of the matched substring; if the sequence is discontinuous in the input, the sequence will still be increasing, but with some steps higher than 1, leaving some holes, and the last index in that sequence will be equal to or higher than N).
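A small sketch of that index-array idea in Python (the helper name is invented; it builds on the positions returned by find_canonical above, but the literal arguments below make it self-contained): given the increasing indices of a discontiguous match, compute the "holes", the ranges of in-between characters that are not part of the match and that a replacement function would have to preserve.

```python
def match_with_holes(positions, start, end):
    """positions: increasing indices of the matched characters;
    [start, end): the span of the input they are drawn from.
    Returns (positions, holes), where holes lists the index ranges of
    characters inside the span that are not part of the match."""
    holes, prev = [], start
    for p in positions:
        if p > prev:
            holes.append((prev, p))
        prev = p + 1
    if prev < end:
        holes.append((prev, end))
    return positions, holes

# The earlier match of 'ô' in NFD('buộc') occupied indices [2, 4] of the
# span [2, 5); the hole (3, 4) is the dot below that must be kept.
print(match_with_holes([2, 4], 2, 5))   # ([2, 4], [(3, 4)])
```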
From webalorixa at gmail.com Thu May 14 08:24:57 2015 From: webalorixa at gmail.com (Luis de la Orden) Date: Thu, 14 May 2015 14:24:57 +0100 Subject: Re: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID:

As a speaker of both Portuguese (mother tongue, native) and Spanish (father, not native anymore), with a Catalan connection (dad was from Barcelona and I lived there for a few months; amazing language, love it to bits), I would say these two languages are closer to each other than Italian is to Portuguese. But never so close as to consider them the same, I can assure you :). In fact, even European Portuguese can be a bit hermetic to understand, although its speakers understand Brazilian Portuguese better. This is all down to the fact that they import more cultural products, such as books and TV programmes, from Brazil than the other way around. European Portuguese speakers say that the way we tonalise and inflect the language is softer, but I believe they understand us due to exposure to the language, which in turn teaches them to decipher our pronunciation and the Brazilian linguistic idiosyncrasies.

From doug at ewellic.org Thu May 14 09:08:14 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 14 May 2015 07:08:14 -0700 Subject: Regular Expressions and Canonical Equivalence Message-ID: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> For example, I believe that one should be able to find [...] the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>. As far as I can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323 COMBINING DOT BELOW> of Vietnamese letter and tone mark.

What you're looking for in this case is neither an NFC match nor an NFD match, but a language-dependent match, as you imply further down. <1ED9> decomposes to <006F 0323 0302>, and if you want a match with <00F4>, which decomposes to <006F 0302>, your regex engine has to reorder the marks. It sounds unlikely that you'll find such an engine, but there is a lot of Vietnamese-language-specific software out there, so you never know.

-- Doug Ewell | http://ewellic.org | Thornton, CO

From slevin at signpuddle.net Thu May 14 12:25:06 2015 From: slevin at signpuddle.net (Stephen E Slevinski Jr) Date: Thu, 14 May 2015 12:25:06 -0500 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID: <5554DA72.4040509@signpuddle.net>

On 5/14/15 5:58 AM, Philippe Verdy wrote: > So the solution is elegant just for simplifying the first level of analysis of "(ab)*" by using "(a|b)*" instead. But then you need to perform a second pass on the match to make sure it contains only complete sequences "ab" in that order (or any other order if they are all combining characters with distinct non-zero combining classes) and no unpaired "a" or "b".

If you always want to find "a" and "b" in a pair without regard to the order, how about the regex: ((ab)|(ba))*

- Steve

From wjgo_10009 at btinternet.com Thu May 14 12:14:57 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 14 May 2015 18:14:57 +0100 (BST) Subject: Tag characters Message-ID: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost>

http://www.unicode.org/L2/L2015/15107.htm

Section E.1.3 of the above-linked document is amazing and is about a brilliant new use for some of the tag characters. What else would be possible if the same sort of technique were applied to another base character?

William Overington 14 May 2015

From richard.wordingham at ntlworld.com Thu May 14 12:55:33 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 18:55:33 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <5554DA72.4040509@signpuddle.net> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <5554DA72.4040509@signpuddle.net> Message-ID: <20150514185533.776a7772@JRWUBU2>

On Thu, 14 May 2015 12:25:06 -0500 Stephen E Slevinski Jr wrote:
> If you always want to find "a" and "b" in a pair without regard to the order, how about the regex: ((ab)|(ba))*

In NFD, the language (\u0323\u0302)* consists of

ε (the empty string)
\u0323\u0302
\u0323\u0323\u0302\u0302
\u0323\u0323\u0323\u0302\u0302\u0302
\u0323\u0323\u0323\u0323\u0302\u0302\u0302\u0302

and so on. Therefore the finite automaton implied by your regex won't work. No regular expression will work. That is mathematically proven. What I have listed above is the standard example of a 'non-regular language', a set of strings that cannot be defined by a finite set of regular expressions.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 13:13:24 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 19:13:24 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> Message-ID: <20150514191324.1e455c57@JRWUBU2>

On Thu, 14 May 2015 12:58:29 +0200 Philippe Verdy wrote:

> > > An elegant formal solution to the Kleene star problem interprets > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > counter-intuitive

The technical term for this is the 'concurrent iteration' - or at least, that's the term used in the 'Book of Traces'.

> For your example "(\u0323\u0302)*", the characters in the alternatives (COMBINING DOT BELOW and COMBINING CIRCUMFLEX ACCENT), once converted to NFD (which is the same here), use at most two distinct non-zero combining classes and no blocker; so it is safe to transform it to (\u0323|\u0302)* for a first-pass match that will then only check candidate matches in the second pass. Or, more efficiently, a second finite state automaton (FSA) can run in parallel with its own state:

You've forgotten the basic problem. A *finite* state automaton cannot count very far; with only n states, it cannot count as far as n. For this simple example, one could simply use something like (\u0323\u0302)\{0,7\}, which should be more than enough for any likely occurrences. It's an interesting challenge, but I think solving it provides satisfaction rather than practical benefit.

> However, the most difficult part for regexps supporting canonical equivalence is what to do about returning submatches: they are not necessarily contiguous in the input stream. [...]
Interestingly, ICU hides that detail from the user. For search and replace on a text buffer, the text to be replaced would be defined by a list of text intervals. If the text is unnormalised, some of the boundaries may divide precomposed characters. If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0. (U+0F73 and U+0F75 do not count.) If there is such an interval, it will be replaced and the others simply deleted. If there is no such interval, then the choice of insertion point may be more difficult. Indeed, in some cases it could be appropriate to reject the replacement command as undefined in the context. On the other hand, if the text buffer is normalised, then one would be able to have well-defined behaviour, as one does when splitting text into UCA collating elements.

Richard.

From richard.wordingham at ntlworld.com Thu May 14 14:29:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 14 May 2015 20:29:06 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net> References: <20150514070814.665a7a7059d7ee80bb4d670165c8327d.cc9235bbfb.wbe@email03.secureserver.net> Message-ID: <20150514202906.34c33518@JRWUBU2>

On Thu, 14 May 2015 07:08:14 -0700 "Doug Ewell" wrote:

> What you're looking for in this case is neither an NFC match nor an NFD match, but a language-dependent match, as you imply further down. <1ED9> decomposes to <006F 0323 0302>, and if you want a match with <00F4>, which decomposes to <006F 0302>, your regex engine has to reorder the marks. It sounds unlikely that you'll find such an engine, but there is a lot of Vietnamese-language-specific software out there, so you never know.

There's no more reordering than is involved in doing a Vietnamese collation-based search, where one has to split <006F 0323 0302> up into collating elements <006F 0302><0323>. Possibly a back-tracking regular expression would reorder the string.

My experimental canonical-equivalence-respecting regular expression engine is designed in the same manner as the Thompson construction - it is a non-deterministic finite automaton (except for the effects of capturing parts of the input string) composed of a hierarchy of non-deterministic finite automata. States are identified as strings of scalars following the hierarchy. The engine checks whether a string matches a regular expression. The engine decomposes the string to NFD. This keeps the automaton for the concatenation of two regular expressions simple. I will now show how it handles the search. The regular expression to match against is \u00f4.* - a character U+00F4 followed by anything, including nothing.
My program essentially produced the following output, with comments added later indicated by #:

$ ./regex '\u00f4.*' ộ    # Arguments are regular expression and string
Simple Unicode regex "\u006F\u0302"    # First half of regular expression as the automaton actually sees it.
Initial states: 0) L0    # Initial state - expecting 'o', in first half of expression. 'L' = left.
=o=10:20:=    # Gets 'o'
L0 => L1    # Changes state to expecting combining circumflex
=0323=20:30:=    # Gets combining dot below (U+0323)
L1 => N001220:1:*    # N => State for concatenation of regular expressions; both automata are run. 001 => Substring length within state identifier. 220 => Combining class of U+0323. Characters with this ccc or lower may no longer be processed by the left-hand automaton. : is punctuation for readability of state. 1 => Left half still expecting combining circumflex. * => only state for regex ".*".
=0302=30:06:=    # Gets combining circumflex (U+0302). The engine runs a non-deterministic finite automaton. It now branches to 3 states.
N001220:1:* => N001220:M:*    # Left half has now reached the end of the expected string.
N001220:1:* => R* (match)    # On transferring to an accept state of the \u00f4 automaton, only the .* automaton needs to be processed.
N001220:1:* => N001230:1:*    # Possibly the combining circumflex is to match the '.*'. The combining class is updated to 230. This 230 will actually block the \u00f4 automaton from reaching an accept state from this state, for a combining circumflex can henceforth only be considered by the .* automaton.

At no point has the input string been reordered.

Richard.

From wjgo_10009 at btinternet.com Thu May 14 15:40:19 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 14 May 2015 21:40:19 +0100 (BST) Subject: Tag characters In-Reply-To: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> Message-ID: <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost>

> What else would be possible if the same sort of technique were applied to another base character?

Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions of http://www.unicode.org/reports/tr51/tr51-2.html ?

Both colour pixel map and colour OpenType vector font solutions would be possible.

Colour voxel map and colour vector 3D solids solutions are worth thinking about too, as fun coding thought experiments that could possibly lead to useful practical results.

William Overington 14 May 2015

From shervinafshar at gmail.com Thu May 14 16:26:32 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 14 May 2015 14:26:32 -0700 Subject: Tag characters In-Reply-To: <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

> Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions

IMO, the industry-preferred longer-term solution for emoji (which is also discussed in that section, with a few existing examples) is not going to be based on characters.

- Shervin
From doug at ewellic.org Thu May 14 17:13:39 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 14 May 2015 15:13:39 -0700 Subject: Tag characters Message-ID: <20150514151339.665a7a7059d7ee80bb4d670165c8327d.ce7108c845.wbe@email03.secureserver.net>

http://www.unicode.org/L2/L2015/15107.htm points indirectly to:

http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf

which says:

> The proposal has two parts
>
> 1. Un-deprecate TAG characters E0020-E007E.

Hee hee. Hee hee.

> 2. Define a character as the 'base' for a following sequence of TAG characters that specifies a region or subregion to be represented using a sequence of TAG characters. There are two possibilities for the base character:
>
> a. Preferred: Use the Unicode 7.0 character WAVING WHITE FLAG:
> 1F3F3;WAVING WHITE FLAG;So;0;ON;;;;;N;;;;;
> The advantage is no new characters need be encoded.

"Add language to UTR #51 describing the mechanism given in 2A" means that U+1F3F3 will be the tag introducer, basically the "flag emoji" equivalent of U+E0001 LANGUAGE TAG.

I think I understand why the TAG/CANCEL TAG start-end mechanism which was invented for Plane 14 language tags wasn't reused for flag emoji. Adding U+E0002 FLAG TAG would have implied that the sequence ends with CANCEL TAG. Flags don't have scope, and there is no need to indicate the end of the sequence explicitly for scoping purposes, as there is with tagged text.

I assume that existing text with U+1F3F3 followed by no tag characters should continue to display the waving white flag glyph, whereas text conforming to this new mechanism should suppress that glyph and show the Scottish, Welsh, Delawarean, or Nordlending flag instead.

> Using the following notation -
> B designates the chosen base character (U+1F3F3 or new U+1F1E5)
> TL designates a TAG LATIN CAPITAL LETTER (A..Z)
> TD designates a TAG DIGIT (ZERO..NINE)
> TH designates TAG HYPHEN-MINUS
>
> - a well-formed sequence for designating flags for ISO 3166-1, 3166-2 or UN M49 codes would be
>
> B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))

Will the subdivision sequence always be exactly 3 characters long? CLDR ticket #8423 seems to say that ISO 3166-2 code elements that are only 1 or 2 characters long will be prepended with "xx" or "x" to make them all exactly 3. Obviously some research will need to be done to ensure this doesn't result in conflicts with existing code elements, and of course 3166-2 makes no promises that future assignments will deliberately avoid such a conflict.
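A quick way to see the shape of that grammar is to spell it out as a regular expression over the actual code points. The sketch below (Python, standard re module only) is built directly from the pattern quoted above with the preferred base U+1F3F3; it illustrates the proposal as written, not whatever mechanism is eventually standardized, and the example sequences are assumptions of mine:

```python
import re

# The proposal's well-formedness pattern over concrete code points:
#   B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
B  = '\U0001F3F3'                  # WAVING WHITE FLAG as the base
TL = '[\U000E0041-\U000E005A]'     # TAG LATIN CAPITAL LETTER A..Z
TD = '[\U000E0030-\U000E0039]'     # TAG DIGIT ZERO..NINE
TH = '\U000E002D'                  # TAG HYPHEN-MINUS

WELL_FORMED = re.compile(
    '{B}(?:{TL}{{2}}(?:{TH}(?:{TL}|{TD}){{3}})?|{TD}{{3}})$'.format(
        B=B, TL=TL, TD=TD, TH=TH))

# France (ISO 3166-1 FR): base + TAG F + TAG R -- Doug's <1F3F3 E0046 E0052>
fr = '\U0001F3F3\U000E0046\U000E0052'
# Scotland (ISO 3166-2 GB-SCT): base + G B + hyphen + S C T
sct = ('\U0001F3F3\U000E0047\U000E0042\U000E002D'
       '\U000E0053\U000E0043\U000E0054')
# World (UN M49 001): base + three tag digits
world = '\U0001F3F3\U000E0030\U000E0030\U000E0031'

for seq in (fr, sct, world):
    print(bool(WELL_FORMED.match(seq)))    # True, True, True
```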
Will both mechanisms, old and new, be available for encoding national flags? For example, for a French flag: <1F1EB 1F1F7> or <1F3F3 E0046 E0052>

> In CLDR 28, LDML will define a unicode_subdivision_subtag which also provides validity criteria for the codes used for regional subdivisions (see CLDR ticket #8423). When representing regional subdivisions using ISO 3166-2 codes, only those codes that are valid for the LDML unicode_subdivision_subtag should be used.

I note that a preliminary file is already available at http://unicode.org/repos/cldr/trunk/common/supplemental/subdivisions.xml .

-- Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Thu May 14 19:10:36 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 02:10:36 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514191324.1e455c57@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID:

2015-05-14 20:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> You've forgotten the basic problem. A *finite* state automaton cannot count very far; with only n states, it cannot count as far as n.

I did not forget it; this is why there is a second pass (or a second FSA running in parallel to indicate its own accept state). You have to combine the two state variables to get the final combined state and determine whether it is a final accept state. But one of the two state variables has an upper bound which is not only finite but very small (it has at most 255 possible values).

Typical regexp engines do not create the full deterministic automaton with all its states (it would require a lot of memory due to combinatorial effects); they handle multiple state variables in parallel and use a rendez-vous system to test them in order to determine whether we have an accept state or a fail state (for which we must roll back). So even if one of the states is not bounded in terms of length, the other one (exploring the possible runs of reorderable, non-blocking combining characters) is clearly finite, so you don't need to count very far. You just need a single additional byte of storage for the second state variable in the global state of your FSA.
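The "multiple state variables in parallel" point is the standard way of running a nondeterministic automaton without ever building its deterministic expansion. A minimal sketch follows (plain Python; the function name and the toy automaton are mine). Note how the literal automaton for (\u0323\u0302)* also illustrates the objection quoted above: once the marks arrive in NFD order, the pair-by-pair automaton dies.

```python
# Simulate a nondeterministic automaton by advancing a *set* of live
# states in parallel, instead of materializing the deterministic FSA.
def run_nfa(transitions, start, accept, s):
    """transitions maps (state, char) -> iterable of next states."""
    states = {start}
    for ch in s:
        states = {nxt for st in states
                      for nxt in transitions.get((st, ch), ())}
        if not states:              # every parallel branch has failed
            return False
    return bool(states & accept)

# The automaton for (\u0323\u0302)* taken literally: state 0 expects
# U+0323, state 1 expects U+0302, and 0 is the accept state.
t = {(0, '\u0323'): {1}, (1, '\u0302'): {0}}
print(run_nfa(t, 0, {0}, '\u0323\u0302\u0323\u0302'))  # True: two pairs
print(run_nfa(t, 0, {0}, '\u0323\u0323\u0302\u0302'))  # False: NFD order
```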
The size of the global state variable only depends on the number of alternatives in your regexp, and it is also bounded (limited by the length of the source regexp): even if your regexp specification string is 1000 characters long, you know that you will never need more than 1000 bytes to represent it, though of course it will not be a simple 32-bit integer. This structure can represent billions of billions of billions of possible states without transforming the FSA into a pure deterministic FSA driven by a single integer, and without having to build a single MxN transition matrix (with M columns for each possible character class and N rows for each deterministic state, where each cell contains the next deterministic state: that will not work). Even if your regexp is so complex that it requires a specification string 100KB long, your global state variable will never be longer than 100KB.

But of course, this structure is a bit less easy to use when advancing: you have to advance all active states in parallel, using the current input character with each transition submatrix (which is really small as well, with just a couple of elements that can fit in a small fixed-size structure: an accepted character, limited to 21 bits with Unicode, or a character-class index, plus a few flags - 3 bits - saying whether this character can advance, whether the current state is an accept state or a failure state, or indicating the presence of an alternative, with an index to its own branch specifying which elementary state variable is used for that alternative within the structure of the global state variable).

In summary, the global state variable is just a small array of 32-bit integers for the most complex regexps you will encounter (I don't think that 100KB regexps are very common; almost all of them are below 1KB, so your global state variable will fit in 4KB of memory, and the transition matrix will also fit in 4KB).

From verdy_p at wanadoo.fr Thu May 14 19:38:17 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 02:38:17 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150514191324.1e455c57@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID:

2015-05-14 20:13 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>:

> If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0.

This is not a sufficient condition. There is also the case where two intervals contain combining characters with the same combining class: their relative order is significant, because one blocks the other (it limits the allowed reorderings that are canonically equivalent). And if the replacement string also adds its own blockers, the situation is worse...
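The blocking point is easy to demonstrate with the standard normalizer. A Python one-off, using only unicodedata (the choice of U+0316/U+0317 as the same-class example marks is mine): marks of different combining classes commute under canonical equivalence, while marks of the same class do not.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize('NFD', s)

# Different classes commute: dot below (220) and circumflex (230).
a = 'o\u0323\u0302'
b = 'o\u0302\u0323'
print(nfd(a) == nfd(b))   # True: same canonical form, equivalent strings

# Equal classes do not: grave below and acute below are both class 220,
# so their relative order is significant -- no reordering is allowed.
c = 'o\u0316\u0317'
d = 'o\u0317\u0316'
print(nfd(c) == nfd(d))   # False: not canonically equivalent
```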
There's no simple way to determine what to do by just returning a replacement string that the regexp engine will insert itself in the output text. The best that can be done is for the regexp engine to give a full view not only of the characters within matches, but also of the characters in the middle that are not part of the match: instead of performing the insertion itself (from a single expression for the replacement text), it would call a callback function that also analyses the non-matched characters in the middle to decide what to do with them. You should then be able to choose between several replacement patterns, including placeholders for the unmatched intervals - say, numbered placeholders with negative values $-1, $-2, ..., with positive or null numbers used for the classical array of matched captures $0, $1, ... (But for these additional captures that are not part of the match, you need a way to indicate their placement within the true matched captures; and not all positive captures share the same set of negative captures, nor at the same positions.)

Note that to make sure we can perform safe replacements within normalized text, and to make sure the result will also be normalized, the negative captures need to include some characters that are not in the middle of a match: all the other combining characters with non-zero combining class before the matched string (if the matched string does not start with a character with combining class 0), and those after it that have a higher combining class than the last character in the positive capture. If the positive capture is an empty string, the first negative capture will include all combining characters with distinct non-zero combining classes before the insertion point of that empty positive capture, and the second one will include all non-zero combining characters after the insertion point that have distinct non-zero combining classes. (These two negative captures are bounded in length to at most 255 characters, just like the negative captures added for parts of the input that are in the middle of a positive capture.)

So far I have never seen any regexp engine supporting the concept of "negative captures"; all of them return only positive ones, including when they allow the replacement to be a callback rather than just a static string with optional placeholders.

From petercon at microsoft.com Thu May 14 21:44:21 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 02:44:21 +0000 Subject: Tag characters In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

And yet UTC devotes lots of effort (with an entire subcommittee) to encoding more emoji as characters, but no effort toward any preferred longer-term solution not based on characters.
Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shervin Afshar Sent: Thursday, May 14, 2015 2:27 PM To: wjgo_10009 at btinternet.com Cc: unicode at unicode.org Subject: Re: Tag characters

> IMO, the industry-preferred longer-term solution for emoji (which is also discussed in that section, with a few existing examples) is not going to be based on characters. [...]

From shervinafshar at gmail.com Thu May 14 22:11:37 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 14 May 2015 20:11:37 -0700 Subject: Future of Emoji? (was Re: Tag characters) Message-ID:

Peter,

This very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers), either through expansions to section 8 of TR51 or through some other means. It should also be noted that none of the current members of Unicode seem to have a sticker-based implementation (with the exception of an experimental limited trial by Twitter[1]).

[1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/

- Shervin

On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote: > And yet UTC devotes lots of effort (with an entire subcommittee) to encoding more emoji as characters, but no effort toward any preferred longer-term solution not based on characters. [...]
From petercon at microsoft.com Thu May 14 23:37:37 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 04:37:37 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

Skype uses stickers, including animated stickers. Here's the documented set:

https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons

And if you search, you'll find lots more 'hidden' emoticons, like '(bartlett)'.

Peter

From: Shervin Afshar [mailto:shervinafshar at gmail.com] Sent: Thursday, May 14, 2015 8:12 PM To: Peter Constable Cc: unicode at unicode.org Subject: Future of Emoji? (was Re: Tag characters)

> This very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers) [...]
[1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf [3]: http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html [4]: https://creator.line.me/en/guideline/ ? Shervin On Thu, May 14, 2015 at 9:37 PM, Peter Constable wrote: > Skype uses stickers, including animated stickers. Here?s the documented > set: > > > > https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons > > > > And if you search, you?ll find lots more ?hidden? emoticons, like > ?(bartlett)?. > > > > > > > > Peter > > > > > > *From:* Shervin Afshar [mailto:shervinafshar at gmail.com] > *Sent:* Thursday, May 14, 2015 8:12 PM > *To:* Peter Constable > *Cc:* unicode at unicode.org > *Subject:* Future of Emoji? (was Re: Tag characters) > > > > Peter, > > > > This very topic was discussed in last meeting of the subcommittee and my > impression is that there are plans to promote the use of embedded graphics > (aka stickers) either through expansions to section 8 of TR51 or through > some other means. It should also be noted that none of current members of > Unicode seem to have a sticker-based implementation (with the exception of > an experimental limited trial by Twitter[1]). > > > > [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 7:44 PM, Peter Constable > wrote: > > And yet UTC devotes lots of effort (with an entire subcommittee) to > encode more emoji as characters, but no effort toward any preferred longer > term solution not based on characters. > > > > > > Peter > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Shervin > Afshar > *Sent:* Thursday, May 14, 2015 2:27 PM > *To:* wjgo_10009 at btinternet.com > *Cc:* unicode at unicode.org > *Subject:* Re: Tag characters > > > > Thinking about this further, could the technique be used to solve the > requirements of > section 8 Longer Term Solutions > > > > IMO, the industry preferred longer term solution (which is also discussed > in that section with few existing examples) for emoji, is not going to be > based on characters. > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > > > What else would be possible if the same sort of technique were applied > to another base character? > > > Thinking about this further, could the technique be used to solve the > requirements of > > section 8 Longer Term Solutions > > of > > http://www.unicode.org/reports/tr51/tr51-2.html > > ? > > > Both colour pixel map and colour OpenType vector font solutions would be > possible. > > > Colour voxel map and colour vector 3d solids solutions are worth thinking > about too as fun coding thought experiments that could possibly lead to > useful practical results. > > > > > William Overington > > > 14 May 2015 > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Fri May 15 02:10:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 08:10:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515081003.1984d0c4@JRWUBU2> On Fri, 15 May 2015 02:10:36 +0200 Philippe Verdy wrote: > 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Thu, 14 May 2015 12:58:29 +0200 > > Philippe Verdy wrote: > > > > > 2015-05-14 9:59 GMT+02:00 Richard Wordingham < > > > richard.wordingham at ntlworld.com>: > > > > > > > An elegant formal solution to the Kleene star problem interprets > > > > (\u0323\u0302)* as (\u0323|\u0302)*. However, that is > > > > counter-intuitive > > > > The technical term for this is the 'concurrent iteration' - or at > > least, that's the term used in the 'Book of Traces'. > > > > > For your example "(\u0323\u0302)*" the characters in the > > > alternatives (COMBINING DOT BELOW and COMBINING ACUTE ACCENT), > > > once converted to NFD (which is the same here) are just using at > > > most two distinct non-zero combining classes and no blocker; so > > > it is safe to trasform it to (\u0323|\u0302)* for a first pass > > > matching that will then only check candidate matchs in the second > > > pass. or more efficiently, a second finite state automata (FSA) > > > running in parallel with its own state: > > > > You've forgotten the basic problem. A *finite* state automaton > > cannot count very far; with only n states, it cannot count as far > > as n. > > > > I did not forget it, this is why there's a second pass (or a second > FSA running in parallel to indicate its own accept state). You have > to combine the two states variables to get the final combined state > to determine if it is a final accept state. Your description makes no sense to me as a description of a finite state automaton. Now, a program to check whether a trace matching {\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just counts the number of times \u0323 occurs and the number of times \u0302 occurs, and returns whether they are equal. The two counters are the key variables (and one could just keep the difference in the counts). However, this is not a finite state automaton. Now, to some extent I am cheating by assuming that the characters are delivered in NFD order. If I did not do this, to construct the non-deterministic finite automaton (NDFA) for the concatenation of two sets / regular expressions, the triples (x, y, n) of (left NDFA state, right NDFA state, highest non-zero ccc assigned to righthand component) would need to be expanded. The third component would become a list of non-zero ccc's - in principle 2^254 values, but in fact rather fewer as not all 255 ccc values are used by Unicode. It is still finite. I prefer to keep the complexity out of the regular expression engine proper. Given a NDFA recognising a set of NFD strings, one can convert it to a deterministic finite automaton (DFA), say X, provided one does not run out of memory or time. One can then 'easily' construct a DFA Y recognising the canonical equivalents of the strings. The state in DFA Y reached by string x is defined to be the state reached by the string to_NFD(x) in DFA X. This method relies on the identity to_NFD(to_NFD(x)z) = to_NFD(xz). 
This handles the recognition of a string canonically equivalent to \u0323\u0302. (The constructions above are sledge hammers; the NDFAs have many unreachable states.) However, recognising canonical equivalents of (\u0323\u0302)* via an FSM is rather more difficult; to be precise, it cannot be done by an FSM. Richard. From abdo.alrhman.aiman at gmail.com Fri May 15 09:18:47 2015 From: abdo.alrhman.aiman at gmail.com (=?UTF-8?B?2LnYqNivINin2YTYsdit2YXYp9mGINij2YrZhdmG?=) Date: Fri, 15 May 2015 17:18:47 +0300 Subject: Arabic diacritics Message-ID: hi, regarding the Arabic diacritics. e.g. for the Shadda, we have: 1. The form that people type: http://unicode-table.com/en/0651/ 2. An Isolated form. It should be the same, but looks different in the Unicode table, which is confusing me now. http://unicode-table.com/en/FE7C/ 3. A medial form: http://unicode-table.com/en/FE7D/ When do I use 1/2, and when do I use 3? some diacritics has e.g. isolated and medial forms. Some have only one of these forms, some have both. So, where does each of them go? respectfully -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Fri May 15 10:45:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 16:45:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515164503.2c8624f0@JRWUBU2> On Fri, 15 May 2015 02:38:17 +0200 Philippe Verdy wrote: > 2015-05-14 20:13 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > If the interval list is compacted, at most one of the intervals will > > contain a character properly having combining class 0. > > This is not a sufficent condition, there is also the case where two > intervals contain combining characters with the same combining class: > their relative order is significant because one is blocking the other > (it limits the alllowed reorderings that are canonically equivalent). If two fully decomposed characters of combining class 0 are included in the match to a subexpression, all the characters between them will be included. The needs you perceive would be met by providing the start and end points of the locations of the non-starters flanking the matching string on the sides where it starts with a non-starter or ends with a character with non-zero rccc. (U+00E2 would probably have to count as a non-starter for your purposes.) However, I'm not sure that passing the positions would not suffice. Don't forget that the input string can be rearranged, preserving canonical equivalence, so that the captured string is actually contiguous. I think this discussion on search and replace would benefit from some examples. I don?t see your problem. Is it based on experience? I have some fairly simple examples. My first example is the replacement of ? by U+00E2 in the 4-character string bu?c . U+1ED9 has the full decomposition . The substring ? has the discontiguous position, in inclusive:exclusive notation: Component 1 at Position 2:Component 2 at Position 2 (content U+006F) Component 3 at Position 2:Whole at Position 3 (content U+0302) Now, the regular expression syntax for an identified substring suggests that it is contiguous. For substitution, it therefore makes most sense to view the whole string as though it were the canonically equivalent , a form in which the identified substring is contiguous. 
From abdo.alrhman.aiman at gmail.com Fri May 15 09:18:47 2015 From: abdo.alrhman.aiman at gmail.com (عبد الرحمان أيمن) Date: Fri, 15 May 2015 17:18:47 +0300 Subject: Arabic diacritics Message-ID:

hi, regarding the Arabic diacritics: e.g. for the Shadda, we have:
1. The form that people type: http://unicode-table.com/en/0651/
2. An isolated form. It should be the same, but looks different in the Unicode table, which is confusing me now. http://unicode-table.com/en/FE7C/
3. A medial form: http://unicode-table.com/en/FE7D/
When do I use 1/2, and when do I use 3? Some diacritics have, e.g., isolated and medial forms. Some have only one of these forms, some have both. So, where does each of them go?
respectfully

From richard.wordingham at ntlworld.com Fri May 15 10:45:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 16:45:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> Message-ID: <20150515164503.2c8624f0@JRWUBU2>

On Fri, 15 May 2015 02:38:17 +0200 Philippe Verdy wrote:
> 2015-05-14 20:13 GMT+02:00 Richard Wordingham:
> > If the interval list is compacted, at most one of the intervals will contain a character properly having combining class 0.
>
> This is not a sufficient condition; there is also the case where two intervals contain combining characters with the same combining class: their relative order is significant because one is blocking the other (it limits the allowed reorderings that are canonically equivalent).

If two fully decomposed characters of combining class 0 are included in the match to a subexpression, all the characters between them will be included. The needs you perceive would be met by providing the start and end points of the locations of the non-starters flanking the matching string on the sides where it starts with a non-starter or ends with a character with non-zero rccc. (U+00E2 would probably have to count as a non-starter for your purposes.) However, I'm not sure that passing the positions would not suffice. Don't forget that the input string can be rearranged, preserving canonical equivalence, so that the captured string is actually contiguous.

I think this discussion on search and replace would benefit from some examples. I don't see your problem. Is it based on experience? I have some fairly simple examples.

My first example is the replacement of ô by U+00E2 in the 4-character string buộc <U+0062, U+0075, U+1ED9, U+0063>. U+1ED9 has the full decomposition <U+006F, U+0323, U+0302>. The substring ô has the discontiguous position, in inclusive:exclusive notation: Component 1 at Position 2:Component 2 at Position 2 (content U+006F), and Component 3 at Position 2:Whole at Position 3 (content U+0302). Now, the regular expression syntax for an identified substring suggests that it is contiguous. For substitution, it therefore makes most sense to view the whole string as though it were the canonically equivalent <U+0062, U+0075, U+006F, U+0302, U+0323, U+0063>, a form in which the identified substring is contiguous. Replacement should therefore create something canonically equivalent to <U+0062, U+0075, U+00E2, U+0323, U+0063>.

In terms of program logic, I would expect the string editing to proceed something like this:
1. Decompose characters that straddle range boundaries, so:
   a. String becomes <U+0062, U+0075, U+006F, U+0323, U+0302, U+0063>
   b. Identified substring location updates to:
      i. Whole at Position 2: Whole at Position 3 (content U+006F)
      ii. Whole at Position 4: Whole at Position 5 (content U+0302)
2. First portion contains a character with canonical combining class 0, so replace it by the replacement string.
3. Delete other portions.
4. Apply any normalisation requirements.

For my second example, let the replacement string be <…> instead. I would expect the same logic to apply, yielding a substring <…>, and would not be concerned by its not being canonically equivalent to <…>.

For my third example, consider the replacement of U+0302 by U+031B COMBINING HORN in the 6-character string buộc <U+0062, U+0075, U+006F, U+0323, U+0302, U+0063>. The character is at location Whole at Position 4:Whole at Position 5. The identified substring does not contain any characters of canonical combining class 0. U+031B has ccc=216 and U+0323 has ccc=220, so it matters little how the characters between U+006F and U+0063 are arranged - the results are canonically equivalent and the substitution should be made without complaint.

For my fourth example, consider again the replacement of U+0302 in the 6-character string buộc <U+0062, U+0075, U+006F, U+0302, U+0323, U+0063>, but this time by U+0068 LATIN SMALL LETTER H. We now have a problem. Applying the substitution at this location yields the string buoḥc (dot below the 'h'), while applying the substitution to the string in NFD form yields buọhc (dot below the 'o'), which is visually distinct.

In some ways this is similar to the problem of grouping text into collating elements for collation. The Unicode Collation Algorithm resolves conflicts on the basis of the NFD form. Requiring the string to be in strict NFD might not be suitable - it breaks compatibility ideographs. Also, I can imagine wanting to make global substitutions so as to undo ill effects of normalisation. There are many different ways to handle the problem, and I can imagine a rich selection of flags for a substitution routine. I would urge, however, that the replacement text should be contiguous in some canonical equivalent of the resulting string.

Richard.
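A minimal sketch of the four-step edit above, applied to the first example (Python; replace_discontiguous and its portion-list interface are hypothetical, not an existing API):

    import unicodedata

    def nfd(s):
        return unicodedata.normalize('NFD', s)

    def replace_discontiguous(text, portions, replacement):
        """Sketch of the four-step edit. `portions` is an ordered list of
        (start, end) ranges into the NFD form of `text`; the first portion is
        assumed to hold the base character (ccc=0), per step 2."""
        s = nfd(text)                       # step 1: decompose
        out, prev = [], 0
        for i, (start, end) in enumerate(portions):
            out.append(s[prev:start])
            if i == 0:
                out.append(replacement)     # step 2: replace the first portion
            prev = end                      # step 3: drop the other portions
        out.append(s[prev:])
        return unicodedata.normalize('NFC', ''.join(out))  # step 4: renormalize

    # Example 1: in "buộc" (NFD: b u o U+0323 U+0302 c) the match for "ô" is
    # <o at 2:3> plus <U+0302 at 4:5>; replacing it by U+00E2 gives "buậc".
    assert replace_discontiguous('bu\u1ED9c', [(2, 3), (4, 5)], '\u00E2') == 'bu\u1EADc'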
From petercon at microsoft.com Fri May 15 10:46:39 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 15:46:39 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

MSN Messenger supported extensible stickers years ago. A couple of sites still offering add-ons:
http://www.getsmile.com/
http://www.smileys4msn.com/

Peter

From: Shervin Afshar [mailto:shervinafshar at gmail.com] Sent: Thursday, May 14, 2015 10:40 PM Subject: Re: Future of Emoji? (was Re: Tag characters)
> Good point. I missed these while looking into compatibility symbols. Of course, as with the Yahoo[1] and MSN[2] Messenger emoji sets, most of these are mappable to current or proposed sets of Unicode emoji (e.g. Lips Sealed → U+1F910 ZIPPER-MOUTH FACE). It would be interesting to see how the extended support for flags, most smiley faces, objects, etc. on all platforms would affect this approach. My idea of a sticker-based solution is something more like Facebook's[3] or Line's[4] implementations.
> [1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf
> [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf
> [3]: http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html
> [4]: https://creator.line.me/en/guideline/
> – Shervin
> On Thu, May 14, 2015 at 9:37 PM, Peter Constable wrote:
>> Skype uses stickers, including animated stickers. Here's the documented set: https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons And if you search, you'll find lots more 'hidden' emoticons, like '(bartlett)'. Peter
>> From: Shervin Afshar Sent: Thursday, May 14, 2015 8:12 PM Subject: Future of Emoji? (was Re: Tag characters)
>>> Peter, this very topic was discussed in the last meeting of the subcommittee, and my impression is that there are plans to promote the use of embedded graphics (aka stickers), either through expansions to section 8 of TR51 or through some other means. It should also be noted that none of the current members of Unicode seem to have a sticker-based implementation (with the exception of an experimental limited trial by Twitter[1]). [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ – Shervin
>>> On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote:
>>>> And yet UTC devotes lots of effort (with an entire subcommittee) to encode more emoji as characters, but no effort toward any preferred longer term solution not based on characters. Peter
>>>> From: Shervin Afshar Sent: Thursday, May 14, 2015 2:27 PM Subject: Re: Tag characters
>>>>> IMO, the industry-preferred longer term solution (which is also discussed in that section with a few existing examples) for emoji is not going to be based on characters. – Shervin
>>>>> On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington wrote:
>>>>>> > What else would be possible if the same sort of technique were applied to another base character?
>>>>>> Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions of http://www.unicode.org/reports/tr51/tr51-2.html ? Both colour pixel map and colour OpenType vector font solutions would be possible. Colour voxel map and colour vector 3d solids solutions are worth thinking about too, as fun coding thought experiments that could possibly lead to useful practical results. William Overington, 14 May 2015

From petercon at microsoft.com Fri May 15 10:57:56 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 15 May 2015 15:57:56 +0000 Subject: Future of Emoji? (was Re: Tag characters) In-Reply-To: References: Message-ID:

Ah, yes. And Messenger 'winks'. E.g., http://www.msn-tools.net/free-msn-winks-1.htm
I note that this has .swf files, and that's what we saw one of the Japanese carriers saying they'd be moving to instead of PUA characters.

Peter
From moyogo at gmail.com Fri May 15 11:09:29 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Fri, 15 May 2015 16:09:29 +0000 Subject: Arabic diacritics In-Reply-To: References: Message-ID:

You should use ARABIC SHADDA U+0651 in all positions. The presentation forms (isolated, medial, final forms) are for compatibility with legacy systems. See what is said in http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf about the Arabic Presentation Forms-B.

Cheers,

On Fri, 15 May 2015 at 15:53 عبد الرحمان أيمن wrote:
> hi, regarding the Arabic diacritics: e.g. for the Shadda, we have [...] So, where does each of them go?
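The compatibility relationship Denis describes is recorded in the UCD and is visible from Python's unicodedata (a quick sketch):

    import unicodedata

    # U+0651 is the character to use in new text; the Arabic Presentation
    # Forms-B code points carry compatibility decompositions back to it.
    print(unicodedata.decomposition('\u0651'))   # '' (no decomposition)
    print(unicodedata.decomposition('\uFE7C'))   # '<isolated> 0020 0651'
    print(unicodedata.decomposition('\uFE7D'))   # '<medial> 0640 0651'

    # NFKC folds the legacy forms to SPACE/TATWEEL plus the real shadda:
    assert unicodedata.normalize('NFKC', '\uFE7C') == ' \u0651'
    assert unicodedata.normalize('NFKC', '\uFE7D') == '\u0640\u0651'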
> > > > [1]: http://www.unicode.org/L2/L2015/15059-emoji-im-yahoo.pdf > > [2]: http://www.unicode.org/L2/L2015/15058-emoji-im-msn.pdf > > [3]: > http://www.huffingtonpost.com/2014/10/14/facebook-stickers-comments_n_5982546.html > > [4]: https://creator.line.me/en/guideline/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 9:37 PM, Peter Constable > wrote: > > Skype uses stickers, including animated stickers. Here?s the documented > set: > > > > https://support.skype.com/en/faq/FA12330/what-is-the-full-list-of-emoticons > > > > And if you search, you?ll find lots more ?hidden? emoticons, like > ?(bartlett)?. > > > > > > > > Peter > > > > > > *From:* Shervin Afshar [mailto:shervinafshar at gmail.com] > *Sent:* Thursday, May 14, 2015 8:12 PM > *To:* Peter Constable > *Cc:* unicode at unicode.org > *Subject:* Future of Emoji? (was Re: Tag characters) > > > > Peter, > > > > This very topic was discussed in last meeting of the subcommittee and my > impression is that there are plans to promote the use of embedded graphics > (aka stickers) either through expansions to section 8 of TR51 or through > some other means. It should also be noted that none of current members of > Unicode seem to have a sticker-based implementation (with the exception of > an experimental limited trial by Twitter[1]). > > > > [1]: http://mashable.com/2015/04/16/twitter-star-wars-emoji/ > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 7:44 PM, Peter Constable > wrote: > > And yet UTC devotes lots of effort (with an entire subcommittee) to > encode more emoji as characters, but no effort toward any preferred longer > term solution not based on characters. > > > > > > Peter > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Shervin > Afshar > *Sent:* Thursday, May 14, 2015 2:27 PM > *To:* wjgo_10009 at btinternet.com > *Cc:* unicode at unicode.org > *Subject:* Re: Tag characters > > > > Thinking about this further, could the technique be used to solve the > requirements of > section 8 Longer Term Solutions > > > > IMO, the industry preferred longer term solution (which is also discussed > in that section with few existing examples) for emoji, is not going to be > based on characters. > > > > > ? Shervin > > > > On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > > > What else would be possible if the same sort of technique were applied > to another base character? > > > Thinking about this further, could the technique be used to solve the > requirements of > > section 8 Longer Term Solutions > > of > > http://www.unicode.org/reports/tr51/tr51-2.html > > ? > > > Both colour pixel map and colour OpenType vector font solutions would be > possible. > > > Colour voxel map and colour vector 3d solids solutions are worth thinking > about too as fun coding thought experiments that could possibly lead to > useful practical results. > > > > > William Overington > > > 14 May 2015 > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
From verdy_p at wanadoo.fr Fri May 15 15:09:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 22:09:13 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515081003.1984d0c4@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> Message-ID:

2015-05-15 9:10 GMT+02:00 Richard Wordingham:
> Your description makes no sense to me as a description of a finite state automaton.

This is because you don't understand the issue!

> Now, a program to check whether a trace matching (\u0323|\u0302)* matches (\u0323\u0302)* is very simple. It just counts the number of times \u0323 occurs and the number of times \u0302 occurs, and returns whether they are equal.

This is wrong. \u0323\u0323\u0302\u0302 and \u0323\u0302\u0323\u0302 would pass your counting test (which does not work in an FSA), but they are NOT canonically equivalent, because the identical combining characters are blocking each other (so arbitrary ordering is not possible).

I maintain what I said: you don't need arbitrary counting, and an FSA is possible (both an NFA using compound states, and the derived DFA if ever you want to resolve compound states to a single integer, though the transition tables will explode dramatically). Once again, we cannot have pairs of strings where you cannot isolate BOUNDED substrings (between blockers) whose canonical equivalence you can check. At most you'll have only 254 combining characters to check that have distinct non-zero combining classes. So the FSA implementation is perfectly possible, for canonical equivalences only... This evidently does not work if you are performing regexp searches using looser equivalences, such as compatibility equivalence.
From verdy_p at wanadoo.fr Fri May 15 15:21:56 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 15 May 2015 22:21:56 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515164503.2c8624f0@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515164503.2c8624f0@JRWUBU2> Message-ID:

2015-05-15 17:45 GMT+02:00 Richard Wordingham:
> I think this discussion on search and replace would benefit from some examples. I don't see your problem. Is it based on experience? I have some fairly simple examples.

Just consider a regexp that attempts to search for and substitute "é" (for example by "è"), and that has to locate it wherever it is, in NFC form (a single character) or NFD form (a combining sequence). You'll also have to match cases where there are other intermediate combining characters (with a distinct non-zero combining class, different from the combining class of the acute accent) between the base letter and the acute accent. You then have to return discontiguous matches, but your replacement string "è" should still preserve the other combining characters.

The situation is even worse if you are looking for strings in which you want to discard only some combining characters (the replacement is empty): there may be several discontiguities in the matches. Now imagine that the replacement string is to replace all these distinct combining characters by a single one. (Such things would be done by filters that want to eliminate combining characters not suitable for a given language, or because a linguistic orthographic rule permits substitution of foreign combining characters, e.g. dropping combining dots above, or replacing all combining characters below, except the cedilla, by a single one such as a low line. The same would happen for languages that have changed or simplified their orthography of combining characters, or that use two distinct orthographic conventions that you want to convert between.)

From richard.wordingham at ntlworld.com Fri May 15 16:57:03 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 22:57:03 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> Message-ID: <20150515225703.20771426@JRWUBU2>

On Fri, 15 May 2015 22:09:13 +0200 Philippe Verdy wrote:
> This is wrong. \u0323\u0323\u0302\u0302 and \u0323\u0302\u0323\u0302 would pass your counting test (which does not work in an FSA), but they are NOT canonically equivalent, because the identical combining characters are blocking each other (so arbitrary ordering is not possible).
TUS7.0: D108 Reorderable pair: Two adjacent characters A and B in a coded character sequence are a Reorderable Pair if and only if ccc(A) > ccc(B) > 0.

Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0323) is a reorderable pair.

TUS7.0: D109 Canonical Ordering Algorithm: In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs.

The normalisation process on <U+0323, U+0302, U+0323, U+0302> first replaces it by <U+0323, U+0323, U+0302, U+0302>. There are then no more reorderable pairs, so that has reduced it to form NFD. Therefore <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> *are* canonically equivalent.

> So the FSA implementation is perfectly possible, for canonical equivalences only...

I now vaguely follow your argument, but it depends on the erroneous claim that <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> are not canonically equivalent.

> This evidently does not work if you are performing regexp searches using looser equivalences, such as compatibility equivalence.

I completely fail to understand this remark; it makes no difference whether one uses canonical or compatibility equivalence.

Richard.
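D108 and D109 are short enough to run directly; a sketch of the bubble-sort formulation of canonical reordering (Python, assuming fully decomposed input), which shows the two disputed sequences falling together:

    import unicodedata

    def canonical_order(marks: str) -> str:
        """D109 as a bubble sort: swap adjacent Reorderable Pairs
        (D108: ccc(A) > ccc(B) > 0) until none remain."""
        s = list(marks)
        changed = True
        while changed:
            changed = False
            for i in range(len(s) - 1):
                a = unicodedata.combining(s[i])
                b = unicodedata.combining(s[i + 1])
                if a > b > 0:
                    s[i], s[i + 1] = s[i + 1], s[i]
                    changed = True
        return ''.join(s)

    # <U+0323, U+0302, U+0323, U+0302> has one reorderable pair and sorts to
    # <U+0323, U+0323, U+0302, U+0302>, so the two strings are equivalent.
    assert canonical_order('\u0323\u0302\u0323\u0302') == '\u0323\u0323\u0302\u0302'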
From verdy_p at wanadoo.fr Fri May 15 17:31:53 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 00:31:53 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515225703.20771426@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> Message-ID:

2015-05-15 23:57 GMT+02:00 Richard Wordingham:
> Now, ccc(U+0302) = 230 > 220 = ccc(U+0323) > 0, so (U+0302, U+0323) is a reorderable pair.

I do NOT contest that U+0323 and U+0302 can reorder, but the fact that U+0323 blocks another occurrence of U+0323, because it has the **same** combining class.

From richard.wordingham at ntlworld.com Fri May 15 17:54:22 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 15 May 2015 23:54:22 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> Message-ID: <20150515235422.3e347dc3@JRWUBU2>

On Sat, 16 May 2015 00:31:53 +0200 Philippe Verdy wrote:
> I do NOT contest that U+0323 and U+0302 can reorder, but the fact that U+0323 blocks another occurrence of U+0323, because it has the **same** combining class.

How does that stop <U+0323, U+0323, U+0302, U+0302> and <U+0323, U+0302, U+0323, U+0302> being canonically equivalent?

TUS7.0: D109 'Canonical Ordering Algorithm' says: "In a decomposed character sequence D, exchange the positions of the characters in each Reorderable Pair until the sequence contains no more Reorderable Pairs."

There is no mention of blocking in D109.

Richard.

From verdy_p at wanadoo.fr Fri May 15 19:04:55 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 02:04:55 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150515235422.3e347dc3@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID:

But do you agree that we still need to match pairs of distinct characters in your example? If you just count the total, it will be wrong with (\u0302\u0302\u0323)* if you transform it into (\u0302|\u0302|\u0323)*, which is fully equivalent to (\u0302|\u0323)*, because what you want is not matching pairs but triples (your counter check would now have to make sure there are twice as many occurrences of \u0302 as occurrences of \u0323).
If not, you need to roll back (by one or two characters, possibly more) until you satisfy the condition, but you won't know just by seeing the characters and advancing that your sequence is terminated: it is only at the end that you have to do this check, and only then can you roll back.

The analysis cannot be deterministic, or it requires keeping track of all acceptable positions previously seen that could satisfy the condition; as the sequence for (\u0302\u0302\u0323)* can be extremely long, keeping this track for possible rollbacks could be costly. For example, consider this regexp:

(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

Can you still transform it and correctly infer the type of counters you need for the final check (before rollbacks) if you replace it with (\u0302|\u0323)*(\u0302|\u0303)*\u0302, which is fully equivalent to (\u0302|\u0303|\u0323)*\u0302? You'd need to check that there are exactly:
- (2n+1) occurrences of \u0302
- (n) occurrences of \u0303
- (n) occurrences of \u0323

But it won't be enough, because \u0302 and \u0303 have the same combining class and cannot be reordered. So within the first regexp, (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, the first iterated subregexp will need to scan first within the part that the second iterated subregexp is to match, and you cannot predict where it will stop. It may even scan completely through it (if you have not encountered any \u0303) and have eaten the last \u0302. At this point, you may see that the first iterated subregexp cannot contain any \u0303, so the first rollback to do will be to roll back to just before the first occurrence of \u0303. But the counter check may still be wrong, and you'll have to roll back through one or two occurrences of \u0302 in order to find the location where the first iterated subregexp is satisfied. At that point the one or two occurrences of \u0302 that you've rolled back will be counted as being part of the second iterated subregexp, or may even be the final occurrence needed to match the end of the regexp.

I don't see how you can support this regexp with a DFA; you absolutely need an NFA (and the counters you want to add do not offer any decisive help).
From mark at macchiato.com Fri May 15 19:18:56 2015 From: mark at macchiato.com (Mark Davis ☕️) Date: Fri, 15 May 2015 17:18:56 -0700 Subject: Tag characters In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID:

The consortium is in no position to enhance protocols *itself* for exchanging images. That's firmly in other groups' hands. We can try to noodge them a bit, but what *will* make a difference is when the *vendors* of sticker solutions put pressure on the different groups responsible for the protocols to provide interoperability for images. Because there is a lot of growth in sticker solutions, I would expect there to be more such pressure. And even so, I expect it will take some time for those to be deployed.

We've said what our longer-term position is, and I think we all pretty much agree with that; exchanging images is much more flexible. However, we do have strong short-term pressure to show that we are responsive and responsible in adding emoji. And our adding a reasonable number of emoji per year is not going to stop Line or Skype from adding stickers! There are a few possible scenarios, and it's hard to predict the results. It could be that emoji are largely supplanted by stickers in 5 years; could be 10; could be that they both coexist indefinitely. I have no 🔮, and neither does anyone else...

Mark

*« Il meglio è l'inimico del bene »*

On Thu, May 14, 2015 at 7:44 PM, Peter Constable wrote:
> And yet UTC devotes lots of effort (with an entire subcommittee) to encode more emoji as characters, but no effort toward any preferred longer term solution not based on characters. [...]
From verdy_p at wanadoo.fr Fri May 15 19:41:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 02:41:33 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID:

With an NFA, the representation is completely different. The regexp

(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302

is just transformed into:

(⊢\u0302⊢\u0302⊢\u0323|⊢\u0302⊢\u0323⊢\u0302|⊢\u0302⊢\u0302⊢\u0323)*(⊢\u0302⊢\u0303|⊢\u0303⊢\u0302)*⊢\u0302⊢

where I noted with the "tack" (⊢) the 15 relative positions **in this new regexp** where there's a need to check if the input matches a character or character class. Note that in this transform, all allowed permutations of canonically equivalent substrings are added; given that these substrings are bounded in length in the initial regexp, and there's a limited number of permutations, the result is still bounded.

The state of the NFA is represented as a set of these positions (here a bitset with 15 bits). The initial state has only the first bit set to true; the final accept state must have just the 15th bit set to true. When you scan the input, you have to test the input character at each position whose bit is currently true in the state, checking whether the associated character or character class matches the input, and then advance this bit. For that you use a second, separate bitset, initially empty (all 15 bits set to false): to advance bit n in the state, you set bit (n+1) to true in the second bitset, but to advance from bit 15, you don't set any.

You may also want to avoid generating these permutations:

(⊢\u0302⊢\u0302⊢\u0323)*(ˀ⊢\u0302⊢\u0303)*ˀ⊢\u0302⊢

Here I noted with the "combining glottal stop" (ˀ) the positions where you have to count the characters ONLY in the subsequence, i.e. my "tacks" that are just after the asterisks. However, in both cases (either with the generated permutations, or using counters) you'll need to use backtracking for rollbacks. Performing a rollback in an NFA is not easy! You have to remember the bitsets representing the state of the NFA before you advanced it to the next state... If you did not want to generate the permutations but only use counters, the backtrace to keep must also contain these counters (I have no idea how to safely roll back those counters; my opinion is that it will not work, and that generating the permutations, even if it increases the number of "tack" positions in the transformed regexp, is MUCH simpler and does not really incur a significant cost in terms of memory).

But it's true that the allowed reorderings implied by canonical equivalences (and those that are NOT allowed because they are blocked) are really challenging!
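For what it's worth, the position-set scheme described above is essentially the textbook NFA simulation; a minimal sketch in Python for the plain-string reading of the first starred group plus the trailing mark (the three alternatives here are the distinct permutations; canonical reordering and the rollback question are deliberately left out):

    X, Y = '\u0302', '\u0323'                   # the two marks in the example
    ALTS = [(X, X, Y), (X, Y, X), (Y, X, X)]     # permutations inside the star
    FINAL = X                                    # the trailing \u0302

    def matches(s: str) -> bool:
        # NFA state = set of positions ("tacks"); every live position is
        # carried forward in parallel, so no backtracking is needed.
        states = {'start'}
        for ch in s:
            nxt = set()
            for st in states:
                if st == 'start':
                    for a, alt in enumerate(ALTS):   # enter an alternative...
                        if alt[0] == ch:
                            nxt.add((a, 1))
                    if ch == FINAL:                  # ...or take the final mark
                        nxt.add('done')
                elif st != 'done':
                    a, i = st
                    if ALTS[a][i] == ch:
                        nxt.add('start' if i + 1 == len(ALTS[a]) else (a, i + 1))
            states = nxt
        return 'done' in states

    assert matches(X)                  # zero repeats, then the final X
    assert matches(Y + X + X + X)      # one repeat (Y X X), then the final X
    assert not matches(X + X + Y)      # a repeat, but the final X is missing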
From kenwhistler at att.net Fri May 15 20:21:02 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 15 May 2015 18:21:02 -0700 Subject: A few emoji per year... (was: Re: Tag characters) In-Reply-To: References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> Message-ID: <55569B7E.7040401@att.net>

And to put Mark's comments in some statistical perspective, in the context of all the media hype: the true "big bang" for emoji in Unicode was Version 6.0, released over 4-1/2 years ago now. *That* was the Unicode release that added hundreds and hundreds of emoji for Japanese carrier interoperability, as well as the regional indicator mechanism for the representation of flag pictographs. But at the time, relatively few people noticed, because no Unicode emoji were on phones yet.

Unicode 7.0, which resulted in the huge media splash about emoji last year, actually only added 103 emoji, and the majority of those were very old news: old-fashioned pictographs for Webdings compatibility. There were only a few high-visibility, emotionally catchy new additions among that set, such as the CHIPMUNK and the you-know-what-I'm-talking-about hand gesture, that convinced people this was a bigger-deal release than it was. But suddenly everything was visible on phones, and that made all the difference for the general public.

Unicode 8.0 is about to be released, and it will have just 41 emoji additions -- among them the 5 emoji modifiers that are already available on phones to address the emoji diversity issue. And the UTC just approved 38 new emoji candidates that will be the likely basis of the emoji additions for Unicode 9.0 next year. Once we get through the Unicode 8.0 and Unicode 9.0 cycles, this process will have settled into a kind of routine -- and it will be apparent to all what the likely scale and scope of future emoji additions *as Unicode characters* will be: a few dozen per year, carefully picked based on a set of criteria now to be set out in the new UTR #51 regarding emoji.

The sky isn't falling here. ;-) The Unicode Consortium has not suddenly transmogrified into the Emoji Consortium. People will get used to the fact that a few dozen new emoji characters get added to the standard every year -- ho hum.
And for folks who can't wait through the two-years-from-proposal-to-implementation cycles of character encoding committees, well... those stickers are out there waiting for you.

--Ken

On 5/15/2015 5:18 PM, Mark Davis ☕️ wrote:
> However, we do have strong short-term pressure to show that we are responsive and responsible in adding emoji. And our adding a reasonable number of emoji per year is not going to stop Line or Skype from adding stickers!

From andrewcwest at gmail.com Sat May 16 03:15:34 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sat, 16 May 2015 09:15:34 +0100 Subject: A few emoji per year... (was: Re: Tag characters) In-Reply-To: <55569B7E.7040401@att.net> References: <23012433.62520.1431623697436.JavaMail.defaultUser@defaultHost> <11890020.82119.1431636019782.JavaMail.defaultUser@defaultHost> <55569B7E.7040401@att.net> Message-ID:

On 16 May 2015 at 02:21, Ken Whistler wrote:
> And for folks who can't wait through the two-years-from-proposal-to-implementation cycles of character encoding committees, well...

... don't worry, the UTC will simply bypass the normal ISO ballot cycle, and fast-track them into the next available version of Unicode.

Andrew

From richard.wordingham at ntlworld.com Sat May 16 09:14:28 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 15:14:28 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID: <20150516151428.550def44@JRWUBU2>

On Sat, 16 May 2015 02:04:55 +0200 Philippe Verdy wrote:
> But do you agree that we still need to match pairs of distinct characters in your example?

The original point I made was that (\u0323\u0302)*, as applied to 'traces' of Unicode strings under canonical equivalence, was only a regular expression if one reinterpreted the *-operator. The key points established in the theory of 'trace monoids', as applied to fully decomposed Unicode strings, are:

1) If a set ('language') A of Unicode strings under canonical equivalence can be recognised by a *finite* state machine and, for each string in A:
a) the string contains a starter, or
b) all characters in the string have the same canonical combining class,
then there is a *finite* state machine that recognises A*, with the normal interpretation as the set of concatenations of zero or more members of A.

2) Every set recognised by a *finite* state machine can be written in the form of a regular expression using optionality, bracketing, alternation, concatenation and Kleene star. Moreover, Kleene star will only be applied to sets satisfying the condition above.

Moreover, the expression could be used to check the string as converted to NFD. That sounds like very good news until you remember that *searching* for the canonical equivalent of U+00F4 in an NFD string needs something like .*(o)[:^ccc = 230:]*(\u0302).* This expression has two capture groups.
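A sketch of that search on an NFD string, with the intervening-marks test written out via unicodedata.combining (the helper name is hypothetical, and starters are treated as blocking, which the bare regex does not say):

    import unicodedata

    def find_o_circumflex_nfd(s: str):
        """Find (position_of_o, position_of_U+0302) pairs matching canonical
        equivalents of U+00F4 in an NFD string: an 'o' followed by U+0302 with
        only non-starter marks of ccc != 230 in between."""
        hits = []
        for i, ch in enumerate(s):
            if ch != 'o':
                continue
            j = i + 1
            while j < len(s) and unicodedata.combining(s[j]) not in (0, 230):
                j += 1
            if j < len(s) and s[j] == '\u0302':
                hits.append((i, j))
        return hits

    # "buộc" in NFD is b u o U+0323 U+0302 c: the circumflex is separated
    # from the 'o' by the ccc=220 dot below, so the match is discontiguous.
    assert find_o_circumflex_nfd(unicodedata.normalize('NFD', 'bu\u1ED9c')) == [(2, 4)]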
A *finite* automaton acting on Unicode traces won't support (\u0323\u0302)*. My preferred solution, if support is required, is to bend the finite automaton to simultaneously consider an unbounded number of repeats of a subexpression. This works for me because I store states as strings and allocate each string from the heap. The amount of memory required is sublinear in the length of the string being searched.

> If you just count the total, it will be wrong with (\u0302\u0302\u0323)* if you transform it into (\u0302|\u0302|\u0323)*, which is fully equivalent to (\u0302|\u0323)*... The analysis cannot be deterministic, or it requires keeping track of all acceptable positions previously seen that could satisfy the condition; as the sequence for (\u0302\u0302\u0323)* can be extremely long, keeping this track for possible rollbacks could be costly.

I don't do roll-backs. I use a non-deterministic finite automaton that is equivalent to a deterministic finite automaton or, confronted with this type of rational expression (it ain't regular for traces!), a non-deterministic, slightly non-finite automaton.

Now, capture groups do destroy the finiteness of the automaton, and it looks like a matter of trade-offs. There is an example on the regular expression page in the ICU user guide, searching AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC for (A+)+B. Roll-backs make this exponentially slow. My code runs through this faster than it can display its progress, which is par for the course. Now, my implementation of capture groups is far from complete. At present, I capture all thousand or so possibilities, as I have no logic to determine what is required. If I set it to capture all occurrences of A+, I can just perceive an increase in run time when I pipe the progress reporting to /dev/null. As I augment the recognition-related state with the capture information, the number of active states is quadratic in the string length, and the logic to maintain the list of states occupied is quadratic in the number of states, so the time to run varies as the fourth power of the string length.

> (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302
>
> Can you still transform it and correctly infer the type of counters you need for the final check (before rollbacks) if you replace it with...
>
> I don't see how you can support this regexp with a DFA; you absolutely need an NFA (and the counters you want to add do not offer any decisive help).

The reinterpretation of this expression as a regular expression for traces substitutes 'concurrent iteration' for Kleene star. Each trace in the bracketed expression that lacks a character with ccc=0 is replaced by its maximal subtraces of each canonical combining class. Under this scheme, (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302 would be interpreted as (\u0302\u0302|\u0323)*(\u0302\u0303)*\u0302.
As I said before, that is unlikely to be what the user means by expressions like these.

I'm not sure what you mean by 'NFA'. Do you mean a 'back-tracking automaton'? To support (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302, I would use my extension of the non-deterministic finite automaton to process it. If you like, that is how I would do the counting - by the number of incomplete matches to \u0323\u0302\u0302. Note that I take characters from the input string in NFD order, along with their (fractional) positions in the input string. I process \u0302\u0302\u0323 as a very simple regex, \u0323\u0302\u0302. The state is simply the position of the next character, plus two states for 'all matched' and 'cannot match'.

Running it with input \u0323\u0302\u0302\u0302, which it did recognise, did show one problem. My engine doesn't notice that, when looking for the first factor, \u0323\u0302\u0302, it is not possible for \u0302 to belong to a subsequent factor. Instead it progresses to a dead-end state where all subsequent input is assumed to be part of another factor. Supporting Unicode properties may make fixing this messy. I had been living with this because these dead-end states are killed on receipt of a starter, and runs of non-starters are normally not very long. No precomposed character decomposes to more than three of them. I saw the need as being for something that runs correctly, rather than for something that runs correctly and fast.

When checking whether a string matches, once I have fixed the problem of dead-end states, there will be, for each state, one capturing group for the last U+0323 encountered and, at most, one capturing group for the last or penultimate U+0302 encountered. While the number of states is unbounded, the number of possible states at any point is uniformly bounded. Searching for the pattern is a bit more complicated, as each U+0323 or U+0302 could be the last such character in a matching subtrace.

Richard.

From richard.wordingham at ntlworld.com Sat May 16 10:02:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 16:02:39 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> Message-ID: <20150516160239.16638123@JRWUBU2>

On Sat, 16 May 2015 02:41:33 +0200 Philippe Verdy wrote:
> With an NFA, the representation is completely different. The regexp
> (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302
> is just transformed into:
> (⊢\u0302⊢\u0302⊢\u0323|⊢\u0302⊢\u0323⊢\u0302|⊢\u0302⊢\u0302⊢\u0323)*(⊢\u0302⊢\u0303|⊢\u0303⊢\u0302)*⊢\u0302⊢
> where I noted with the "tack" (⊢) the 15 relative positions **in this new regexp** where there's a need to check if the input matches a character or character class.

The old regex is a pattern for use with the trace monoid of Unicode strings under canonical equivalence. From its appearance, I presume the new regex is intended for use with strings, and that the third run of codepoints is meant to be ⊢\u0323⊢\u0302⊢\u0302 rather than a repeat of ⊢\u0302⊢\u0302⊢\u0323. There is an annoying error. You appear to assume that U+0302 COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't; they have the same combining class, namely 230. I'm going to assume that 0303 is a typo for 0323.
\u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0323)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0323|?\u0323?\u0302)*?\u0302? This example goes straight to the problem with the recommended way of using string-based regular expression engines. Using NFD throughout works fine if one is working with whole words. If fails if one is working with sequences of combining marks and there is any complexity. > But it's true that the allowed reorderings implied by canonical > equivalences (and those that are NOT allowed because they are > blocked) are really challenging ! They are not challenging at all. Once you have eliminated the precomposed characters and characters with singleton decompositions, you are left with the trace monoid of Unicode strings under canonical equivalence. All you have to remember is that two characters commute if and only if they have different positive canonical combining classes. Richard. From verdy_p at wanadoo.fr Sat May 16 11:29:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 18:29:18 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516160239.16638123@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> Message-ID: 2015-05-16 17:02 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > There is an annoying error. You appear to assume that U+0302 COMBINING > CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but they don't; > they have the same combining class, namely 230. I'm going to assume > that 0303 is a typo for 0323. Not a typo, and I did not made the assumption you suppose because I chose then so that they were effectively using the **same** combining class, so that they do not commute. It was the key fact of my argument that destroys your argumentation. Reread carefully and use the example string I gave and don't assume I wanted to write u0323 instead of u0303. And you'll see that backtracing is necessary for this case (EVEN if you don't care about capture groups but you are only interested in the global capture $0). -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat May 16 12:07:13 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 16 May 2015 11:07:13 -0600 Subject: Tag characters Message-ID: <794493C42D714C3C8A58D2F45AA36663@DougEwell> L2/15-145R says: > On some platforms that support a number of emoji flags, there is > substantial demand to support additional flags for the following: > [...] > Certain supra-national regions, such as Europe (European Union flag) > or the world (e.g. United Nations flag). These can be represented > using UN M49 3-digit codes, for example "150" for Europe or "001" for > World. These are uncomfortable equivalence classes. Not all countries in Europe are members of the European Union, and the concept of "United Nations" is not really the same by definition as "all countries in the world." The remaining UN M.49 code elements that don't have a 3166-1 equivalent seem wholly unsuited for this mechanism (and those that do, don't need it). 
There are no flags for "Middle Africa" or "Latin America and the Caribbean" or "Landlocked developing countries." Some trans-national organizations might _almost_ seem as if they could be shoehorned into an M.49 code element, like identifying 035 "South-Eastern Asia" with the ASEAN flag, but this would be problematic for the same reasons as 150 and 001. Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for "European Union" and "UN" for "United Nations." If these flags are the use cases, why not simply use those alpha-2 code elements, instead of burdening the new mechanism with the 3-digit syntax? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sat May 16 14:28:12 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 16 May 2015 21:28:12 +0200 Subject: Tag characters In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: 2015-05-16 19:07 GMT+02:00 Doug Ewell : > L2/15-145R says: > > On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. >> > > These are uncomfortable equivalence classes. Not all countries in Europe > are members of the European Union But the flag of the European in fact belongs to the Council of Europe that created it 30 years before the European Community adopted it. According to the Coucil of Europe, the flag is appropriate for ALL countries in Europe. In summary the flag does represents *not only* the EU. It is suitable as well for Russia, Belarussia (even if its seat is suspended in the Coucil of Europe), or Kazakhstan and Turkey (even if only a part of these countries is in Europe). > and the concept of "United Nations" is not really the same by definition > as "all countries in the world." > Yes but the UN recognizes a set of territories (not always their government) that covers the whole world (including Antarctica where no government is also recognized, as well as territorial waters of these territories, plus the international waters that the UN protects). Not all countries also are required to become members of the UN (the Holy See/Vatica is not a full member, but it is recognized; same remark for Palestine). So the UN has a competence on the whole world, and all people of the world can legally seek protection from the UN, wherever they live, or even if they have no country to recognize them a nationality). If you want to seek territories where the UN has no authority at all, the nearest ones are on the Moon ! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Sat May 16 15:33:55 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 16 May 2015 21:33:55 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> Message-ID: <20150516213355.7891b4b6@JRWUBU2> On Sat, 16 May 2015 18:29:18 +0200 Philippe Verdy wrote: > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > There is an annoying error. You appear to assume that U+0302 > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but > > they don't; they have the same combining class, namely 230. I'm > > going to assume that 0303 is a typo for 0323. > > > Not a typo, and I did not made the assumption you suppose because I > chose then so that they were effectively using the **same** combining > class, so that they do not commute. In that case you have an even worse problem. Neither the trace nor the string \u0303\u0302\u0302 matches the pattern (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the regular expression (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ 0303|?\0303?\u0302)*?\u0302? You've transformed (\u0302\u0303) into (?\u0302?\0303|?\0303?\u0302), but that is unnecessary and wrong, because U+0302 and U+0303 do not commute. > It was the key fact of my argument that destroys your argumentation. Which argument? Restoring the \u303, the fact that remains that \u0323\u0323\u0302\u0302\u0302\u0302 is canonically equivalent to \u0302\u0302\u0323\u0302\u0323\u0302, which clearly matches the corrected old regex (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302. However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the corrected new regex (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303|?\u0303?\u0302)*?\u0302? Do you claim that this argument is destroyed? If it is irrelevant, why is it irrelevant? It shows that your transform does not solve the original problem of missed matches. > Reread carefully and use the example string I gave and don't assume I > wanted to write u0323 instead of u0303. I'm not at all sure what your example string is. I ran my program to watch its progression with input \u0323\u0323\u0302\u0302, which does not match the pattern, and attach the outputs for your scorn. I have added comments started by #. # NDE = new dead end - I could tweak the program so this state is not entered. # NDE! = new dead end that might not be easy to avoid. # ODE = old dead end - derived from a state already labelled ODE or NDE. # ODE! = old dead end - derived from a state already labelled ODE! or NDE!. Here are the run outputs, with blank lines added to assist formatting. $ ./regex -b '(\u0302\u0302\u0323)*(\u0302\u0303)*\u0302' '\u0323\u0302\u0323\u0302' # ignore line wrapping above. Examining /home/richard/unicode/UCD/7.00/PropertyAliases.txt. Examining /home/richard/unicode/UCD/7.00/PropertyValueAliases.txt. Examining /home/richard/unicode/UCD/7.00/SpecialCasing.txt. Examining /home/richard/unicode/UCD/7.00/Scripts.txt. Examining /home/richard/unicode/UCD/7.00/PropList.txt. 
Simple Unicode regex "\u0323\u0302\u0302" Simple ASCII regex "" # I construct A* = (|A+) Simple Unicode regex "\u0302\u0303" Simple Unicode regex "\u0302" Initial states: 0) LLLL0 # The states are named according to a hierarchy of regexes. # LL = regex (\u0302\u0302\u0323)* # LLL = regex (\u0302\u0302\u0323)+ # LLLL = regex \u0302\u0302\u0323. # This is implemented as 'Simple Unicode regex "\u0323\u0302\u0302"'. # 0 means about to compare with character at offset 0, i.e. 0 1) LLRM # LLR = Empty string regex. # M = matched 2) LRLL0 # LR = regex (\u0302\u0303)* # LRL = regex (\u0302\u0303)+ # LRLL = regex \u0302\u0303 3) LRRM # LRR = Empty string regex. 4) R0 # R = regex \u0302 =0323=00:06:= # Get U+0323 from whole (=0) at byte 0 of argument LLLL0 => LLLL2 LLLL0 => LLLN001220:0:L2 # NDE! =0323=012:018:= # Note that string is input in NFD order. LLLL2 => LLLN001220:2:L2 # Now running LLLL and LLLR, whose states have relative names 2 and L2. # LLLR is a clone of LLL. # This recursion enables the recognition of unrecognisable Kleene # stars. It makes the automaton non-finite. # 001 is length in hex of name of relative state of LLLL # 220 means non-starters of ccc <= 220 will not be fed to LLLL LLLN001220:0:L2 => LLLN001220:0:N001220:2:L2 # ODE! =0302=06:012:= LLLN001220:2:L2 => LLLN001220:4:L2 LLLN001220:2:L2 => LLLN001230:2:L4 # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LL2 # NDE # L = regex (\u0302\u0302\u0323)*(\u0302\u0303)* # NDE LLLN001220:2:L2 => LN00D230:LN001220:2:L2:LN001230:0:L2 # NDE LLLN001220:2:L2 => N00E230:LLN001220:2:L2:M # NDE LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001220:4:L2 # ODE! LLLN001220:0:N001220:2:L2 => LLLN001230:0:N001230:2:L4 # ODE! LLLN001220:0:N001220:2:L2 => LN017230:LN001220:0:N001220:2:L2:LL2 # ODE! LLLN001220:0:N001220:2:L2 => # Line-break is email artefact. LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 # ODE! LLLN001220:0:N001220:2:L2 => N018230:LLN001220:0:N001220:2:L2:M # ODE! =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Redundant - should purge somehow. LLLN001220:4:L2 => LLLL2 # Regex LLLL 'recognised' - rename LLLRL as LLLL. LLLN001220:4:L2 => LLLN001230:4:L4 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LL2 # NDE LLLN001220:4:L2 => LN00D230:LN001220:4:L2:LN001230:0:L2 # NDE LLLN001220:4:L2 => N00E230:LLN001220:4:L2:M # NDE LLLN001230:2:L4 => LLLN001230:2:LM # ODE LLLN001230:2:L4 => LLLN001230:2:L0 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LL2 # ODE LLLN001230:2:L4 => LN00D230:LN001230:2:L4:LN001230:0:L2 # ODE LLLN001230:2:L4 => N00E230:LLN001230:2:L4:M # ODE LN00D230:LN001220:2:L2:LL2 => LN00D230:LN001220:2:L2:LN001230:2:L2 # ODE LN00D230:LN001220:2:L2:LL2 => N019230:N00D230:LN001220:2:L2:LL2:M # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is e-mail artefact LN00D230:LN001220:2:L2:LN001230:0:N001230:2:L2 # ODE LN00D230:LN001220:2:L2:LN001230:0:L2 => # Line-break is email artefact N023230:N00D230:LN001220:2:L2:LN001230:0:L2:M # ODE LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001220:M:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => LLLN001230:0:N001230:4:L4 # ODE! LLLN001230:0:N001220:4:L2 => LN017230:LN001230:0:N001220:4:L2:LL2 # ODE! LLLN001230:0:N001220:4:L2 => # Line-break is e-mail artefact. LN017230:LN001230:0:N001220:4:L2:LN001230:0:L2 # ODE! LLLN001230:0:N001220:4:L2 => N018230:LLN001230:0:N001220:4:L2:M # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:LM # ODE! LLLN001230:0:N001230:2:L4 => LLLN001230:0:N001230:2:L0 # ODE! 
LLLN001230:0:N001230:2:L4 => LN017230:LN001230:0:N001230:2:L4:LL2 # ODE! LLLN001230:0:N001230:2:L4 => # Line-break is e-mail artefact LN017230:LN001230:0:N001230:2:L4:LN001230:0:L2 # ODE! LLLN001230:0:N001230:2:L4 => N018230:LLN001230:0:N001230:2:L4:M # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => LN017230:LN001220:0:N001220:2:L2:LN001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LL2 => N023230:N017230:LN001220:0:N001220:2:L2:LL2:M # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => LN017230:LN001220:0:N001220:2:L2:LN001230:0:N001230:2:L2 # ODE! LN017230:LN001220:0:N001220:2:L2:LN001230:0:L2 => N02D230:N017230:LN001220:0:N001220:2:L2:LN001230:0:L2:M # ODE! End marker is at 024:OVF > And you'll see that backtracing is necessary for this case (EVEN if > you don't care about capture groups but you are only interested in > the global capture $0). What I see is the desirability of some optimisation, but no problem in principle. Now I might see something different with your intended example - but until I see it I think my examination would be overwhelmed by dead-end state propagations. If you are making the point that a backtracking automaton might need to backtrack, then I won't dispute that point. Richard. From srl at icu-project.org Sat May 16 15:39:17 2015 From: srl at icu-project.org (Steven R. Loomis) Date: Sat, 16 May 2015 13:39:17 -0700 Subject: Tag characters In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> See the meeting minutes and the actual utr51. Enviado desde nuestro iPhone. > El may 16, 2015, a las 10:07 AM, Doug Ewell escribi?: > > L2/15-145R says: > >> On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. > > These are uncomfortable equivalence classes. Not all countries in Europe are members of the European Union, and the concept of "United Nations" is not really the same by definition as "all countries in the world." > > The remaining UN M.49 code elements that don't have a 3166-1 equivalent seem wholly unsuited for this mechanism (and those that do, don't need it). There are no flags for "Middle Africa" or "Latin America and the Caribbean" or "Landlocked developing countries." > > Some trans-national organizations might _almost_ seem as if they could be shoehorned into an M.49 code element, like identifying 035 "South-Eastern Asia" with the ASEAN flag, but this would be problematic for the same reasons as 150 and 001. > > Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for "European Union" and "UN" for "United Nations." If these flags are the use cases, why not simply use those alpha-2 code elements, instead of burdening the new mechanism with the 3-digit syntax? > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Sat May 16 16:01:24 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 16 May 2015 15:01:24 -0600 Subject: Tag characters In-Reply-To: <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> <02F75A52-3E46-449D-8144-D63A087E8383@icu-project.org> Message-ID: <9E6D62BF9816458A83577364CB380E54@DougEwell> Steven R. Loomis wrote: > See the meeting minutes and the actual utr51. Sorry, I didn't find anything dealing with numeric codes in Section E.1.3 of the meeting minutes, and the copy of UTR #51 at unicode.org doesn't appear to have been updated for anything beyond the existing RIS. What specifically should I be looking for? -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun May 17 09:33:15 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 17 May 2015 16:33:15 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516213355.7891b4b6@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: 2015-05-16 22:33 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Sat, 16 May 2015 18:29:18 +0200 > Philippe Verdy wrote: > > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > > richard.wordingham at ntlworld.com>: > > > > > There is an annoying error. You appear to assume that U+0302 > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, but > > > they don't; they have the same combining class, namely 230. I'm > > > going to assume that 0303 is a typo for 0323. > > > > > > Not a typo, and I did not made the assumption you suppose because I > > chose then so that they were effectively using the **same** combining > > class, so that they do not commute. > > In that case you have an even worse problem. Neither the trace nor the > string \u0303\u0302\u0302 matches the pattern > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match the > regular expression > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ > 0303|?\0303?\u0302)*?\u0302? > > You've transformed (\u0302\u0303) into (?\u0302?\0303|?\0303?\u0302), > but that is unnecessary and wrong, because U+0302 and U+0303 do not > commute. Oh right! Thanks for pointing, it was intended you can read it as. (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\0303)*?\u0302? But my argument remains because of the presence of \0302 in the second subregexp (which additionally is a separate capture, but here I'm not concentrating on the impact in numbered captures, but only on the global capture aka $0) > > It was the key fact of my argument that destroys your argumentation. > > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the > corrected new regex > > (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? > > Do you claim that this argument is destroyed? If it is irrelevant, why > is it irrelevant? It shows that your transform does not solve the > original problem of missed matches. > Why doesn't it solve it? Note that the notation with tacks is just the first transform. Of course you can optimize it by factorizing the common prefixes in each alternative. 
In the following the 1st and 4th tacks have some common followers in their lists of characters or character classes they expect (for advancing to the next tack), but the 2nd and 5th tack expect different followers. (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? OK I understand the need for "counting" characters present in regexps when they are sharing the same combining classes, but counting does not work correctly, in fact you have to keep counters for each distinct combining character with non-zero combining class for how they contribute to the total length of the "star" group. They also don't contribute necessarily to the same total when the regexp specifies them multiple times (a simple measurment of the total length of the group is evidently not enough, all counters must be exact multiples of the number of occurences (counter[c]) of each combining character (c) in the original untransformed content of each alternative in the star group, and the second factor n of this multiple must be identical for all counters The total length is in pseudo-code: { sum=0; for(c:v in counter) sum += v; return sum; } but it has no use by itself. If the number of (non-repeated) original untransformed alternative are in mustoccur[] the check to perform is this pseudo-code: var n = null; foreach(c:m in mustoccur) { checkthat(counter[c] % m == 0); if (n == null) n = counter[c] / m; else checkthat(counter[c] / m == n); } > > Reread carefully and use the example string I gave and don't assume I > > wanted to write u0323 instead of u0303. > > I'm not at all sure what your example string is My example was the original regexp without the notation tacks: (\u0302\u0302\u0323)*(\u0302\u0303)*\u0302 It exposes some of the critical difficulties, first for returning correct global matches (but then also for for captures, and the effect of "ungreedy" options of Perl (and PCRE working in Perl-compatible mode or in extended mode) and most regexp engines (whose default behavior is "greedy"): the "ungreedy" option causes significant slowdowns with additional rollbacks or more work to maintain an efficient backtracing directly in the current state of the automata (if you attempt to use deterministic rules everywhere it is possible). But we know that it's not possible for all regexps in general, otherwise regexp engines would just be simple LR(n) parsers with bounded n, or even simpler LALR parsers like Bison/Yacc but without their backtracing support for "shift"/"reduce" ambiguities, these LALR parsers are also greedy by default and resolve ambiguities by "shifting" first, leaving the code determine what to do when after shiting there's a parse error caused by unexpected values, but LALR parsers do not have a clean way to handle the correct rollback to the last ambiguous shift/reduce state with a special match rule, and they do not support trying "reduce" first to get the "ungreedy" behavior as they cannot return to this state to choose the "shift" alternative). That's why since long lexers are written with regexps, and syntaxic scanners written preferably with LALR parsers which cannot work alone without a separate lexer. But using LALR parsers does not work with common languages like Fortran; it works for parsing language like C/C++ because they are specified so that shift/reduce ambiguities are resolved using "shift" always (i.e. the greedy behavior). 
Very few parser generator support both working mode (except the excellent PCCS that I have used since the early 1990's when it was still not rewritten in Java and was a student project in Purdue University, and that combines all the advantages of regexps and LR parsers, with very clean control of backtracing, it also supports layered parsing with multiple local parsers if needed, even without wrining any piece of output code, you can describe the full syntaxic and lexical rules of almost all languages in a single specification). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun May 17 09:45:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 17 May 2015 16:45:18 +0200 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: <20150516213355.7891b4b6@JRWUBU2> References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: 2015-05-16 22:33 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > I'm not at all sure what your example string is. I ran my program to > watch its progression with input \u0323\u0323\u0302\u0302, which does > not match the pattern, and attach the outputs for your scorn. I have > added comments started by #. > Sorry for not commenting it, this is the internal tricks and outputs of your program, and your added comments does not allow me to interpret what all this means, i.e. the exact role of the notations with sequences or "L" or "R" or "N", and what the "=>" notation means (I suppose this is noting an advance rule and that the left-hand side is the state before, the right-hand-side is the state after, but I don't see where is the condition (the character or character class to match, or an error condition). You've only "explained" partly the NDE and ODE comments and the "!" when it is appended. Is that really what your regexp engine outputs as its internally generated parser tables (only "friendly" serialized as a "readable" text) ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun May 17 11:52:56 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 17 May 2015 17:52:56 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: <20150517175256.1bc136f4@JRWUBU2> On Sun, 17 May 2015 16:45:18 +0200 Philippe Verdy wrote: > 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > I'm not at all sure what your example string is. I ran my program > > to watch its progression with input \u0323\u0323\u0302\u0302, which > > does not match the pattern, and attach the outputs for your scorn. > > I have added comments started by #. > > > > Sorry for not commenting it, this is the internal tricks and outputs > of your program, and your added comments does not allow me to > interpret what all this means, i.e. 
the exact role of the notations > with sequences or "L" or "R" or "N", and what the "=>" notation means > (I suppose this is noting an advance rule and that the left-hand side > is the state before, the right-hand-side is the state after, but I > don't see where is the condition (the character or character class to > match, or an error condition). You've only "explained" partly the NDE > and ODE comments and the "!" when it is appended. 'ODE' and 'NDE' mean the transitions should not occur when I finish my current set of edits. The exclamation mark means the optimisation I first though of wouldn't eliminate it. > Is that really what your regexp engine outputs as its internally > generated parser tables (only "friendly" serialized as a "readable" > text) ? When running the regex, I really do hold the states in forms like LLLL2 and LLLN001220:2:L2. (The colons are unnecessary; I included them for readability.) It's designed for proof of principle, rather than high speed. There is also a tree corresponding to the analysis of the regex; the nodes record how the lower level regexes are combined. The branching nodes in the example are for sequences. In the simplest case, a matching expression will, in some canonically equivalent form, be the concatenation of a string matching the left hand node and a string matching the right mode. For iterations ('*' and '+', though I treat '+' as basic), the tree does not need a corresponding right branch, as all the information about the regex is held in the left branch. An 'L' means that the input sequence is proceeding through the left branch. An 'R' means that it has completed its passage through the left branch, and is now proceeding through the right branch. All this would be applicable if I were ignoring canonical equivalence. An 'N' (for 'normalisation') means that parsing is passing through the region where the normalisation has interleaved the left and right hand component strings. As I consider each fresh character, I have to consider its canonical combining class. The string for the state records what ccc is blocked from the left hand string. As I take the characters from the input string in NFD order, I only need to remember the highest blocked ccc. The first character I receive with a lower ccc will be a starter, at which point I will only be progressing the right hand component string. For the state in the parent regex, I record the 'N' (as opposed to an 'L' or 'R'), the highest blocked ccc, the state in the left-hand regex and the state in the right-hand regex. The input characters are recorded in the form ==::= The character location is recorded in the from . The part is a single digit. 0 means whole character, 1 means first character in character decomposing to multiple characters, 2 means second and so on. Thus, as the first U+0302 is stored as 6-character escape code '\u0302', I record the position as: =0302=06:012:= I then record the consequential transition from each state to another state. As the basic structure is that of a non-deterministic finite automaton (as at https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton ), there may be no or many transitions from a particular state. There are no error conditions as such. As I record each transition, I record whether there is now a match to the whole regex and whether the state is a duplicate. Detecting duplicates is part of the key to the classical NFA's better resistance to 'pathological inputs' compared to back-tracking algorithms. 
There are two main state numberings for the bottom level regexes. The main bottom level regex is a simple regex with no alternates or groupings. The engine propagates the simple regex as a string and records the state as the byte offset of the next character to compare against. The regex is stored in Latin-1 or UTF-8. (Latin-1 is not suitable for precomposed characters.) Thus when the first character input is U+0323 and is compared against the regex \u0323\u0302\u0302, the state for the regex changes from 0 to 2, as U+0323 occupies 2 bytes in UTF-8. This is recorded as an overall state transition 'LLLL0 => LLLL2'. When all characters in the string have been matched, the state becomes 'M'. The simple regexes have one-to-many state progressions to handle iterations and optionality ('*', '+' and '?'). The second system is for Unicode properties. The state records the composition of precomposed characters by using the accumulated codepoint as the state. However, the state also includes a success flag for ease of composing the acceptance or otherwise of the overall state and to determine transitions from one regex to the next. My program does not calculate what the characters are for a state transition to occur. Instead, it calculates what transitions occur in response to an input character. Richard. From richard.wordingham at ntlworld.com Sun May 17 19:03:02 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 01:03:02 +0100 Subject: Regular Expressions and Canonical Equivalence In-Reply-To: References: <20150514013129.0b68eb41@JRWUBU2> <20150514085959.433e49af@JRWUBU2> <20150514191324.1e455c57@JRWUBU2> <20150515081003.1984d0c4@JRWUBU2> <20150515225703.20771426@JRWUBU2> <20150515235422.3e347dc3@JRWUBU2> <20150516160239.16638123@JRWUBU2> <20150516213355.7891b4b6@JRWUBU2> Message-ID: <20150518010302.79f2b871@JRWUBU2> On Sun, 17 May 2015 16:33:15 +0200 Philippe Verdy wrote: > 2015-05-16 22:33 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > On Sat, 16 May 2015 18:29:18 +0200 > > Philippe Verdy wrote: > > > > > 2015-05-16 17:02 GMT+02:00 Richard Wordingham < > > > richard.wordingham at ntlworld.com>: > > > > > > > There is an annoying error. You appear to assume that U+0302 > > > > COMBINING CIRCUMFLEX ACCENT and U+0303 COMBINING TILDE commute, > > > > but they don't; they have the same combining class, namely > > > > 230. I'm going to assume that 0303 is a typo for 0323. > > > > > > > > > Not a typo, and I did not made the assumption you suppose because > > > I chose then so that they were effectively using the **same** > > > combining class, so that they do not commute. > > > > In that case you have an even worse problem. Neither the trace nor > > the string \u0303\u0302\u0302 matches the pattern > > (\u0302\u0302\0323)*(\u0302\0303)*\u0302, but the string does match > > the regular expression > > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\ > > 0303|?\0303?\u0302)*?\u0302? > > > > You've transformed (\u0302\u0303) into > > (?\u0302?\0303|?\0303?\u0302), but that is unnecessary and wrong, > > because U+0302 and U+0303 do not commute. > > > Oh right! Thanks for pointing, it was intended you can read it as. > > (?\u0302?\u0302?\0323|?\u0302?\0323?\u0302|?\u0302?\u0302?\0323)*(?\u0302?\0303)*?\u0302? 
> > But my argument remains because of the presence of \0302 in the second > subregexp (which additionally is a separate capture, but here I'm not > concentrating on the impact in numbered captures, but only on the > global capture aka $0) > > > > > It was the key fact of my argument that destroys your > > > argumentation. > > > > However, \u0323\u0323\u0302\u0302\u0302\u0302 does not match the > > corrected new regex > > > > (?\u0302?\u0302?\u0323|?\u0302?\0323?\u0302|?\u0323?\u0302?\u0302)*(?\u0302?\u0303)*?\u0302? > > > > Do you claim that this argument is destroyed? If it is irrelevant, > > why is it irrelevant? It shows that your transform does not solve > > the original problem of missed matches. > > > > Why doesn't it solve it? Sorry, my example wasn't quite right. It should have two combining dots below and five circumflexes, not four as I wrote it. I will first explain how my NDnear-FA handles it - I have now removed the generation of the dead end states. Initial states: 0) LLLL0 # Starting the \u0302\u0302\u0323 factor, # implemented as \u0323\u032\u0320 1) LLRM # Completed the zero trip alternative to (\u0302\u0302\u0323)+ # Not actually useful. 2) LRLL0 # Starting the \u0302\u0303 factor 3) LRRM # Completed the zero trip alternative to (\u0302\u0303)+ 4) R0 # Starting the \u0302 factor =0323=00:06:= LLLL0 => LLLL2 # \u0323\u0302\u0302 factor progressed as far as \u0323 =0323=06:012:= LLLL2 => LLLN001220:2:L2 # Progressing 2 successive repeats of factor. # Both have progressed as far as \u0323. # Finiteness would restrict me to, say, 3 repeats # in progress. # The states of the finite DFA are a cross product of 3 copies of # the DFAs for \u0323\u0302\u0302 and 2 copies of the set of relevant # ccc values. By no means all of these states are used. # In the Kleene stars of the regular expression guaranteed by # recognisability, 3 copies caters for the worst case, xyz, where x # has a starter and ends in a non-starter, y consists of non-starters # with the same canonical combining class, and z starts with # non-starter and contains a starter, e.g. # x = \u0f40\u0f74, y = \u0f7a\u0f7a\u0f7a, z = \u0f71\u0f42 # to_NFD(xyz) = \u0f40\u0f71\u0f7a\u0f7a\u0f7a\u0f74\u0f42 =0302=012:018:= LLLN001220:2:L2 => LLLN001220:4:L2 # Still progressing two factors # First has progressed to \u0323\u0302 and second to # \u0323. The other way round has been pruned by the # automated observation that if \u0302 is blocked from # first factor, the factor cannot be completed. =0302=018:024:= LLLN001220:4:L2 => LLLN001220:M:L2 # Completed the first factor LLLN001220:4:L2 => LLLL2 # As first factor is complete, remove it from # consideration and relabel second factor as # first. =0302=024:030:= LLLL2 => LLLL4 # \u0323\0302\u0302 completed as far as \u0323\u0302 =0302=030:036:= LLLL4 => LLLLM # \u0323\u0302\u0302 is complete. LLLL4 => LRLL0 # So start \u0302\u0303 factor. LLLL4 => LRRM # Alternatively, completed the zero trip option of # (\u0302\u0303)* LLLL4 => R0 # Or, we have progressed as far as the final \u0302 LLLL4 => LLLL0 # Or, start another \u0323\u0302\u0302 =0302=036:042:= LRLL0 => LRLL2 # Got as far as \u0302 in \u0302\u0303 R0 => RM (match) # Or completed the final \u0302. End marker is at 042:OVF Could you please talk me through how your system recognises the string \u0323\u0323\u0302\u0302\u0302\u0302\u0302 as matching the regex. I can't work out how it is supposed to work from your description. Richard. 
From abdo.alrhman.aiman at gmail.com Mon May 18 06:49:27 2015 From: abdo.alrhman.aiman at gmail.com (=?UTF-8?B?2LnYqNivINin2YTYsdit2YXYp9mGINij2YrZhdmG?=) Date: Mon, 18 May 2015 14:49:27 +0300 Subject: Arabic diacritics In-Reply-To: References: Message-ID: many thanks, this exactly the needed information :) respectfully 2015-05-15 19:09 GMT+03:00 Denis Jacquerye : > You should use ARABIC SHADDA U+0651 in all positions. The presentation > forms (isolated, medial, final forms) are for compatibility with legacy > systems. > See what is said in http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf > about the Arabic Presentation Forms-B. > > Cheers, > > > On Fri, 15 May 2015 at 15:53 ??? ??????? ???? < > abdo.alrhman.aiman at gmail.com> wrote: > >> hi, >> >> regarding the Arabic diacritics. e.g. for the Shadda, we >> have: >> >> 1. The form that people type: >> http://unicode-table.com/en/0651/ >> >> 2. An Isolated form. It should be the same, but looks different in the >> Unicode table, which is confusing me now. >> http://unicode-table.com/en/FE7C/ >> >> 3. A medial form: >> http://unicode-table.com/en/FE7D/ >> >> When do I use 1/2, and when do I use 3? >> >> some diacritics has e.g. isolated and medial forms. Some have >> only one of these forms, some have both. So, where does each of them go? >> >> respectfully >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon May 18 13:19:01 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 11:19:01 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> L2/15-145R says: > In CLDR 28, LDML will define a unicode_subdivision_subtag which also > provides validity criteria for the codes used for regional > subdivisions (see CLDR ticket #8423). When representing regional > subdivisions using ISO 3166-2 codes, only those codes that are valid > for the LDML unicode_subdivision_subtag should be used. The preliminary subdivisions.xml file includes entries like this: (GB-SCT) for the Scottish flag and <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) for the North Lanarkshire council area flag -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From markus.icu at gmail.com Mon May 18 13:28:18 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 18 May 2015 11:28:18 -0700 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: On Mon, May 18, 2015 at 11:19 AM, Doug Ewell wrote: > Is the new mechanism intended to allow flag tags that include either > "subtype" values or "contains" values? As far as I can tell from your quotes, CLDR will say what's valid (plus containment info), and Unicode permits you to show a flag for any valid tag. North Lanarkshire seems perfectly fine. I am curious to see if the redundant hyphen will be part of the syntax. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Mon May 18 13:35:45 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 19:35:45 +0100 Subject: Regexes, Canonical Equivalence and Backtracking of Input Message-ID: <20150518193545.51cb95b8@JRWUBU2> Philippe and I have got bogged down in a long discussion of how to parse traces of Unicode strings under canonical equivalence against non-regular Kleene star of regular expressions. Fortunately, such expressions can be expected to have very little use. A seemingly simple example is the regex \u0f73* i.e. any number of occurrences of U+0F73 TIBETAN VOWEL SIGN II, and not \u0f71\u0f72*. An example of a string matching under canonical equivalence is 0F71 0F71 0F72 0F72. I believe we both thought that characters would arrive from the trace in a deterministic order. Now, many regular expression engines back-track their parsing of the input string (no-one has reported working with input traces). A possibly useful trick would be for characters to be taken from the input file in accordance with the matching to the pattern, with input also back-tracked if matching fails. The notion of next character would depend on the state of the parsing algorithm. In the example above, the engine would just take the input in the order 0F71 0F72 0F71 0F72. Match found, job done. One advantage of this scheme is that there would be no need for adjustments to deal with the interleaving of adjacent matches to successive subexpressions. There would be no nagging worry that one's rational expression was not a regular expression when applied to traces. Any theoreticians around may be wondering how this magic is achieved. The simple answer is that the non-finiteness has been transferred to: (1) the back-tracking through parse options; and (2) the algorithm to walk through the character sequencing options. The algorithm itself should be tractable - Mark Davis has published an algorithm to generate all strings canonically equivalent to a Unicode string, and what we need might not be so complex. I offer this thought up as it seems that, for a regex engine working on traces with deterministic input, the byte code for a regex concatenation AB or iteration A* is much more complicated than the code for the subregexes A and B. I have a worry that the length of the compiled code might even be exponential with the length of the regex. (I may be wrong - there might be a limit to what one can do for worst case complexity of the interleaving.) Choosing the input to match the regex would remove this problem. Richard. From andrewcwest at gmail.com Mon May 18 13:37:06 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 18 May 2015 19:37:06 +0100 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: On 18 May 2015 at 19:19, Doug Ewell wrote: > > Is the new mechanism intended to allow flag tags that include either > "subtype" values or "contains" values? For example: That is my understanding. 
> <1F3F3 E0047 E0042 E002D E0053 E0043 E0054> (GB-SCT) > for the Scottish flag > > and > > <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) > for the North Lanarkshire council area flag I don't believe that North Lanarkshire has an associated flag, which I think is the case for most UK counties and councils (Cornwall, Devon and Dorset all have flags, but they may be the exceptions). In fact not all of the four nations comprising the UK have a flag -- for political reasons there is no official flag for Northern Ireland, so I do not know what an implementation would display for <1F3F3 E0047 E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag emblazoned with "GB-NIR". Andrew From verdy_p at wanadoo.fr Mon May 18 13:47:19 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 20:47:19 +0200 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: The hyphen is not redundant in ISO 3166 that defines primary codes with variable length (even if ISO 3166 part 1 for now only use two-letter codes). Sometime in a future, two letters will not be enough even in ISO 3166-1, if countries continue to split/merge (this does not happen frequently but is occurs every few years; and it will not be possible to reuse old codes that are maintained for a long period). May be then we'll have ISO 3166-1 codes using digits (such as "A1" or "1A"), but this will cause some problems to map them to IETF ccTLD codes (within the DNS root registry). As well the UN M.49 numeric codes will get full if it continues with its current allocation scheme (using ranges of numbers by continental regions). Or the other solution will be to extend the set of allowed letters. 2015-05-18 20:28 GMT+02:00 Markus Scherer : > On Mon, May 18, 2015 at 11:19 AM, Doug Ewell wrote: > >> Is the new mechanism intended to allow flag tags that include either >> "subtype" values or "contains" values? > > > As far as I can tell from your quotes, CLDR will say what's valid (plus > containment info), and Unicode permits you to show a flag for any valid tag. > North Lanarkshire seems perfectly fine. > > I am curious to see if the redundant hyphen will be part of the syntax. > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon May 18 14:05:49 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 21:05:49 +0200 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: <20150518193545.51cb95b8@JRWUBU2> References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: 2015-05-18 20:35 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > The algorithm itself should be tractable - Mark Davis has published > an algorithm to generate all strings canonically equivalent to a > Unicode string, and what we need might not be so complex. Even this algorithm from Mark Davis will fail in this case: - You can use it easily to transform a regexp containing (\u0F73) into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72) - But this leaves the same problem for unbounded repetititions with the "+" or "*" or "{m,}" operators. 
- However you can use it for bounded repetitions with "{m,n}", provided that "n" is not too large because the total number of expendaned alternatives (without repetitions) explodes exponentially with a power proportional to "n" (the base of the exponent depends on the basic non-repeated string and the number of canonical equivalents it has. Now all the problem is how to do the backtracking, and if it works, and how to expose the matched captures (which will still be discontiguous, including $0) and then how you can perform a safe find&replace operation: it is hard to specify the replacement with simple "$n" placeholders, you need more complex placeholders for handling discontiguous matches: $n has to become not just a string, but an object whose default "tostring" property is the exact content of the match, but other properties are needed to expose the interleaving characters, or some context before and after the match (notably when these contexts contain combining characters that are NOT blocked by the match itself. Backtracing is an internal thing before even handling matches, they occur where there is still NO match to return, even if the regexp engine offers a way to use a callback instead of a basic replacement string containing "$n" placeholders, so this callback would not be called. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 18 14:33:37 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 20:33:37 +0100 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: References: <20150518111901.665a7a7059d7ee80bb4d670165c8327d.7866019baa.wbe@email03.secureserver.net> Message-ID: <20150518203337.4949e7cc@JRWUBU2> On Mon, 18 May 2015 19:37:06 +0100 Andrew West wrote: > > <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK) > > for the North Lanarkshire council area flag > > I don't believe that North Lanarkshire has an associated flag, which I > think is the case for most UK counties and councils (Cornwall, Devon > and Dorset all have flags, but they may be the exceptions). In fact > not all of the four nations comprising the UK have a flag -- for > political reasons there is no official flag for Northern Ireland, so I > do not know what an implementation would display for <1F3F3 E0047 > E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag > emblazoned with "GB-NIR". As the Ulster Banner is still in use, and still does unofficially represent Northern Ireland, perhaps it should have its own codepoint. I'm not sure of the strength of the argument for St Patrick's Cross. Perhaps it too should have its own codepoint, especially if it is evolving from being a flag of Ireland (apparently not used by the Irish rugby union team) to a flag of Northern Ireland. Richard. From eliz at gnu.org Mon May 18 14:40:21 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 18 May 2015 22:40:21 +0300 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: <20150518193545.51cb95b8@JRWUBU2> References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: <83mw11ekt6.fsf@gnu.org> > Date: Mon, 18 May 2015 19:35:45 +0100 > From: Richard Wordingham > > Mark Davis has published an algorithm to generate all strings > canonically equivalent to a Unicode string Where can I find the description of that algorithm? 
From doug at ewellic.org Mon May 18 15:10:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 13:10:38 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net> Markus Scherer wrote: > As far as I can tell from your quotes, CLDR will say what's valid > (plus containment info), and Unicode permits you to show a flag for > any valid tag. North Lanarkshire seems perfectly fine. I'm under the impression that this will be a standard Unicode mechanism, defined in principle by TUS and in detail by the upcoming revision of UTR #51, with data (but no additional rules) supplied by CLDR. > I am curious to see if the redundant hyphen will be part of the > syntax. Like Philippe, I don't believe the hyphen is "redundant." ISO 3166-2 requires it (Section 5.2), and the syntax diagram at the end of L2/15-145R shows it: B ((TL{2} (TH (TL|TD){3})?) | (TD{3})) where TH is TAG HYPHEN-MINUS. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Mon May 18 15:14:32 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 18 May 2015 13:14:32 -0700 Subject: Flag tags with U+1F3F3 and subtypes Message-ID: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> I know I'll regret this... Philippe Verdy wrote: > Sometime in a future, two letters will not be enough even in ISO > 3166-1, if countries continue to split/merge (this does not happen > frequently but is occurs every few years; and it will not be possible > to reuse old codes that are maintained for a long period). ISO 3166-1 already defines alpha-3 and numeric code elements, as well as alpha-2. ISO 3166/MA has added approximately one code element per year on average since the breakup of the Soviet Union. There are approximately 336 unassigned alpha-2 code elements, and if any of the assigned ones is withdrawn, it can be recycled in 50 years. > May be then we'll have ISO 3166-1 codes using digits (such as "A1" or > "1A"), but this will cause some problems to map them to IETF ccTLD > codes (within the DNS root registry). Adapting to this challenge, if and when it arises, should be child's play for the DNS, which has recently introduced TLDs like ".???????????" (or ".xn--clchc0ea0b2g2a9gcd" if one prefers). > As well the UN M.49 numeric codes will get full if it continues with > its current allocation scheme (using ranges of numbers by continental > regions). Or the other solution will be to extend the set of allowed > letters. UN M.49 numeric code elements (equivalent to ISO 3166-1) are assigned alphabetically by English country name, or as close as possible, with some exceptions related to historical names. There are no allocations by geographical region. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Mon May 18 15:26:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 18 May 2015 22:26:43 +0200 Subject: Flag tags with U+1F3F3 and subtypes In-Reply-To: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> References: <20150518131432.665a7a7059d7ee80bb4d670165c8327d.e4910e849c.wbe@email03.secureserver.net> Message-ID: 2015-05-18 22:14 GMT+02:00 Doug Ewell : > I know I'll regret this... 
> You should not > > Philippe Verdy wrote: > > > Sometime in a future, two letters will not be enough even in ISO > > 3166-1, if countries continue to split/merge (this does not happen > > frequently but is occurs every few years; and it will not be possible > > to reuse old codes that are maintained for a long period). > > ISO 3166-1 already defines alpha-3 and numeric code elements, as well as > alpha-2. > But how to work with the 2 letters limitation when the world wants more stability in codes (this was an important reason why ISO 639 was not fully integrated in IETF tags, and why the IETF tags have chosen the stability by keeping also the codes that hbave been deleted in ISO 639, but only deprecated in IETF language tags (BCP47). We've already seen the famous reuse before 50 years (do you remember when CS was reassigned just a few months after it was discarded after an initial introduction for some months in Serbia-Montenegro?) ISO coding standard are known to be unstable. This would also be true of the UCS if Unicode did not push its stability pact with ISO! But now let's remembers that parts of ISO 3166 are also included (not fully) in BCP47 tags that require the stability. IT will prohibit reassignments by ISO (or if this happens, this will break BCP47 and et IETF will reject the change and will use another subtag if needed. So country codes cannot be reassigned (and we can expect many more merges/splits or changes of regimes in the many troubled areas of the world. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Mon May 18 15:32:02 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 18 May 2015 21:32:02 +0100 Subject: Regexes, Canonical Equivalence and Backtracking of Input In-Reply-To: References: <20150518193545.51cb95b8@JRWUBU2> Message-ID: <20150518213202.19ef7cd2@JRWUBU2> On Mon, 18 May 2015 21:05:49 +0200 Philippe Verdy wrote: > 2015-05-18 20:35 GMT+02:00 Richard Wordingham < > richard.wordingham at ntlworld.com>: > > > The algorithm itself should be tractable - Mark Davis has published > > an algorithm to generate all strings canonically equivalent to a > > Unicode string, and what we need might not be so complex. > > > Even this algorithm from Mark Davis will fail in this case: How so? The regexp is \u0F73*, which is converted to a non-capturing (\u0F71\u0F72)*. Given a string 0F40 0F71 0F73 0F42 representing the trace, matching will fail at 0F40 and an attempt will be made starting at the 0F71. The input string handling part will then present a run of three non-starters: \u0F71 \u0F71 \u0F72 I think the process is even simpler than I first thought. The engine will look for a match for \u0F71, and take it from this list, leaving \u0F71 \u0F72. It will then look for a match for \u0F72, and take it form the list, leaving \u0F71. It will then look for a match for \u0F71, and take it from the list. It will then look for a match for \u0F72. It will fail, and then back track, disgorging the \0F71. The input 'stream' now looks like \u0F71 \u0F42. This will match nothing; it is after the matching substream. The matching substring is: None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42. Its value, as a trace, is 0F71 0F72. > - You can use it easily to transform a regexp containing (\u0F73) > into a regexp containing (\u0F73|\u0F71\u0F72|\u0F71\u0F72) That is *not* what I am suggesting. The regex needs decomposing, but no other transformations. 
It is the string representing the input trace that is expanded.

> - But this leaves the same problem for unbounded repetitions with
> the "+" or "*" or "{m,}" operators.

Not at all - that is the beauty of the scheme. On the regex side, \u0F73* is as straightforward as non-capturing (\u0061\u0062)*. Putting back the unused fragments of the run of non-starters in the input trace is the most difficult part.

> Now all the problem is how to do the backtracking,

Yes, that may be more difficult than I thought. Comparing against literal characters is simple, but it may be more complicated when matching against a list of alternative characters. Backtracking schemes may not be set up to try the next character on a list of alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input string 0F71 0F72. The alternative (\u0f72|\u0f71) would first take the 0F72, and only on backtracking would it take the 0F71 instead. This is an issue with traces that does not appear with strings.

Richard.

From verdy_p at wanadoo.fr Mon May 18 15:43:42 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 18 May 2015 22:43:42 +0200
Subject: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net>
References: <20150518131038.665a7a7059d7ee80bb4d670165c8327d.b91abf14bc.wbe@email03.secureserver.net>
Message-ID:

If ever the country codes used in BCP 47 become full (all pairs of letters used), then some time before this happens we could see new prefixes added before a new range of codes. It is possible to use a 1-letter prefix for new country/territory code extensions, but with some maintenance of BCP 47 parsing rules (notably, the letter used should not be reordered with other singleton prefixes).

But I feel it will first be simpler to assign a special 2-letter code like "C1-" followed by a new series of 2-letter country codes. (ccTLDs will survive; in fact, with the development of new gTLDs not limited to 2 characters, new countries will prefer asking for a more descriptive gTLD, even if they don't have a 2-letter ccTLD. Or 2-letter codes will be deprecated in favor of 3-letter codes, but the IETF will keep all the existing 2-letter ccTLDs as long as their sponsors support them and don't require changing to another TLD, even if this breaks existing URLs encoded throughout the web.)

There's no requirement for ISO 3166 codes to match exactly with a TLD in the global DNS (this has long been the case for the ".uk" ccTLD, because ".gb" is almost unused). But the stability of country codes is desirable as well in URLs (stored within encoded documents), for which it will be hard to make global substitutions: the solution could be to use tracking dates to resolve domain names, but the worldwide DNS currently does not support this type of query by date, registrars would not like to have to keep history files for long, and software/OS developers don't want to include and maintain such data in their domain name resolving clients.

It is however possible that at some point in the future the existing URLs requiring domain names will be deprecated in favor of unique IDs (e.g. based on IPv6): users won't see domain names, but labels retrieved from some whois-like database, or shown by search engines and possibly translated. It would be an improvement even if it breaks the business of existing registrars (however, registrars will still have business selling PKI-related services). These IDs can also be used in URIs.
In fact the DNS system is already antique in its design (with its very strange and complex encoding for IDNA that no one can read).

From richard.wordingham at ntlworld.com Mon May 18 15:46:44 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 18 May 2015 21:46:44 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <83mw11ekt6.fsf@gnu.org>
References: <20150518193545.51cb95b8@JRWUBU2> <83mw11ekt6.fsf@gnu.org>
Message-ID: <20150518214644.024f8c42@JRWUBU2>

On Mon, 18 May 2015 22:40:21 +0300 Eli Zaretskii wrote:

>> Date: Mon, 18 May 2015 19:35:45 +0100
>> From: Richard Wordingham
>>
>> Mark Davis has published an algorithm to generate all strings
>> canonically equivalent to a Unicode string
>
> Where can I find the description of that algorithm?

Section 5 of http://unicode.org/notes/tn5/ . There's a lot of detail missing, and it's easy to overlook the Hangul syllables. The complete code is rather more complicated than it looks from the wording, especially if you want successive candidates on successive calls. You also need to include the legal permutations of the non-starters - the code as given only delivers the FCD canonical equivalents.

On further thought, I also think it's actually unnecessary for this application.

Richard.

From verdy_p at wanadoo.fr Mon May 18 15:56:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 18 May 2015 22:56:47 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150518213202.19ef7cd2@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2>
Message-ID:

Isn't it possible for your basic substitution to transform \u0F73 into a character class [\u0F71\u0F72\u0F73] that the regexp considers as a single entity to check? In that case, backtracking for matching \u0F73*\u0F72 is simpler: [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking over one character class (instead of one character).
It is also possible to transform \u0F73*\u0F72 into the really equivalent: (\u0F71\u0F72)*\u0F72 | (\u0F72\u0F71)*\u0F72 | (\u0F73)*\u0F72 (assuming that in the non-capturing group you are already performing canonical reorderings using counters - as many counters as there are distinct ccc values in these groups, excluding blockers, which create groups that are always matched separately without any need to backtrack "through" them: if this does not match at a blocking position, there's no other alternative possible, so this is a definitive non-match).
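The transformation the two posters are debating can be checked directly with Python's unicodedata module. The following is a minimal illustrative sketch (not either poster's actual engine, and the function name is the editor's own) showing that U+0F73 decomposes under NFD into U+0F71 followed by U+0F72, and how the example trace above looks once decomposed:

```python
import unicodedata

# U+0F73 TIBETAN VOWEL SIGN II canonically decomposes to
# U+0F71 TIBETAN VOWEL SIGN AA + U+0F72 TIBETAN VOWEL SIGN I.
assert unicodedata.normalize('NFD', '\u0F73') == '\u0F71\u0F72'

def decompose_pattern_literal(ch):
    """Replace a pattern literal by its NFD code points -- the
    preprocessing step Richard describes for the regex side."""
    return unicodedata.normalize('NFD', ch)

# The input trace 0F40 0F71 0F73 0F42 from the example above, in NFD:
trace = unicodedata.normalize('NFD', '\u0F40\u0F71\u0F73\u0F42')
print([hex(ord(c)) for c in trace])
# ['0xf40', '0xf71', '0xf71', '0xf72', '0xf42'] -- the run of three
# non-starters 0F71 0F71 0F72 sits between the two starters.
```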
From richard.wordingham at ntlworld.com Mon May 18 16:14:11 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 18 May 2015 22:14:11 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To:
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2>
Message-ID: <20150518221411.4c508924@JRWUBU2>

On Mon, 18 May 2015 22:56:47 +0200 Philippe Verdy wrote:

> Isn't it possible for your basic substitution to transform \u0F73
> into a character class [\u0F71\u0F72\u0F73] that the regexp considers
> as a single entity to check?
> In that case, backtracking for matching \u0F73*\u0F72 is simpler:
> [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking over
> one character class (instead of one character).

I'm still waiting for your explanation of how your scheme for European diacritics (as used in SE Asia) would work. This thread is intended for the idea of using the regex to decide which character to take as the next character from the input trace. In the other thread, I'm still not sure whether you're working with traces or strings.

Richard.

From doug at ewellic.org Mon May 18 16:38:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 14:38:19 -0700
Subject: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>

Philippe Verdy wrote:

>> ISO 3166-1 already defines alpha-3 and numeric code elements, as well
>> as alpha-2.
>
> But how do we work within the two-letter limitation when the world
> wants more stability in codes? This was an important reason why ISO
> 639 was not fully integrated into IETF tags, and why the IETF tags
> chose stability by also keeping the codes that have been deleted in
> ISO 639, which are only deprecated in IETF language tags (BCP 47).

I assume you're aware of the extent of my involvement in BCP 47, so this is a semi-rhetorical question. If and when ISO 3166/MA manages to use up all of the remaining 336 unassigned code elements -- nearly half of the TOTAL possible code space of 676 two-letter combinations -- the corresponding numeric code elements will be assigned as BCP 47 region subtags instead.

> We've already seen the famous reuse well before 50 years (do you
> remember when CS was reassigned just a few months after it was
> discarded, after an initial introduction for some months for
> Serbia-Montenegro?)

What actually happened was, 'CS' was withdrawn for Czechoslovakia and then assigned to Serbia and Montenegro. At that time, the waiting period was five years; the 'CS' incident is what resulted in the change to 50 years.

> But now let's remember that parts of ISO 3166 are also included (not
> fully) in BCP 47 tags, which require stability. That will prohibit
> reassignments by ISO (or, if a reassignment happens, it will break
> BCP 47, and the IETF will reject the change and use another subtag if
> needed).

Again, I'm guessing you already know that I know how BCP 47 works. ISO 3166/MA can recycle alpha-2 code elements 50 years after withdrawal if they feel like it. BCP 47 can't prevent that. That's why BCP 47 has a mechanism to work around that possibility.

> So country codes cannot be reassigned (and we can expect many more
> merges/splits or changes of regimes in the many troubled areas of the
> world).

Changes of regimes don't usually result in new 3166 code elements. The same is true for merges (look at DE/DD or YE/YD). New and changed country names usually do.
--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org Mon May 18 16:55:02 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 14:55:02 -0700
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> If ever the country codes used in BCP 47 become full (all pairs of
> letters used), then some time before this happens we could see new
> prefixes added before a new range of codes. It is possible to use a
> 1-letter prefix for new country/territory code extensions, but with
> some maintenance of BCP 47 parsing rules (notably, the letter used
> should not be reordered with other singleton prefixes).

This would be a major revision to BCP 47, it would have nothing to do with reordering, and it would not in any case involve 1-letter prefixes, which already have a different meaning. And the time frame we are talking about is reminiscent of Ken's estimate of when 17 planes will no longer be enough for Unicode.

> But I feel it will first be simpler to assign a special 2-letter code
> like "C1-" followed by a new series of 2-letter country codes

We actually thought about this stuff over in LTRU. Really. I'm not the least bit concerned about the DNS. Five years from now they could be assigning TLDs consisting entirely of emoji. This is no longer relevant to flag tags or anything else Unicode.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr Mon May 18 17:08:27 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 00:08:27 +0200
Subject: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>
References: <20150518143819.665a7a7059d7ee80bb4d670165c8327d.dd6af4f7c2.wbe@email03.secureserver.net>
Message-ID:

2015-05-18 23:38 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
>> So country codes cannot be reassigned (and we can expect many more
>> merges/splits or changes of regimes in the many troubled areas of the
>> world).
>
> Changes of regimes don't usually result in new 3166 code elements. The
> same is true for merges (look at DE/DD or YE/YD). New and changed
> country names usually do.

I included merges only to be complete, because they frequently occur a little while after a split (though not with the former partner). But of course merges are much less frequent than splits. And in today's globalized world, splits are even easier than they were in the past (where merges were the result of invasions/wars/conquests). The rate of splits is in fact accelerating through history, even in countries living in peace; this does not mean that they terminate all their partnerships, just that they take the right to create their own alliances. There are reasons for them: cultural (language), national taxes, economic difficulties in some regions, unemployment, management of resources (water, constructible or cultivable soils), but the most important reasons are political (defiance between political parties, or brutality against minorities and mutual misunderstanding)...

In the last 50 years the most important changes came from decolonization and its independences (that was completed at the end of the 1970s). But now we are seeing splits into much smaller entities, and this can occur in many more places.
With ISO 3166-2 the situation within countries is much more complex and changes more frequently. In Europe, most countries are undergoing large changes in their administrative divisions. The changes that will occur next year in French regions are still not taken into account in ISO 3166-2, nor is the change that is already effective within one department, split in two parts with only one remaining as a department, the other being a group of communes erected into a new territorial collectivity taking all powers of its former department for local administration only, but with the national power still not divided in what is now a "circonscription départementale" with the same departmental prefecture as before the split. The hierarchical model of subdivisions has in fact lots of exceptions (look at Spain, the UK, Germany; it was already true for France and the US, but now it is also occurring even in the metropolitan area). In fact we can see several parallel layers of subdivisions, but for different legal roles/missions.

ISO 3166-1 also assumes that everything is a country, but that is already wrong for some dependent territories (not all) of France, the UK, the US, the Netherlands, Spain, and possibly some islands of China. And these codes also don't map correctly to effective national divisions (the encoding for claims in Antarctica remains ambiguous, depending on who uses the data). There are also reservations for things that are not countries but groups of countries (EU, WIPO areas...), and there could exist new codes for other international alliances (these look like "merges" except that they are not full merges and the entities continue to coexist separately).

From verdy_p at wanadoo.fr Mon May 18 17:25:33 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 00:25:33 +0200
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>
References: <20150518145502.665a7a7059d7ee80bb4d670165c8327d.255a63ba7a.wbe@email03.secureserver.net>
Message-ID:

2015-05-18 23:55 GMT+02:00 Doug Ewell:

> Philippe Verdy wrote:
>
>> If ever the country codes used in BCP 47 become full (all pairs of
>> letters used), then some time before this happens we could see new
>> prefixes added before a new range of codes. It is possible to use a
>> 1-letter prefix for new country/territory code extensions, but with
>> some maintenance of BCP 47 parsing rules (notably, the letter used
>> should not be reordered with other singleton prefixes).
>
> This would be a major revision to BCP 47, it would have nothing to do
> with reordering,

It would have to do with reordering, because all subtags after the primary language subtag in BCP 47 are optional, and you can distinguish them only by their length *or* by the role assigned to specific singletons: there's already the "x" singleton exception (which is ordered at the end), but the other singletons are currently described as using a canonical order, and they are used only for encoding variants unrelated to region subtags or even to the languages.
Very few singletons are used in fact. (The singleton subtags occurring at the start of the tag are also treated separately from the others: they could be used to support new syntaxes for BCP 47 tags, but for now we just have "i-", deprecated but still valid, and "x-" for private use; for all other letters there's no parsing defined for now, their syntax is unknown, and they are not interchangeable without a standard, so they are used only for private use. Another constraint comes from the length limit of subtags: the first subtag is either a special singleton or a primary language code using 2 or 3 letters for now. Some BCP 47 use an empty first subtag, i.e. the tag starts with a hyphen; double hyphens could be used as extensions to change the parsing rules locally and possibly return to the next logical subtag, and could be used to encode international organizations without needing a formal "exceptional reservation" in ISO 3166-1; for example "*-EU" could have been encoded as "--O-EU", and we could have the same system for NATO, EEA, EFTA... There's still ample space for extensions of parsing rules in BCP 47, but not in ISO 3166.)

ISO 3166 also encodes some 4-letter codes, but they are not used in BCP 47 (so there's no confusion with 4-letter script codes).

From doug at ewellic.org Mon May 18 17:50:50 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 18 May 2015 15:50:50 -0700
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
Message-ID: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>

This is why I knew I would regret it. Clearing up some errors here. No more posts from me on this non-Unicode topic after this one.

Philippe Verdy wrote:

>> This would be a major revision to BCP 47, it would have nothing to do
>> with reordering,
>
> It would have to do with reordering, because all subtags after the
> primary language subtag in BCP 47 are optional, and you can
> distinguish them only by their length *or* by the role assigned to
> specific singletons: there's already the "x" singleton exception
> (which is ordered at the end), but the other singletons are currently
> described as using a canonical order, and they are used only for
> encoding variants unrelated to region subtags or even to the
> languages.

All non-initial singletons introduce an extension, except for 'x', which introduces a private-use sequence, and which must be last. Even if an extension were defined to hold top-level region information, WHICH WILL NEVER HAPPEN, it would not matter whether that extension appeared before or after other extensions, because it would be an extension and not a region subtag.

> but for now we just have "i-", deprecated but still valid,

"i-" is not deprecated.

> for all other letters there's no parsing defined for now, their syntax
> is unknown, and they are not interchangeable without a standard, so
> they are used only for private use

Extension 't' was defined in 2011 and 'u' in 2010. They have well-defined syntax, specified in RFC 6497 and 6067 respectively. Undefined singletons may not be used for private use.

> some BCP47 use an empty first subtag, i.e. the tag starts with a
> hyphen;

Absolutely, utterly false.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
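As background to the singleton rules Doug cites: the following is a hedged, illustrative sketch of how non-initial singletons partition a tag into extensions and a private-use sequence. It is not a validating BCP 47 parser (real validation needs the full RFC 5646 ABNF and the IANA registry, and grandfathered "i-" tags are out of scope here); the function name is the editor's own.

```python
def split_extensions(tag):
    """Partition a language tag into (main subtags, extensions, private use).
    Non-initial singletons introduce extensions; 'x' introduces the
    private-use sequence and must come last. Illustrative only."""
    subtags = tag.lower().split('-')
    main, extensions, private = [], {}, []
    current = None
    for st in subtags:
        if len(st) == 1:                  # a singleton subtag
            if st == 'x':
                current = private         # everything after 'x' is private use
            else:
                if st in extensions:
                    raise ValueError("repeated singleton: " + st)
                current = extensions.setdefault(st, [])
        elif current is not None:
            current.append(st)
        else:
            main.append(st)
    return main, extensions, private

# Extension 'u' (RFC 6067) plus a private-use sequence on a Danish tag:
print(split_extensions("da-DK-u-ca-gregory-x-private"))
# (['da', 'dk'], {'u': ['ca', 'gregory']}, ['private'])
```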
From verdy_p at wanadoo.fr Mon May 18 18:25:54 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 01:25:54 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150518221411.4c508924@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2>
Message-ID:

I don't work with strings, but with what you seem to call "traces", though I call them sets of states. They are in fact bitsets, which may be compacted or just stored as arrays of bytes containing just 1 useful bit, which may be a bit faster; byte arrays are just simpler to program. These live in a stack. (I'll use bitsets later to make the structure more compact if needed, but for now this is fast enough and not memory-intensive, even for large regexps with many repetitions with "+/*/{m,n}" or variable parts.) The internal matcher uses NFD, but needs to track the positions in the original buffered input for returning captured matches.

There's some optimization to reduce the size of the bitsets, by defining classes. The representation of classes in Unicode is more challenging than with plain ASCII or ISO 8859-*; for this reason I limit their length (the difference between the smallest and highest code point), and above this size the classes are just defined as a sorted string of pairs of code points: I can perform a binary search in that string and look at the position (with an even position the character is part of the class; with an odd position, the character is not part of it).

Thanks to a previous message you posted, I noted that my code does not work correctly with Hangul precomposed syllables. (I perform the decomposition to NFD of the input on the fly in the input buffer, but the buffer is incorrectly advanced when there's a match to the next character, and it can skip one or two characters of the original input instead of code points in the NFD-transformed input.) I don't have extensive cases for testing Hangul; I have much more for Latin, Greek, Cyrillic and Arabic, but also too few for Hebrew, where "pathological" cases of regexps are certainly more likely to occur than in Latin, even compared with Vietnamese and its frequent double diacritics.

For now, with the complex cases of replacements, I have no precise syntax defined for specifying replacements as a simple string with placeholders. I just allow these matches to be passed as objects (rather than just strings) to a callback that performs the substitutions itself, using the array of captures given by the engine to the callback. I have no idea for now about how to handle the special cases occurring when computing the actual replacements: the callback can insert/delete subsequences anywhere in the input buffer, which is limited in size by the extent of $0, plus any intermediate characters when there's a discontinuity, plus their left and right contexts when the match still does not include the full combining sequences. (For most use cases, the left context is empty, but the right context is frequently non-empty and contains all combining characters over the last base which is part of the match. The callback also does not have to modify the input buffer if it does not want to perform replacements in it; in that case the input buffer is read-only and I don't need to fill the contexts, which remain empty.) There are also left and right context variables for *each* capture group (some of them may be partly or fully inside another returned capture group).
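The buffer-advance bug Philippe describes above (one original character becoming several NFD code points) is typically avoided by decomposing up front and keeping a map from each NFD code point back to the index of the original character it came from. A minimal sketch under that assumption: it decomposes character by character and then applies the canonical ordering step itself, which is enough to illustrate the bookkeeping, though a real engine would use its normalizer's own API.

```python
import unicodedata

def nfd_with_index_map(text):
    """Decompose to NFD while recording, for every NFD code point, the
    index of the original character it came from. A precomposed Hangul
    syllable maps to 2-3 jamo that all carry the same original index,
    so a matcher can advance the original buffer correctly."""
    pairs = []  # (nfd code point, index in original text)
    for i, ch in enumerate(text):
        for c in unicodedata.normalize('NFD', ch):
            pairs.append((c, i))
    # Canonical Ordering Algorithm: bubble-sort adjacent nonzero
    # combining classes into nondecreasing order (stable).
    changed = True
    while changed:
        changed = False
        for j in range(len(pairs) - 1):
            a = unicodedata.combining(pairs[j][0])
            b = unicodedata.combining(pairs[j + 1][0])
            if a > b > 0:
                pairs[j], pairs[j + 1] = pairs[j + 1], pairs[j]
                changed = True
    return ''.join(c for c, _ in pairs), [i for _, i in pairs]

# U+AC01 HANGUL SYLLABLE GAG decomposes to three jamo, all mapping to
# original index 0; U+0F73 decomposes to two vowel signs at index 1.
nfd, idx = nfd_with_index_map('\uAC01\u0F73')
print([hex(ord(c)) for c in nfd], idx)
# ['0x1100', '0x1161', '0x11a8', '0xf71', '0xf72'] [0, 0, 0, 1, 1]
```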
Finally, a question: I suppose that, like many programmers, you have read the famous "green dragon" book on compilers by Aho/Sethi/Ullman. I can understand the terminology they use when speaking about automata (which is found in many other places), but apparently you are using some terms that I have to guess from their context. Good books on the subject are now becoming difficult to find (or they are more expensive now), and too difficult to use on the web. (For such very technical topics, it really helps to have a printed copy that you can annotate, explore, or have beside you instead of on a screen, and printing ebooks is not an option if they are voluminous.) Maybe you have other books to recommend. But finding these books in libraries is now becoming difficult when many are closing or reducing their collections (and I don't like buying books on the Internet).

For the rest, I tend to just describe what I've made or used or experimented with, even if the terms are not the best ones (some of my references are in French, and difficult to translate). On difficult topics like this one, I'm not paid to perform research and I can only do that in my spare time, from time to time, until I can make something stable enough for a limited use (without experimental features). In the past I could work on such research topics, but now we are pressed to use existing libraries and not spend a lot of time; we sell smaller incremental but limited improvements, and we know what is voluntarily limited and left unimplemented.

From verdy_p at wanadoo.fr Mon May 18 18:36:21 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 01:36:21 +0200
Subject: [OT] RE: Flag tags with U+1F3F3 and subtypes
In-Reply-To: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>
References: <20150518155050.665a7a7059d7ee80bb4d670165c8327d.cdebbe1b8e.wbe@email03.secureserver.net>
Message-ID:

2015-05-19 0:50 GMT+02:00 Doug Ewell:

>> but for now we just have "i-", deprecated but still valid,
>
> "i-" is not deprecated.

In the IANA database they are all replaced. I call that "deprecated" a bit abusively, but there's no longer any interest in them.

>> for all other letters there's no parsing defined for now, their
>> syntax is unknown, and they are not interchangeable without a
>> standard, so they are used only for private use
>
> Extension 't' was defined in 2011 and 'u' in 2010. They have
> well-defined syntax, specified in RFC 6497 and 6067 respectively.
You are speaking of extension subtags after the initial subtag; I did not discuss them. I was just speaking about the initial subtag (before the first hyphen), where "t" and "u" are not defined: only "x" and "i" are defined there ("i" is not defined among the other singletons for trailing subtags).

> Undefined singletons may not be used for private use.

For private use (meaning NOT for interchange) NOTHING is forbidden; you are never bound to any standard. There are lots of places where these private extensions are used and not discussed.

>> some BCP47 use an empty first subtag, i.e. the tag starts with a
>> hyphen;
>
> Absolutely, utterly false.

Absolutely, utterly true, but a word was missing in my sentence: "some BCP 47 extensions" (which are private, local only to a specific software in its internal data).

From richard.wordingham at ntlworld.com Mon May 18 19:44:17 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 19 May 2015 01:44:17 +0100
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To:
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2>
Message-ID: <20150519014417.38d7115a@JRWUBU2>

On Tue, 19 May 2015 01:25:54 +0200 Philippe Verdy wrote:

> I don't work with strings, but with what you seem to call "traces",

For the concept of traces, Wikipedia suffices: https://fr.wikipedia.org/wiki/Mono%C3%AFde_des_traces . As far as text manipulation is concerned, the word 'trace' is an idealisation of how Latin text is written. Base letters advance the writing point, so they commute with nothing - canonical combining class 0. Ideally, marks of different canonical combining classes do not interact with one another when writing, so they commute. In general, marks of the same canonical combining class interact with one another, be it only to move the subsequent one further from the base letter, so they do not commute.

The traces I refer to are the equivalence classes of Unicode strings modulo canonical equivalence. To apply the theory, I have to regard decomposable characters as notations for sequences of 1 to 4 indecomposable characters. The notion works with compatibility equivalence, and one could use a stronger notion of equivalence, so that compatibility ideographs did not have singleton decompositions.

Thus, as strings, \u0323\u0302 and \u0302\u0323 are distinct, but as traces, they are identical. The lexicographic normal form that is most useful is simply NFD. The indecomposable characters are ordered by canonical combining class, and beyond that it doesn't matter; one may as well use the codepoint.

> but that I call sets of states (they are in fact bitsets, which may be
> compacted or just stored as arrays of bytes containing just 1 useful
> bit, which may be a bit faster; byte arrays are just simpler to
> program), in a stack (I'll use bitsets later to make the structure
> more compact, if needed, but for now this is fast enough and not
> memory-intensive even for large regexps with many repetitions with
> "+/*/{m,n}" or variable parts).

Your 'bitset' sounds like a general-purpose type, and like an implementation detail that surfaces in your discussion.

> The internal matcher uses NFD, but needs to track the positions in
> the original buffered input for returning captured matches.

That's how I'm working.
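Richard's point above that \u0323\u0302 and \u0302\u0323 are the same trace can be checked mechanically: two strings are canonically equivalent exactly when their NFDs are equal. A small illustration using U+0323 COMBINING DOT BELOW (ccc 220) and U+0302 COMBINING CIRCUMFLEX ACCENT (ccc 230) on an arbitrary base letter:

```python
import unicodedata

s1 = 'q\u0323\u0302'   # q + dot below + circumflex
s2 = 'q\u0302\u0323'   # q + circumflex + dot below

# Distinct as strings...
assert s1 != s2
# ...but identical as traces: marks of *different* combining classes
# commute, so both normalize to the same NFD (the ccc-sorted form).
assert unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2)

# Marks of the *same* class do not commute: two ccc-220 marks keep
# their relative order, so these are two different traces.
t1 = 'q\u0323\u0324'   # dot below, then diaeresis below
t2 = 'q\u0324\u0323'   # diaeresis below, then dot below
assert unicodedata.normalize('NFD', t1) != unicodedata.normalize('NFD', t2)
```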
I do not regard decomposable characters as atomic; I am emotionally happy working with fractions of characters.

> ... Greek, Cyrillic and Arabic, but also too few for Hebrew, where
> "pathological" cases of regexps are certainly more likely to occur
> than in Latin, even compared with Vietnamese and its frequent double
> diacritics.

I was just thinking that respecting canonical equivalence might be very useful for Hebrew, particularly when dealing with text with accents.

> Finally, a question:
>
> I suppose that, like many programmers, you have read the famous
> "green dragon" book on compilers by Aho/Sethi/Ullman. I can
> understand the terminology they use when speaking about automata
> (which is found in many other places), but apparently you are using
> some terms that I have to guess from their context.

No, I started off by hunting the web to try and work out what was special about a regular expression, and found the articles in Wikipedia quite helpful. When working out how to make matching respect canonical equivalence, I started out with normalising strings to NFD. Only after I had generalised the closure properties of regular languages from strings to these representative forms (with the exception of Kleene star) did I finally discover what I had long suspected: that I was not the first person to investigate regular expressions on non-free monoids. What does surprise me is that I cannot find any evidence that anyone else has made the connection between trace monoids and Unicode strings under canonical equivalence. I would like to update the article on the trace monoid with its most important example, Unicode strings under canonical equivalence, but, alas, that seems to be 'original research'!

I'm beginning to think that 'letting the regex choose the input character' might be a better method of dealing with interleaving of subexpressions even for 'non-deterministic' engines, i.e. those which follow all possible paths in parallel. I'll have to compare the relevant complexities.

> Good books on the subject are now becoming difficult to find (or they
> are more expensive now), and too difficult to use on the web. Maybe
> you have other books to recommend.

Google Books, in English, gives access to a very helpful chapter on regular languages in trace monoids in 'The Book of Traces'. I found Russ Cox's Internet notes on regular expressions helpful, though not everyone agrees with his love of non-determinism.

Richard.

From mark at macchiato.com Tue May 19 00:18:59 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 18 May 2015 22:18:59 -0700
Subject: Tag characters
In-Reply-To: <794493C42D714C3C8A58D2F45AA36663@DougEwell>
References: <794493C42D714C3C8A58D2F45AA36663@DougEwell>
Message-ID:

A few notes. A more concrete proposal will be in a PRI to be issued soon, and people will have a chance to comment more then. (I'm not trying to discourage discussion, just pointing out that there will be something more concrete relatively soon to comment on -- people are pretty busy getting 8.0 out the door right now.)

The principal reason for 3-digit codes is that this is the mechanism used by BCP 47 in case ISO screws up codes (as they did for CS). The syntax does not need to follow the 3166 syntax - the codes correspond but are not the same anyway. So we didn't see the necessity for the hyphen, syntactically.

There is a difference between EU and UN; the former is in BCP 47. That being said, we could look at making the exceptionally reserved codes valid for this purpose (or at least the UN code). It appears that there are only 3 exceptionally reserved codes that aren't in BCP 47: EZ, UK, UN.

Just because a code is valid doesn't mean that there is a flag associated with it, just as the fact that you can have the BCP 47 code ja-Ahom-AQ doesn't mean that it denotes anything useful. I'd expect vendors not to waste time with non-existent flags. However, we could also discuss having a mechanism in CLDR to help provide guidelines as to which subdivisions are suitable as flags.

Mark

*« Il meglio è l'inimico del bene »*

On Sat, May 16, 2015 at 10:07 AM, Doug Ewell wrote:

> L2/15-145R says:
>
>> On some platforms that support a number of emoji flags, there is
>> substantial demand to support additional flags for the following:
>> [...]
So we didn't see the necessity for the hyphen, syntactically. There is a difference between EU and UN; the former is in BCP47. That being said, we could look at making the exceptionally reserved codes valid for this purpose (or at least the UN code). It appears that there are only 3 exceptionally reserved codes that aren't in BCP47: EZ, UK, UN. Just because a code is valid doesn't mean that there is a flag associated with it. Just like the fact that you can have the BCP47 code ja-Ahom-AQ doesn't mean that it denotes anything useful. I'd expect vendors to not waste time with non-existent flags. However, we could also discuss having a mechanism in CLDR to help provide guidelines as to which subdivisions are suitable as flags. Mark *? Il meglio ? l?inimico del bene ?* On Sat, May 16, 2015 at 10:07 AM, Doug Ewell wrote: > L2/15-145R says: > > On some platforms that support a number of emoji flags, there is >> substantial demand to support additional flags for the following: >> [...] >> Certain supra-national regions, such as Europe (European Union flag) >> or the world (e.g. United Nations flag). These can be represented >> using UN M49 3-digit codes, for example "150" for Europe or "001" for >> World. >> > > These are uncomfortable equivalence classes. Not all countries in Europe > are members of the European Union, and the concept of "United Nations" is > not really the same by definition as "all countries in the world." > > The remaining UN M.49 code elements that don't have a 3166-1 equivalent > seem wholly unsuited for this mechanism (and those that do, don't need it). > There are no flags for "Middle Africa" or "Latin America and the Caribbean" > or "Landlocked developing countries." > > Some trans-national organizations might _almost_ seem as if they could be > shoehorned into an M.49 code element, like identifying 035 "South-Eastern > Asia" with the ASEAN flag, but this would be problematic for the same > reasons as 150 and 001. > > Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for > "European Union" and "UN" for "United Nations." If these flags are the use > cases, why not simply use those alpha-2 code elements, instead of burdening > the new mechanism with the 3-digit syntax? > > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue May 19 07:57:58 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 19 May 2015 14:57:58 +0200 Subject: Tag characters In-Reply-To: References: <794493C42D714C3C8A58D2F45AA36663@DougEwell> Message-ID: 2015-05-19 7:18 GMT+02:00 Mark Davis ?? : > There is a difference between EU and UN; the former is in BCP47. That > being said, we could look at making the exceptionally reserved codes valid > for this purpose (or at least the UN code). It appears that there are only > 3 exceptionally reserved codes that aren't in BCP47: EZ, UK, UN. > There are also reserved codes for WIPO areas; there are special codes requested by ITU and UPU or not removed from ISO3166 also on their demand for maintaining their own standards (may be there will be other codes requested by IATA and OACI or some international railways organisation, or maritime organisation for oceans in the "international waters"). 
Thankfully, for now we don't have to handle a specific "region" code for the Moon or "divisions" of the solar system, or even for some groups of orbital airspace over the Earth (from stratospheric to geostationary), as for now they are still considered international (and country laws only apply to individual pieces of equipment, or when they have to fall back to the ground, or preferably the oceans)... We could as well imagine other regions like the poles, or hemispheres, or 1-hour (15°) bands of longitude (excluding polar areas within the arctic/antarctic circle, or within the ±85° circle commonly used in geography for showing maps with Mercator projections).

There are various standards that define codes for their regions; some of them have political importance, and some have specific localized data associated with them, for which there must not exist collisions with existing or future ISO 3166-1 country codes. For such applications, however, applications should use the concept of "namespace" to qualify each code source (ISO 3166 being just one of them, the IETF being another, the local application using another namespace if needed for its regions; the same remark also applies if there's a need for private codes for "pseudo-languages" or "pseudo-language-variants" or "pseudo-scripts"), and with the mechanism of namespaces you could even track versions (like it is used in XMLNS).

From verdy_p at wanadoo.fr Tue May 19 07:58:33 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 19 May 2015 14:58:33 +0200
Subject: Regexes, Canonical Equivalence and Backtracking of Input
In-Reply-To: <20150519014417.38d7115a@JRWUBU2>
References: <20150518193545.51cb95b8@JRWUBU2> <20150518213202.19ef7cd2@JRWUBU2> <20150518221411.4c508924@JRWUBU2> <20150519014417.38d7115a@JRWUBU2>
Message-ID:

2015-05-19 2:44 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:

>> Good books on the subject are now becoming difficult to find (or they
>> are more expensive now), and too difficult to use on the web. Maybe
>> you have other books to recommend.
>
> Google Books, in English, gives access to a very helpful chapter on
> regular languages in trace monoids in 'The Book of Traces'.

[OT] It's interesting to see that books on this topic were published mostly after 1994. As I finished my training around that period, the subject was largely not covered before; and now that I live in a small city with no good scientific library, finding books in English on such topics is extremely rare (the only books I see are those published in French in the "for Dummies" series, and I find them completely uninteresting). As a consequence I buy many fewer scientific books now.

However, Wikipedia is not a convenient place for extensive (but progressive) coverage of a topic. (The one-page limit has a consequence: it's difficult to learn from these articles, and you can read them only if you already know most of the covered topics, or else you have to navigate randomly over many pages through random links.) Wikipedia remains useful only if you can isolate your search to a few smaller subtopics.
Wikibooks and Wikisource would be more useful for such extensive studies, but their content is very small. (For legal reasons, Wikisource cannot contain many scientific books about theories that were written after WW2: unfortunately this covers almost all research performed on computing theories, which exploded only after the 1960s, and in many areas the research was also protected by extensive patents in addition to copyrights; so the interesting books are published in English, extremely rarely translated, have a limited distribution, are expensive, and are found only in very few libraries, and only in some cities that have a scientific university; public libraries also don't have these books, which are too expensive.)

Now there's the net, but even Google Books exposes only some pages. (For the rest, Google Books proposes books that are even more expensive than in normal bookshops, and from random sellers that are frequently not trustworthy: e.g. I will never buy anything from Amazon if Amazon is not the seller, or from other similar large platforms on which you don't know who the seller is, or where the seller also wants us to pay abusive delivery/shipping costs without giving any warranty on the product and without even allowing us to trace the order; there are too many abusers, or sellers of products with severe defects; I prefer using French online selling platforms; in addition this saves money on taxes if the seller is in the EU; otherwise we experience long delivery delays in customs, and we also need to pay the tax on delivery, in addition to the initial cost, plus the currency exchange fees charged by the bank; all of these can easily double the total cost, and in the end there may also be a big disappointment with the product, and it's impossible to return it and get a refund.)

In summary, it is really bad that libraries are disappearing in many places, or are reduced to selling only a limited catalog "for the dummies" or popular books advertised in the media. The variety of books available for sale is decreasing dramatically now. The net cannot replace the books that you want to read slowly and keep as references for later reuse... except if the e-books you can buy online offer an option to get a "print on demand" of good quality, with reasonable costs and delays for the delivery (some French editors are proposing this "on demand printing" service, even for books from some other foreign editors). Note that this is not limited to scientific books; the system could be used for delivering all kinds of books (including literature, photography, magazines, newspapers, or rare research papers available only in one public university library, which could get some fees helping them to renew their own purchases...).

From doug at ewellic.org Tue May 19 10:19:09 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 May 2015 08:19:09 -0700
Subject: Tag characters
Message-ID: <20150519081909.665a7a7059d7ee80bb4d670165c8327d.adffac01f6.wbe@email03.secureserver.net>

Mark Davis ☕️ wrote:

> A more concrete proposal will be in a PRI to be issued soon, and
> people will have a chance to comment more then.

I'll hold off on most other questions until the PRI appears.

> The principal reason for 3-digit codes is that this is the mechanism
> used by BCP 47 in case ISO screws up codes (as they did for CS).

Hopefully the MA will adhere to the new 50-year limit.
The example given in the proposal talked about trans-national flags.

> The syntax does not need to follow the 3166 syntax - the codes
> correspond but are not the same anyway. So we didn't see the necessity
> for the hyphen, syntactically.

Well, the codes are the same, but you're defining a new syntax, so you get to remove the hyphen if you want to. But again, the proposal didn't say that.

> There is a difference between EU and UN; the former is in BCP 47.

I didn't know that was relevant to flag tagging.

> Just because a code is valid doesn't mean that there is a flag
> associated with it.

Of course not. I'd also not expect CLDR or Unicode or even vendors to keep track of every state and territory flag around the world. Vendors will support some subset of flags of their choice, just as they currently do, and that's consistent with existing Unicode principles about not having to display every possible character.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From wjgo_10009 at btinternet.com Tue May 19 11:25:37 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 19 May 2015 17:25:37 +0100 (BST)
Subject: Tag characters
Message-ID: <3532384.57721.1432052737187.JavaMail.defaultUser@defaultHost>

Doug Ewell wrote:

> Hopefully the MA will adhere to the new 50-year limit. The example
> given in the proposal talked about trans-national flags.

What is MA please?

A 50-year limit seems far too short a time. With that figure, a document could have its meaning retrospectively changed at least 20 years before its copyright runs out, and maybe a lot longer before its copyright runs out, maybe as much as 80 years before its copyright runs out, or even longer!

Surely for archiving our culture, and the British Library is actively archiving, there should never be a retrospective change of meaning.

William Overington

19 May 2015

From doug at ewellic.org Tue May 19 12:01:14 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 19 May 2015 10:01:14 -0700
Subject: Tag characters
Message-ID: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>

William_J_G Overington wrote:

>> Hopefully the MA will adhere to the new 50-year limit.
>
> What is MA please?

Maintenance Agency: http://www.iso.org/iso/home/standards/country_codes.htm

> A 50-year limit seems far too short a time.

There are two types of people: those who feel 50 years is too short, and those who feel it is too long. Fifty years is much better than five, which was the previous limit.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From petercon at microsoft.com Tue May 19 22:22:28 2015
From: petercon at microsoft.com (Peter Constable)
Date: Wed, 20 May 2015 03:22:28 +0000
Subject: Tag characters
In-Reply-To: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
Message-ID:

Evidently there are more than two types of people. There are those who feel 50 years is long enough; there are others who feel that five years is long enough; there are likely others who feel that 75 or 30 or some other value is long enough. Then there are also those who feel that any finite length is probably not long enough.
Peter

From wjgo_10009 at btinternet.com Wed May 20 11:29:28 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 20 May 2015 17:29:28 +0100 (BST)
Subject: Tag characters
In-Reply-To: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net>
Message-ID: <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost>

Peter Constable wrote as follows.

> Evidently there are more than two types of people. There are those who
> feel 50 years is long enough; there are others who feel that five
> years is long enough; there are likely others who feel that 75 or 30
> or some other value is long enough. Then there are also those who feel
> that any finite length is probably not long enough.

Unicode is about long-term stability. Hopefully the people in charge of the codes to be used for the flags will agree never to reuse a code.

Whether they do or not, would it be good to add an option into the tag coding of the flags whereby at the end one may optionally add TAG COLON and then at least four TAG DIGIT characters, those TAG DIGIT characters representing a year?

This feature would be ready if a future archivist finds the need to edit a text from years before so that it would display as its author intended, and indeed an author could use the method now so as to lock in his or her meaning.

This could also be of use now so as to display such items as the flag of the USA at various historical periods. It would be helpful if a particular year were chosen for normalization purposes: for example, so that the flag of the USA used in the 1940s and most of the 1950s would have one particular year rather than just any year within the period when that particular design of flag was in use. Also for other flags at various historical periods.

It has been speculated that, had Scotland left the United Kingdom as a result of the referendum last year (in the event, the people voted for Scotland to stay in the United Kingdom), the flag of the United Kingdom would have been changed, though some people advocated keeping it the same anyway.

William Overington

20 May 2015

From doug at ewellic.org Wed May 20 12:35:34 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 20 May 2015 10:35:34 -0700
Subject: Tag characters
Message-ID: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>

William_J_G Overington wrote:

> Hopefully the people in charge of the codes to be used for the flags
> will agree never to reuse a code.

Normally I would completely agree about the need for archival stability.

In this case, however, we are talking about flags used primarily as emoji, like the one in my signature block. People will pop these flags into their text messages alongside "party" or "celebration" icons.
I'm not sure the requirement for stability is quite as critical as it might be.

However...

> Whether they do or not, would it be good to add an option into the tag
> coding of the flags whereby at the end one may optionally add TAG
> COLON and then at least four TAG DIGIT characters, those TAG DIGIT
> characters representing a year?

It's remarkable how similar this suggestion is to a discussion between Philippe and me two years ago. There is currently no well-known coding system for flags -- the owner of the "Flags of the World" site doesn't know of one -- and there should be. (The term "flag code" already has two meanings that are very different from this, which makes it hard to find information.)

Getting UTC to accept the extended syntax of a standard like this would, of course, require that the standard gain reasonable acceptance and popularity beforehand. Requiring it to become an ISO standard might not be unreasonable.

If you want to discuss this specific idea further, please write to me privately and *not to the list*.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr Wed May 20 13:38:14 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 20 May 2015 20:38:14 +0200
Subject: Tag characters
In-Reply-To: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>
References: <20150520103534.665a7a7059d7ee80bb4d670165c8327d.e4427fe41b.wbe@email03.secureserver.net>
Message-ID:

Well, for now a reasonably stable standard exists: URLs, which can point to a collection of page names (each site can choose its own registry to name/encode the flags). URLs then return images (you can make a site that returns images in several formats and with variable sizes, or with some transforms such as rotations, flips, animations...).

Instead of just isolated URLs, you can organize them with a base URL, or a static URL with a query (acting as a resolver address), and then append the URN (the name or code of the flag, which can include historic variants), and then allow the base URL to be replaced: keep just part of the URL (the end of the pathname, or part of the query string) as "standard" and you get what is generally termed a "mirror". Mirrors, however, are not necessarily bound to remain on the web; they can be any local store (e.g. a local file, or a folder in your filesystem).

Basically, even the existing FOTW site (and its mirrors) can already be seen as supporting these relatively stable URNs (provided that the site is not constantly restructuring its URLs, and file names are kept, or at least resolved by keeping internal redirecting links).

So what is needed is just a way to support URLs. However, URLs today can be IRIs and contain most of Unicode, and we cannot duplicate this code. It is however possible to do that by using the character sets used by Punycode (for domain names). But if FOTW just designs a naming convention for the paths it supports, so that it uses only a restricted set (ASCII letters, digits, and punctuation, with only some restrictions on slashes and controls), it is possible to use them as partial path names (excluding also file extensions in file names) that can be used as URNs and act as identifiers (all other parameters, such as size, transforms and image formats, should be separate parameters). And with this restricted set, it is possible to encode them in a stable (but still very extensible) way.
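The tag-character mechanism this thread keeps circling maps ASCII to the U+E0000 tag block and terminates with U+E007F CANCEL TAG. A minimal sketch of that encoding follows; note that the base character is an assumption here (the thread's subject uses U+1F3F3 WAVING WHITE FLAG, while the mechanism as eventually standardized in UTS #51 uses U+1F3F4 WAVING BLACK FLAG with lowercase subdivision codes and no hyphen), and the function name is the editor's own.

```python
TAG_OFFSET = 0xE0000          # TAG SPACE..TAG TILDE shadow ASCII 0x20..0x7E
CANCEL_TAG = '\U000E007F'     # ends the tag sequence

def flag_tag_sequence(subdivision_code, base='\U0001F3F3'):
    """Encode e.g. 'gbsct' (Scotland) as base + tag characters + CANCEL TAG.
    The default base is the draft's U+1F3F3; swap in U+1F3F4 for the form
    that was eventually standardized."""
    tags = ''.join(chr(TAG_OFFSET + ord(c)) for c in subdivision_code)
    return base + tags + CANCEL_TAG

seq = flag_tag_sequence('gbsct')
print(['U+%04X' % ord(c) for c in seq])
# ['U+1F3F3', 'U+E0067', 'U+E0062', 'U+E0073', 'U+E0063', 'U+E0074', 'U+E007F']
```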
2015-05-20 19:35 GMT+02:00 Doug Ewell : > William_J_G Overington > wrote: > > > Hopefully the people in charge of the codes to be used for the flags > > will agree never to reuse a code. > > Normally I would completely agree about the need for archival stability. > > In this case, however, we are talking about flags used primarily as > emoji, like the one in my signature block. People will pop these flags > into their text messages alongside "party" or "celebration" icons. I'm > not sure the requirement for stability is quite as critical as it might > be. > > However... > > > Whether they do or not, would it be good to add an option into the tag > > coding of the flags whereby at the end one may optionally add TAG > > COLON then at least four TAG DIGIT characters, those TAG DIGIT > > characters representing the year? > > It's remarkable how similar this suggestion is to a discussion between > Philippe and me two years ago. There is currently no well-known coding > system for flags -- the owner of the "Flags of the World" site doesn't > know of one -- and there should be. (The term "flag code" already has > two meanings that are very different from this, which makes it hard to > find information.) > > Getting UTC to accept the extended syntax of a standard like this would, > of course, require that the standard gain reasonable acceptance and > popularity beforehand. Requiring it to become an ISO standard might not > be unreasonable. > > If you want to discuss this specific idea further, please write to me > privately and *not to the list*. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed May 20 13:57:53 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 20 May 2015 11:57:53 -0700 Subject: Tag characters Message-ID: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Philippe Verdy wrote: > Well, for now a reasonably stable standard exists: URLs, which can > point to a collection of page names (each site can choose its own > registry to name/encode the flags) URLs are the opposite of stability. Anyone can post whatever they like, publish the URL, then change or remove the content at any time. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From richard.wordingham at ntlworld.com Wed May 20 17:28:56 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 20 May 2015 23:28:56 +0100 Subject: Tag characters In-Reply-To: <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> Message-ID: <20150520232856.01363823@JRWUBU2> On Wed, 20 May 2015 17:29:28 +0100 (BST) William_J_G Overington wrote: > This could also be of use now so as to display such items as the flag > of the USA at various historical periods. It would be helpful if a > particular year were chosen for normalization purposes: for example, > for the flag of the USA used in the 1940s and most of the 1950s, to have > one particular year rather than just any year within the period > when that particular design of flag was in use. That is a singularly poor example. An example that would jar is the use of the tricolour to represent France in an account of the Hundred Years' War, or the present German flag to represent Germany in an account of the Second World War.
A problem we have is that flags are not stable enough to use in plain text that is to last a human lifetime. > It has been speculated that had Scotland left the United Kingdom as a > result of the referendum last year (in the event, the people voted > for Scotland to stay in the United Kingdom), the flag of the > United Kingdom would have been changed, though some people > advocated keeping it the same anyway. It won't be kept if England secedes from the UK so as to leave the European Union. It may not be a likely outcome, but it's certainly a possibility. Richard. From verdy_p at wanadoo.fr Wed May 20 18:47:01 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 21 May 2015 01:47:01 +0200 Subject: Tag characters In-Reply-To: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: URLs were initially designed to be stable (and this is still a strong recommendation). However, I did not describe just URLs but URNs (for which URLs are just resolvers locating them). URNs share with URLs (and URIs in general, as well as the UCS) the initial "U", which is intended to be universal (both in space and in time). The problem is that it is still open to anyone who does not want to maintain this stability (and also, because URLs have a time limit, namely the registration period of their domain name, their universality in time is limited). The web is also currently having difficulties maintaining its universality in space (see the ongoing political discussions about its "neutrality"). URNs, however, should be stable... provided that there is a stable registry for maintaining the references. (The UCS is stable only because this registry exists and is managed by a joint authority, still active and with enough participants that no other attempts are made to compete with it with the same success.) Stability largely depends on the status of the standard that supports it, and on the number of interested people who want to participate. It is never guaranteed over a long time, as any participant may decide to retire from the project. But stability also requires that the participants do not change their minds about the project; such a change is less likely to occur if there are lots of users of the standard. Even the UCS has had its own history of instability in its early versions. And it is very difficult to maintain this stability when people frequently contest it (sometimes in the UCS this means that a new property must be designed to satisfy more people, but this also adds to the total cost of management of the whole standard). However, new sets of characters are now slowing down. The remaining ones are a few isolates to complement existing scripts, or scripts that are extremely similar in structure to existing ones, for which completely new solutions rarely need to be designed. The most important difficulties are solved, even for the remaining scripts that need to be encoded... except the more recent addition of emoji, where we still cannot see how they will be bounded in scope (and I count flags within emoji), and scripts with complex layouts for which standard solutions are still missing (e.g. SignWriting, hieroglyphs and old cuneiforms). We'll probably have more discussions about conventional symbols used in signalisation (e.g. signals on roads, including traffic lights, and marks on the ground), or conventional signs on products (standard conformance marks...) and various security-related symbols. We know we are stable only for alphabetic/phonetic scripts, but we have lots of candidate symbols and ideograms (whose creation and explosion are definitely not finished, and do not concern just CJK scripts). Industry and legislation are creating new symbols every day around the world... and also deprecating a lot at almost the same rate. So yes, URLs can be stable, but only those from recognized standards bodies that want to keep them stable (e.g. URLs to W3C standards are stable... but not necessarily all those linking to temporary discussions. The same is true for URLs to temporary work documents used by the UTC or ISO, or by the W3C itself, where documents may be moved elsewhere into archives and into other formats, losing some formatting details). 2015-05-20 20:57 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > Well, for now a reasonably stable standard exists: URLs, which can > > point to a collection of page names (each site can choose its own > > registry to name/encode the flags) > > URLs are the opposite of stability. Anyone can post whatever they like, > publish the URL, then change or remove the content at any time. -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Wed May 20 19:15:28 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 17:15:28 -0700 Subject: Tag characters In-Reply-To: <20150520232856.01363823@JRWUBU2> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> Message-ID: <555D23A0.2000808@ix.netcom.com> Have there been any discussions of the flag alphabet? (Signal flags). They are not that infrequently used online or in print, although the concentration tends to be higher in publications/sites geared to nautical audiences (not that different from chess pieces and chess publications). Now, before you leap on the "it's just a font" bandwagon, consider that the signal flags not only represent letters and digits, but also contain special pennants for functions like "repeat once" to "repeat four times" as well as a number of special flags that are associated with two-letter codes. Also, the use of certain individual flags has conventional meanings other than the letter itself, so a reference to the flag in text would not necessarily survive a font substitution, because you'd lose the fact that you are talking about flags. Some of these uses have spread to enthusiasts; for example, divers like to use the old "PO" flag (that curiously is now obsolete for this purpose) as a logo for their sport. The "diver down flag" (flag "A") is now a different one in the International Regulations for the Prevention of Collisions at Sea (IRPCAS), but for emoji-style use that would not matter, as the other one (whatever its origin) is now the recognized tribal symbol for divers. It seems to me that when schemes for representing sets of flags are discussed, it would be useful to keep open the ability to use the same scheme for signal flags -- perhaps with a different base character to avoid collisions in the letter codes.
A./ From richard.wordingham at ntlworld.com Wed May 20 20:08:06 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 21 May 2015 02:08:06 +0100 Subject: Tag characters In-Reply-To: <555D23A0.2000808@ix.netcom.com> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> Message-ID: <20150521020806.3bbaea6e@JRWUBU2> On Wed, 20 May 2015 17:15:28 -0700 "Asmus Freytag (t)" wrote: > Have there been any discussions of the flag alphabet? (Signal flags). > It seems to me that when schemes for representing sets of flags are > discussed, it would be useful to keep open the ability to use the > same scheme for signal flags -- perhaps with a different base > character to avoid collisions in the letter codes. If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. Braille is another precedent. Richard. From Shawn.Steele at microsoft.com Wed May 20 20:14:57 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Thu, 21 May 2015 01:14:57 +0000 Subject: Tag characters In-Reply-To: <20150521020806.3bbaea6e@JRWUBU2> References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> <20150521020806.3bbaea6e@JRWUBU2> Message-ID: I've always been a bit partial to them and found it odd that they are intentionally not included in Unicode. Especially the novel concepts like the repeats. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Wednesday, May 20, 2015 6:08 PM To: unicode at unicode.org Subject: Re: Tag characters On Wed, 20 May 2015 17:15:28 -0700 "Asmus Freytag (t)" wrote: > Have there been any discussions of the flag alphabet? (Signal flags). > It seems to me that when schemes for representing sets of flags are > discussed, it would be useful to keep open the ability to use the same > scheme for signal flags -- perhaps with a different base character to > avoid collisions in the letter codes. If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. Braille is another precedent. Richard. From doug at ewellic.org Wed May 20 21:11:25 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 20 May 2015 20:11:25 -0600 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: Philippe Verdy wrote: > URLs were initially designed to be stable (and this is still a strong > recommendation). [+ 559 words] It doesn't matter if they were designed to be stable. Users don't keep them stable. I can't believe we're debating whether URLs are stable on a list where people have raised concerns about whether 50 years is stable enough for ISO 3166-1. In any event, URLs that point to images would be an awful basis for an encoding. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
From eric.muller at efele.net Wed May 20 23:57:09 2015 From: eric.muller at efele.net (Eric Muller) Date: Wed, 20 May 2015 21:57:09 -0700 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: <555D65A5.4090705@efele.net> On 5/20/2015 7:11 PM, Doug Ewell wrote: > In any event, URLs that point to images would be an awful basis for an > encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. From asmus-inc at ix.netcom.com Thu May 21 00:13:17 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 22:13:17 -0700 Subject: Tag characters In-Reply-To: References: <20150519100114.665a7a7059d7ee80bb4d670165c8327d.d8120b6cd4.wbe@email03.secureserver.net> <7859769.54831.1432139368462.JavaMail.defaultUser@defaultHost> <20150520232856.01363823@JRWUBU2> <555D23A0.2000808@ix.netcom.com> <20150521020806.3bbaea6e@JRWUBU2> Message-ID: <555D696D.1000309@ix.netcom.com> On 5/20/2015 6:14 PM, Shawn Steele wrote: > I've always been a bit partial to them and found it odd that they are intentionally not included in Unicode. Especially the novel concepts like the repeats. :) If I were to write an actual proposal I would suggest naming them after their international/modern use, but with the understanding that the actual interpretation would be based on whatever signalling system you intend to follow. None of the existing users would be helped by having them named after their shapes and colors. That is because some of the shapes and colors are a bit complex and nobody I know learns them by description. In a way, this is also what we do for many standard alphabets. We encode LATIN SMALL LETTER O, not "small letter looking like a round circle", and we leave it to the language whether to pronounce that long like an "oh" or short, as in "hot" (for English) or more as an "oo" sound, as in Swedish. We pick a conventional name for the element of the alphabet, and then allow variations in use. (Some of the consonants show much greater variation in pronunciation.) When I said "naming", I meant we should use the alphabetic abbreviations that they are associated with, so that we can fit them into an open-ended system, like the other flags. Then, whatever techniques we will be using (such as UFLs - Universal Flag Locators) would apply to them analogously to the national flags. A./ > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham > Sent: Wednesday, May 20, 2015 6:08 PM > To: unicode at unicode.org > Subject: Re: Tag characters > > On Wed, 20 May 2015 17:15:28 -0700 > "Asmus Freytag (t)" wrote: > >> Have there been any discussions of the flag alphabet? (Signal flags). >> It seems to me that when schemes for representing sets of flags are >> discussed, it would be useful to keep open the ability to use the same >> scheme for signal flags -- perhaps with a different base character to >> avoid collisions in the letter codes. > If these are worthy of coding, I think the Unified Canadian Aboriginal Syllabics would be a better model - encode the form, not the semantic. > Braille is another precedent. > > Richard.
> > From asmus-inc at ix.netcom.com Thu May 21 00:14:45 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 20 May 2015 22:14:45 -0700 Subject: Tag characters In-Reply-To: <555D65A5.4090705@efele.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> Message-ID: <555D69C5.9040901@ix.netcom.com> On 5/20/2015 9:57 PM, Eric Muller wrote: > On 5/20/2015 7:11 PM, Doug Ewell wrote: >> In any event, URLs that point to images would be an awful basis for >> an encoding. > > I would make an exception for the URL > http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. > > Eric. > > > Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Thu May 21 10:46:01 2015 From: petercon at microsoft.com (Peter Constable) Date: Thu, 21 May 2015 15:46:01 +0000 Subject: Tag characters In-Reply-To: <555D69C5.9040901@ix.netcom.com> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: Would Unicode really want to get into the business of running a UFL service? P From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t) Sent: Wednesday, May 20, 2015 10:15 PM To: Eric Muller; unicode at unicode.org Subject: Re: Tag characters On 5/20/2015 9:57 PM, Eric Muller wrote: On 5/20/2015 7:11 PM, Doug Ewell wrote: In any event, URLs that point to images would be an awful basis for an encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From shizhao at gmail.com Thu May 21 10:06:12 2015 From: shizhao at gmail.com (shi zhao) Date: Thu, 21 May 2015 15:06:12 +0000 Subject: =?UTF-8?B?c2ltcGxpZmllZCBDaGluZXNlIHdvcmRzIO+8iOWcnyvku47vvIk=?= Message-ID: simplified Chinese words （土+从）, Hanyu pinyin: zong1, not in Unihan. simplified Chinese: (土+从) traditional Chinese: 㙡 (U+3661) see http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3661&useutf8=false http://glyphwiki.org/wiki/u2ff0-u571f-u4ece http://www.cnki.net/kcms/detail/Detail.aspx?dbname=CJFD2014&filename=KJSY201404019&v=MjA1NzdMdktMaWZZZDdHNEg5WE1xNDlFYllRSGZYZ3h2UjhRbUV3SlRReVFybVJFRnJDVVJMK2ZZdVJ1RkN2bFU=&filetitle=%E4%BB%8E%E8%AF%AF%E5%90%8D%E2%80%9C%E9%B8%A1%E6%9E%9E%E8%8F%8C%E2%80%9D%E7%9C%8B%E7%A7%91%E6%8A%80%E5%90%8D%E8%AF%8D%E8%A7%84%E8%8C%83%E5%8C%96 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: ???_???_????????_???.pdf Type: application/pdf Size: 133340 bytes Desc: not available URL: From verdy_p at wanadoo.fr Thu May 21 11:25:16 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 21 May 2015 18:25:16 +0200 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> Message-ID: 2015-05-21 4:11 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > URLs were initially designed to be stable (and this is still a strong >> recommendation). >> > [+ 559 words] > > It doesn't matter if they were designed to be stable. Users don't keep > them stable. > > I can't believe we're debating whether URLs are stable on a list where > people have raised concerns about whether 50 years is stable enough for ISO > 3166-1. > I just say that the URL encoding itself is stable and allows one to use them for stable references. The W3C itself uses URIs (in fact just URLs, even if they don't return a resource when queried) for making XML schemas identifiable. In SGML there are similar stable identifiers (but in a naming scheme). In both cases they are meant to make identifiers unique and stable over time. A URL does NOT have to return stable content; it JUST has to remain stable by itself. There's absolutely no obligation for its associated content to be accessible or retrievable. It will survive even if the referenced content is later changed or deleted: a URL is a valid URI, it is an identifier. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu May 21 12:49:57 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 21 May 2015 18:49:57 +0100 Subject: =?UTF-8?B?UmU6IHNpbXBsaWZpZWQgQ2hpbmVzZSB3b3JkcyDvvIjlnJ8r5LuO77yJ?= In-Reply-To: References: Message-ID: Hi Shi Zhao, The character 土+从 is not yet in Unicode, but it is scheduled for inclusion in CJK Extension F. You can see the character here (http://www.unicode.org/L2/L2014/14271-n4637.pdf on p. 148), but you should not rely on the code point, which will surely change. Andrew On 21 May 2015 at 16:06, shi zhao wrote: > simplified Chinese words （土+从）, Hanyu pinyin: zong1, not in Unihan. > > simplified Chinese: (土+从) > traditional Chinese: 㙡 (U+3661) > > see > http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=3661&useutf8=false > http://glyphwiki.org/wiki/u2ff0-u571f-u4ece > > http://www.cnki.net/kcms/detail/Detail.aspx?dbname=CJFD2014&filename=KJSY201404019&v=MjA1NzdMdktMaWZZZDdHNEg5WE1xNDlFYllRSGZYZ3h2UjhRbUV3SlRReVFybVJFRnJDVVJMK2ZZdVJ1RkN2bFU=&filetitle=%E4%BB%8E%E8%AF%AF%E5%90%8D%E2%80%9C%E9%B8%A1%E6%9E%9E%E8%8F%8C%E2%80%9D%E7%9C%8B%E7%A7%91%E6%8A%80%E5%90%8D%E8%AF%8D%E8%A7%84%E8%8C%83%E5%8C%96 > > From eik at iki.fi Thu May 21 14:52:34 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Thu, 21 May 2015 22:52:34 +0300 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <005901d093ff$aec230d0$0c469270$@fi> I don't think so. Sincerely, Erkki Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Peter Constable Lähetetty: 21. toukokuuta 2015 18:46 Vastaanottaja: Asmus Freytag (t); Eric Muller; unicode at unicode.org Aihe: RE: Tag characters Would Unicode really want to get into the business of running a UFL service?
P From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t) Sent: Wednesday, May 20, 2015 10:15 PM To: Eric Muller; unicode at unicode.org Subject: Re: Tag characters On 5/20/2015 9:57 PM, Eric Muller wrote: On 5/20/2015 7:11 PM, Doug Ewell wrote: In any event, URLs that point to images would be an awful basis for an encoding. I would make an exception for the URL http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html. Eric. Currently that gives me Not Found The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was not found on this server. :) However, I agree, all we need to do is create a UFL (Universal Flag Locator) and we can keep it as stable as we want. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu May 21 15:25:56 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 21 May 2015 13:25:56 -0700 Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <555E3F54.6020907@ix.netcom.com> On 5/21/2015 8:46 AM, Peter Constable wrote: > > Would Unicode really want to get into the business of running a UFL > service? > I suspect both Eric and I may have been slightly tongue-in-cheek with respect to UFLs... ... not sure about anybody else. Cheers, A./ > > P > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Asmus Freytag (t) > *Sent:* Wednesday, May 20, 2015 10:15 PM > *To:* Eric Muller; unicode at unicode.org > *Subject:* Re: Tag characters > > On 5/20/2015 9:57 PM, Eric Muller wrote: > > On 5/20/2015 7:11 PM, Doug Ewell wrote: > > In any event, URLs that point to images would be an awful > basis for an encoding. > > > I would make an exception for the URL > http://unicode.org/Public/8.0.0/ucd/StandardizedFlags.html > . > > Eric. > > > Currently that gives me > > > Not Found > > The requested URL /Public/8.0.0/ucd/StandardizedFlags.html was > not found on this server. > > > :) > > However, I agree, all we need to do is create a UFL (Universal Flag > Locator) and we can keep it as stable as we want. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri May 22 06:01:13 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 22 May 2015 12:01:13 +0100 (BST) Subject: Tag characters and localizable sentence technology (from Tag characters) Message-ID: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> Tag characters and localizable sentence technology (from Tag characters) I refer to the following documents, the first about localizable sentences and the second about, amongst other matters, applying tag characters using a new encoding format. http://www.unicode.org/L2/L2013/13079-loc-sentance.pdf http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Starting from the idea of the markup bubble from the first document and applying the tag method and the ISO standard document method from the second document, there arises the following possibility for the future of localizable sentence technology.
A single character would be added into Unicode, the name of the character being LOCALIZABLE SENTENCE BASE CHARACTER, and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. Please find attached a design for the glyph for the LOCALIZABLE SENTENCE BASE CHARACTER character. I designed the glyph by adapting and then combining the designs for localizable sentence markup bubble brackets from the first of the two documents referenced earlier in this text. Each localizable sentence, carefully written so that its meaning does not rely on any sentence previously used in the same document, would have a meaning expressed in words and possibly also a glyph: the more commonly used localizable sentences would each have a glyph, while other localizable sentences need not, though some could, as desired. William Overington 22 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: glyph_for_localizable_sentence_base_character.png Type: image/png Size: 872 bytes Desc: not available URL: From baskar115 at gmail.com Sat May 23 07:41:36 2015 From: baskar115 at gmail.com (baskar raj) Date: Sat, 23 May 2015 18:11:36 +0530 Subject: Regarding Unicode for new Symbol Message-ID: Hi, Is it possible to get a Unicode code point for a new symbol, designed for a commonly used word, for example let's say "and", which can be used in conjunction with numbers or letters? So is it possible to file an application seeking Unicode? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tomasek at etf.cuni.cz Sat May 23 13:50:19 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Sat, 23 May 2015 20:50:19 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330.000738.23342035.wl@gnu.org> References: <20150330.000738.23342035.wl@gnu.org> Message-ID: <20150523185019.GA7442@ebed.etf.cuni.cz> On Mon, Mar 30, 2015 at 12:07:38AM +0200, Werner LEMBERG wrote: > > > That's quite some variety. There are also the three-quarter flat and > > sharp in Western music to consider. I'll be able to dig into this > > after I get back to Ireland from Sweden on Friday. > > You should check the Standard Music Font Layout (SmuFL) for details; > it also has a freely available font that covers it. > > http://www.smufl.org > > The recent version of the specification can be found at > > http://www.smufl.org/files/smufl-1.12.pdf > > Werner Hm, it seems that there is much more to be encoded in Unicode than just the quarter-tone signs... Petr From asmus-inc at ix.netcom.com Sat May 23 14:09:33 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 23 May 2015 12:09:33 -0700 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <5560D06D.1080305@ix.netcom.com> On 5/23/2015 5:41 AM, baskar raj wrote: > Hi, > Is it possible to get a Unicode code point for a new symbol, designed for a > commonly used word, for example let's say "and", which can be used in > conjunction with numbers or letters? So is it possible to file an > application seeking Unicode? > > Generally, there is a problem with newly invented symbols (for any purpose).
It is often impossible to predict whether they will become successful, get widely adopted and thus become an essential part of written text. When Unicode encodes something, it is permanent. If it encodes a symbol that ultimately fails or quickly falls out of use, that failure is now permanent. That fact alone forces Unicode to be very cautious. There are some obvious exceptions. New currency symbols are being invented regularly. But as soon as they are officially declared, practically everyone using that currency has a need to use that symbol in text. Such symbols are practically guaranteed to be successful in a way that other novel symbols are not. Your case sounds like more of the latter; it would seem highly uncertain whether people will adopt your invention. As a result, Unicode would most likely want to encode your symbol only after it has proven itself, and not as a first step. So, while it is "possible" it appears extremely unlikely in this case, unless there are circumstances that you have not mentioned, such as official government support in the form of a spelling reform or something of that nature. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 23 17:45:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 00:45:38 +0200 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: But there's already a symbol encoded for this common word: it is part of the ASCII subset (&) and is already encoded as a symbol (even if initially it was designed as a cursive simplification of a ligature of the Latin letters "et", and used also within words containing these letters, in addition to the Latin word "et" itself). Some fonts still make the ligature more evident, but as a symbol it allows more variation of its shape (it is also used in a trademark symbol for the Orange telecommunication group, with a specific design, but for such usage as a logo, the encoded character is not suitable: logos are transported as images to specify also this shape and color design, not encoded in the character itself). 2015-05-23 14:41 GMT+02:00 baskar raj : > Hi, > Is it possible to get a Unicode code point for a new symbol, designed for a > commonly used word, for example let's say "and", which can be used in > conjunction with numbers or letters? So is it possible to file an > application seeking Unicode? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 23 18:00:59 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 01:00:59 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150523185019.GA7442@ebed.etf.cuni.cz> References: <20150330.000738.23342035.wl@gnu.org> <20150523185019.GA7442@ebed.etf.cuni.cz> Message-ID: 2015-05-23 20:50 GMT+02:00 Petr Tomasek : > Hm, it seems that there is much more to be encoded in Unicode than just > the quarter-tone signs... > Clearly not a valid argument against encoding a character. There are plenty of characters still not encoded even in scripts already encoded; this never meant that the encoded part should have been stalled until the set was "complete". Each encoded character has to be evaluated individually, even if it makes sense to add characters in groups when their association in that group is necessary to make them usable (for example, it would have been nonsense in any language to encode only Latin vowels without any consonant, but it would have been meaningful to encode only basic Arabic consonants and postpone the encoding of basic vowels). The merits of an encoding proposal are measured by its usage and usability in a well-established (orthographic) convention. It is important then to explore what this convention is and why more than one character is needed together for that convention. Then we can compare with other competing conventions what they have in common (this is what Unicode considers a "script", even if it is not necessarily for writing spoken languages). -------------- next part -------------- An HTML attachment was scrubbed... URL: From baskar115 at gmail.com Sat May 23 23:55:50 2015 From: baskar115 at gmail.com (baskar raj) Date: Sun, 24 May 2015 10:25:50 +0530 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: i just gave "and" as an example (verdy), i am just curious to know if we propose a symbol for a word does Unicode encode it or accept when it is already used by a small community of users, shall we claim in letterlike symbols (00–4F). (any possibility) or we can only implement in private use area until it is recognized - which is not possible for small mediums to get widely recognized other than bigger names like Microsoft or Apple proposing. On Sun, May 24, 2015 at 4:15 AM, Philippe Verdy wrote: > But there's already a symbol encoded for this common word: it is part of > the ASCII subset (&) and is already encoded as a symbol (even if initially > it was designed as a cursive simplification of a ligature of the Latin > letters "et", and used also within words containing these letters, in > addition to the Latin word "et" itself). > Some fonts still make the ligature more evident, but as a symbol it allows > more variation of its shape (it is also used in a trademark symbol for the > Orange telecommunication group, with a specific design, but for such usage > as a logo, the encoded character is not suitable: logos are transported as > images to specify also this shape and color design, not encoded in the > character itself). > > 2015-05-23 14:41 GMT+02:00 baskar raj : > >> Hi, >> Is it possible to get a Unicode code point for a new symbol, designed for a >> commonly used word, for example let's say "and", which can be used in >> conjunction with numbers or letters? So is it possible to file an >> application seeking Unicode? >> >> >> > -- Kind Regards, M Baskar Raj -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Sun May 24 03:02:49 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 24 May 2015 11:02:49 +0300 Subject: VS: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <000001d095f8$07cfeec0$176fcc40$@fi> You are not the first one to come up with this kind of a proposal (even for sentences), which has never received any noticeable support - for good reasons, I might add. Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel / Fax (by arr.): +358943682643 Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta baskar raj Lähetetty: 24.
toukokuuta 2015 07:56 Vastaanottaja: verdy_p at wanadoo.fr; unicode Unicode Discussion; asmus-inc at ix.netcom.com Aihe: Re: Regarding Unicode for new Symbol i just gave "and" as an example (verdy), i am just curious to know if we propose a symbol for a word does Unicode encode it or accept when it is already used by a small community of users, shall we claim in letterlike symbols (00–4F). (any possibility) or we can only implement in private use area until it is recognized - which is not possible for small mediums to get widely recognized other than bigger names like Microsoft or Apple proposing. On Sun, May 24, 2015 at 4:15 AM, Philippe Verdy wrote: But there's already a symbol encoded for this common word: it is part of the ASCII subset (&) and is already encoded as a symbol (even if initially it was designed as a cursive simplification of a ligature of the Latin letters "et", and used also within words containing these letters, in addition to the Latin word "et" itself). Some fonts still make the ligature more evident, but as a symbol it allows more variation of its shape (it is also used in a trademark symbol for the Orange telecommunication group, with a specific design, but for such usage as a logo, the encoded character is not suitable: logos are transported as images to specify also this shape and color design, not encoded in the character itself). 2015-05-23 14:41 GMT+02:00 baskar raj : Hi, Is it possible to get a Unicode code point for a new symbol, designed for a commonly used word, for example let's say "and", which can be used in conjunction with numbers or letters? So is it possible to file an application seeking Unicode? -- Kind Regards, M Baskar Raj -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sun May 24 04:25:53 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 24 May 2015 10:25:53 +0100 Subject: Regarding Unicode for new Symbol In-Reply-To: References: Message-ID: <20150524102553.1ce9a877@JRWUBU2> On Sun, 24 May 2015 10:25:50 +0530 baskar raj wrote: > i just gave "and" as an example (verdy), i am just curious to know if > we propose a symbol for a word does Unicode encode it or accept when > it is already used by a small community of users, shall we claim in > letterlike symbols (00–4F). (any possibility) > or we can only implement in private use area until it is recognized - > which is not possible for small mediums to get widely recognized > other than bigger names like Microsoft or Apple proposing. In general, a private use character can be promoted by including it in a generally useful font and providing soft keyboards that allow its use. There are two major exceptions to this - combining marks and characters that require a rendering engine. It might even be possible to get round these problems in many cases with a *lot* of ingenuity in the soft keyboards. I believe AAT fonts are a solution for the Apple world, but OpenType may be more difficult, and may need tackling application by application and renderer by renderer even with open source software. Another possible method would be to subvert the rendering engine. For open source applications, fonts using (SIL) Graphite often work. While Tai Tham was being encoded, I successfully used the PUA for generating word lists and successfully converted them to Unicode once the encoding was approved. My viewing tools were limited, and I was delighted when OpenOffice started supporting Graphite and when a version of Firefox appeared that also supported Graphite. There is another solution, which is *bad* but can work well for a short period. That solution is for a font to hijack a code point with the desired properties relevant to rendering. One solution along these lines, which may not yet be usable, would be to use a character with the right properties and then use a variation sequence to substitute one's own unrelated glyph. Gaps in character assignments tend to be used for these purposes (Lao is a good example), but renderer support varies. I remember that Windows XP initially didn't support U+0BB6 TAMIL LETTER SHA when using its native rendering stack. Richard. From tomasek at etf.cuni.cz Sun May 24 06:32:40 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Sun, 24 May 2015 13:32:40 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: References: <20150330.000738.23342035.wl@gnu.org> <20150523185019.GA7442@ebed.etf.cuni.cz> Message-ID: <20150524113240.GA15445@ebed.etf.cuni.cz> On Sun, May 24, 2015 at 01:00:59AM +0200, Philippe Verdy wrote: > 2015-05-23 20:50 GMT+02:00 Petr Tomasek : > > > Hm, it seems that there is much more to be encoded in Unicode than just > > the quarter-tone signs... > > > > Clearly not a valid argument against encoding a character. Where do I argue against encoding a character? I was just surprised by how many musical symbols there are which would benefit from being encoded in Unicode. No less and no more. P.T. > There are > plenty of characters still not encoded even in scripts already encoded; > this never meant that the encoded part should have been stalled until the > set was "complete". > Each encoded character has to be evaluated individually, even if it makes > sense to add characters in groups when their association in that group is > necessary to make them usable (for example, it would have been nonsense > in any language to encode only Latin vowels without any consonant, but it > would have been meaningful to encode only basic Arabic consonants and > postpone the encoding of basic vowels). > The merits of an encoding proposal are measured by its usage and usability > in a well-established (orthographic) convention. It is important then to > explore what this convention is and why more than one character is needed > together for that convention. Then we can compare with other competing > conventions what they have in common (this is what Unicode considers a > "script", even if it is not necessarily for writing spoken languages). > From samjnaa at gmail.com Sun May 24 07:25:02 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Sun, 24 May 2015 17:55:02 +0530 Subject: 25CC for dotted circle, but what for dashed box? Message-ID: I hope the subject line makes it clear. What character is to be used when a dashed box such as that shown for special-rendering characters in the code chart is required to be actually shown in text? -- Shriramana Sharma ???????????? ???????????? From samjnaa at gmail.com Sun May 24 10:36:10 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Sun, 24 May 2015 21:06:10 +0530 Subject: 25CC for dotted circle, but what for dashed box? In-Reply-To: References: Message-ID: Nice -- I was searching for "DASHED BOX" since that's what TUS 7.0 ch 24.1 refers to it as and there are too many "SQUARE" characters... -- Shriramana Sharma ???????????? ????????????
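The PUA-then-convert workflow Richard Wordingham describes earlier in this thread can be pictured in a few lines. This is a sketch only: the PUA assignments below are hypothetical (such assignments are private by definition), and the pairing with the final code points is invented for illustration.

    # Sketch of converting provisional PUA text to approved code points once
    # an encoding is accepted. The mapping is hypothetical; a real project
    # would list its own private assignments against the published chart.
    PUA_TO_FINAL = {
        0xE000: 0x1A20,  # provisional letter -> U+1A20 TAI THAM LETTER HIGH KA
        0xE001: 0x1A21,  # provisional letter -> another approved code point
        0xE002: 0x1A23,  # (these pairings are invented for illustration)
    }

    def convert(text):
        """Replace provisional PUA code points, leaving everything else intact."""
        return text.translate(PUA_TO_FINAL)

    wordlist = "\uE000\uE001 \uE002"   # text drafted with PUA assignments
    print(convert(wordlist))           # the same text in the approved encoding

Because the conversion is a plain code-point substitution, word lists and other plain-text data survive the transition unchanged apart from the scalar values, which is what makes the PUA-first approach workable.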
From verdy_p at wanadoo.fr Sun May 24 13:52:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 24 May 2015 20:52:26 +0200 Subject: Regarding Unicode for new Symbol In-Reply-To: <20150524102553.1ce9a877@JRWUBU2> References: <20150524102553.1ce9a877@JRWUBU2> Message-ID: 2015-05-24 11:25 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > There is another solution, which is *bad* but can work well for a short > period. That solution is for a font to hijack a code point with the > desired properties relevant to rendering. > It is not so bad when the usage is limited to some documents using specific fonts designed for this purpose. OK, it is not fully interchangeable, but it can be good for the start (including for creating documents showing the proposal for a new encoding). However, if we want to limit the propagation of this "bad" encoding in documents not specifically linked to a specific font, a good solution is to embed that font directly in the document (the PDF format is suitable for that, but you can also do that with HTML documents using embedded SVG images, which can themselves be embedded in SVG fonts embeddable in the document itself). No need to use a variation sequence (unless it is also recognized specifically in that embedded font). But it is not general enough for all complex scripts that require specific layout rules (GSUB/GPOS), notably when they are contextual. In summary we come back to the use of collections of glyphs (SVG) without any actual text rendering engine. With HTML5, the embedding of SVG is greatly facilitated (and can also be automated with some custom JavaScript transforming an easily composable syntax into a sequence of text and images). You can even apply some limited CSS styling that can apply to both the text and inline SVG images, provided your SVG is designed to be scalable within the current text line metrics, for example when it uses a "viewBox" attribute but not the "width" and "height" attributes that should be set by the default HTML box model: it will, however, work reliably only for full clusters occupying the standard line height and vertical alignment relative to the baseline, not for individual characters if they are combining or using some contextual layout. Now it's up to you to invent your own syntax for making the transform into sequences of plain text and inline images. However, you won't get some font-specific features such as hinting for small font sizes (SVG fonts currently have no standard way to include hinting instructions in order to transform the geometry of paths according to the physical device, and there are also difficulties with the specification of sizes in CSS, for example on Hi-DPI displays such as smartphones, or with the zoom in/out feature of browsers: it requires fine tuning not with the CSS "logical pixel" unit, scaled in logical "dpi", but with the newer "dppx" unit, plus some other metrics related to subpixels of the rendering surface, or the relative alignment of pixels with the physical positions, which are not necessarily in a simple grid, but mapped using "screening" techniques which are very common when printing). 
As far as I know, "font hinting" is still a work in progress (and has been for a long time); it is also very complex in TrueType/OpenType and has no real standard (only a few specialists can use it to design specific fonts and it is not easily reusable elsewhere), so nothing in this domain is supported by SVG fonts (for small font sizes the current solution is still to use bitmap images instead, assuming that the HTML rendering engine is using its best efforts to map the logical pixels of bitmaps into physical pixels or subpixels on the rendering surface, and to preserve their intended color gamut and contrasts without excessive distortions); in fact neither TrueType/OpenType nor SVG and CSS have any decent support for "screening" techniques, like those that have existed in PostScript for several decades; and for this reason, publishers still **love** PostScript for the fine tuning of the typography and images and for getting the best final result that the final printing medium can support. So PostScript fonts are definitely not dead, but they are still not sufficiently supported and used for display, due to lack of equivalent support in OSes and browsers (even in HTML5, there's still no decent support in the newest "canvas", which still has lots of quirks at this level, and also doesn't support any suitable screening). And most popular printers do not even have PostScript (it is replaced by the capabilities of the printer drivers doing all the work, via the more limited graphic APIs of the OS used by applications: those printers only support simple bitmaps). It is then still difficult to create, for the widest range of devices, any document embedding simultaneously plain text rendered with fonts, scalable images (such as SVG), and bitmap images (including photography), without first assuming some physical properties of the rendering surface (while also taking into account local preferences of the final user, such as zoom level, colorimetric profiles, choice of paper and print quality, or multiple displays). The "WYSIWYG" concept is just an advertised goal, but still a myth, as it is largely not implemented or supported. 2015-05-21 4:11 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > URLs were initially designed to be stable (and this is still a strong >> recommendation). >> > [+ 559 words] > > It doesn't matter if they were designed to be stable. Users don't keep > them stable. > > I can't believe we're debating whether URLs are stable on a list where > people have raised concerns about whether 50 years is stable enough for ISO > 3166-1. > > In any event, URLs that point to images would be an awful basis for an > encoding. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Tue May 26 08:48:51 2015 From: eric.muller at efele.net (Eric Muller) Date: Tue, 26 May 2015 06:48:51 -0700 Subject: Tag characters In-Reply-To: <555E3F54.6020907@ix.netcom.com> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <555E3F54.6020907@ix.netcom.com> Message-ID: <556479C3.9040805@efele.net> An HTML attachment was scrubbed... URL: From pzi at ingerman.org Tue May 26 09:45:37 2015 From: pzi at ingerman.org (Peter Zilahy Ingerman, PhD) Date: Tue, 26 May 2015 10:45:37 -0400 Subject: Tag characters In-Reply-To: <556479C3.9040805@efele.net> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <555E3F54.6020907@ix.netcom.com> <556479C3.9040805@efele.net> Message-ID: <55648711.5000203@ingerman.org> Aww... I was SURE you meant UFOs! On 2015-05-26 09:48, Eric Muller wrote: > On 5/21/2015 1:25 PM, Asmus Freytag (t) wrote: >> On 5/21/2015 8:46 AM, Peter Constable wrote: >>> >>> Would Unicode really want to get into the business of running a UFL >>> service? >>> >> >> I suspect both Eric and I may have been slightly tongue-in-cheek >> with respect to UFLs...
> > Actually, I was serious. > > Eric. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed May 27 02:53:52 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 27 May 2015 08:53:52 +0100 (BST) Subject: Tag characters In-Reply-To: References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> Message-ID: <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> Peter Constable wrote as follows: > Would Unicode really want to get into the business of running a UFL service? Well, Unicode is about precision, interoperability and long-term stability, and, given, in relation to one particular specified base character followed by some tag characters, that a particular sequence of Unicode characters is intended to lead to the display of an image representing a particular flag, it seems to me highly reasonable that the Unicode Technical Committee might seriously consider providing that facility. William Overington 27 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From moyogo at gmail.com Wed May 27 03:18:13 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 27 May 2015 08:18:13 +0000 Subject: =?UTF-8?Q?Re=3A_FYI=3A_The_world=E2=80=99s_languages=2C_in_7_maps_and_char?= =?UTF-8?Q?ts?= In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: The South China Morning Post published a similar infographic: A world of languages - and how many speak them http://www.scmp.com/infographics/article/1810040/infographic-world-languages -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Wed May 27 05:22:38 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 27 May 2015 12:22:38 +0200 Subject: =?UTF-8?Q?Re=3A_FYI=3A_The_world=E2=80=99s_languages=2C_in_7_maps_and_char?= =?UTF-8?Q?ts?= In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. Mark *« Il meglio è l'inimico del bene »* On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye wrote: > The South China Morning Post published a similar infographic: > A world of languages - and how many speak them > > http://www.scmp.com/infographics/article/1810040/infographic-world-languages > -------------- next part -------------- An HTML attachment was scrubbed... URL: From moyogo at gmail.com Wed May 27 09:59:37 2015 From: moyogo at gmail.com (Denis Jacquerye) Date: Wed, 27 May 2015 14:59:37 +0000 Subject: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: The data used to build the infographic comes from Ethnologue.com. 
http://www.ethnologue.com/language/deu does not indicate the Standard German L1 population in Austria and gives a population of 727,000 Standard German L1 speakers in Switzerland (the difference is counted as Swiss German L1 speakers). On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ wrote: > Hmmm. How accurate can it be? They forgot Austria, and got Switzerland > wrong by almost a power of 10. > > > Mark > > *« Il meglio è l'inimico del bene »* > > On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye > wrote: > >> The South China Morning Post published a similar infographic: >> A world of languages - and how many speak them >> >> http://www.scmp.com/infographics/article/1810040/infographic-world-languages >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From clarkcox3 at gmail.com Wed May 27 10:57:38 2015 From: clarkcox3 at gmail.com (clarkcox3 at gmail.com) Date: Wed, 27 May 2015 08:57:38 -0700 Subject: FYI: The world's languages, in 7 maps and charts In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com> Message-ID: <60B6D84E-453F-489E-9F16-8BEB2919833B@gmail.com> If the various Chinese languages/dialects are similar enough to be counted in a single category, then certainly Swiss German is similar enough to the German spoken in Germany and Austria to be counted in the same category. Sent from my iPhone > On May 27, 2015, at 07:59, Denis Jacquerye wrote: > > The data used to build the infographic comes from Ethnologue.com. > http://www.ethnologue.com/language/deu does not indicate the Standard German L1 population in Austria and gives a population of 727,000 Standard German L1 speakers in Switzerland (the difference is counted as Swiss German L1 speakers). > >> On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ wrote: >> Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. >> >> >> Mark >> >> « Il meglio è l'inimico del bene » >> >>> On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye wrote: >>> The South China Morning Post published a similar infographic: >>> A world of languages - and how many speak them >>> http://www.scmp.com/infographics/article/1810040/infographic-world-languages -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Wed May 27 11:10:46 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 27 May 2015 16:10:46 +0000 Subject: Tag characters In-Reply-To: <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> References: <20150520115753.665a7a7059d7ee80bb4d670165c8327d.dc2923e7dd.wbe@email03.secureserver.net> <555D65A5.4090705@efele.net> <555D69C5.9040901@ix.netcom.com> <6653799.7367.1432713232779.JavaMail.defaultUser@defaultHost> Message-ID: Well, the same reasoning could also argue for the contrapositive (a→b ⇒ ¬b→¬a): that UTC should not consider endorsing such a tag scheme. Peter From: William_J_G Overington [mailto:wjgo_10009 at btinternet.com] Sent: Wednesday, May 27, 2015 12:54 AM To: unicode at unicode.org; Peter Constable; eric.muller at efele.net; asmus-inc at ix.netcom.com Subject: Re: Tag characters Peter Constable wrote as follows: > Would Unicode really want to get into the business of running a UFL service?
From wjgo_10009 at btinternet.com  Wed May 27 11:26:07 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 27 May 2015 17:26:07 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
Message-ID: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost>

Tag characters and in-line graphics (from Tag characters)

This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice.

The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications.

The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character.

The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider that the characters listed are each the tag version of the character used here, and that they all as a group follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer.

Examples of displays: each example is left to right along the line, then lines down the page from upper to lower.

7r means 7 pixels red
7r5y means 7 pixels red then 5 pixels yellow
7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels blue

Examples of colours available:

k black
n brown
r red
o orange
y yellow
g green (0, 255, 0)
b blue
m magenta
e grey
w white
c cyan
p pink
d dark grey
i light grey (thus avoiding using lowercase l so as to avoid confusion with figure 1)
f deeper green (foliage colour) (0, 128, 0)

Next line request:

- moves to the next line

Local palette requests:

192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64)
7,2u means 7 pixels using local palette colour 2

Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:

3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels
3h means here local glyph 3 is being used

The above is for bitmaps. It would be possible to use a similar technique to specify a vector glyph as used in fontmaking, using on-curve and off-curve points specified as X, Y coordinates together with N for on-curve and F for off-curve. There would need to be a few other commands so as to specify places in the tag character stream where the definition of a contour starts, so as to separate the definitions of the glyphs for a colour font, and so on. This could be made OpenType compatible so that a received glyph could be added into a font.

Please feel free to suggest improvements. One improvement could be as to how to build a Unicode code point into a picture so that a font could be transmitted.

William Overington

27 May 2015
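To make the run-length notation above concrete, here is a minimal decoder sketch in Python for the simplest subset of the scheme (run counts, single-letter colours, and '-' as the next-line request); the palette and glyph-memory commands are omitted, and nothing here is part of any actual proposal or implementation:

    import re

    # Single-letter colour codes from the proposal above.
    COLOURS = {"k": "black", "n": "brown", "r": "red", "o": "orange",
               "y": "yellow", "g": "green", "b": "blue", "m": "magenta",
               "e": "grey", "w": "white", "c": "cyan", "p": "pink",
               "d": "dark grey", "i": "light grey", "f": "foliage green"}

    def decode(runs):
        """Decode e.g. '7r5y-3b' into rows of (pixel count, colour) runs."""
        rows, row = [], []
        for count, colour, newline in re.findall(r"(\d+)([a-z])|(-)", runs):
            if newline:          # '-' is the next-line request
                rows.append(row)
                row = []
            else:
                row.append((int(count), COLOURS[colour]))
        rows.append(row)
        return rows

    print(decode("7r5y-3b"))
    # [[(7, 'red'), (5, 'yellow')], [(3, 'blue')]]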
From doug at ewellic.org  Wed May 27 12:06:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 10:06:41 -0700
Subject: RE: Tag characters and in-line graphics (from Tag characters)
Message-ID: <20150527100641.665a7a7059d7ee80bb4d670165c8327d.9c484cc1df.wbe@email03.secureserver.net>

William_J_G Overington wrote:

> Please feel free to suggest improvements.

http://en.wikipedia.org/wiki/Scalable_Vector_Graphics

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Wed May 27 12:49:31 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 10:49:31 -0700
Subject: RE: Tag characters
Message-ID: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net>

On Tuesday, May 19, Mark Davis ☕️ wrote:

> A more concrete proposal will be in a PRI to be issued soon,

If the new mechanism is intended "for Unicode 8.0," as stated in the minutes at http://www.unicode.org/L2/L2015/15107.htm#143-M1 ...

... and if Unicode 8.0 is "planned for release in June, 2015," as stated on the Beta Review page...

... and if June 2015 starts in less than a week...

... shouldn't we be seeing that PRI real soon now?

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From kenwhistler at att.net  Wed May 27 13:08:44 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 27 May 2015 11:08:44 -0700
Subject: Re: Tag characters
In-Reply-To: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net> References: <20150527104931.665a7a7059d7ee80bb4d670165c8327d.7f06f3d380.wbe@email03.secureserver.net>
Message-ID: <5566082C.4060904@att.net>

Doug,

Read on in the minutes to the next day. 143-C27 and related actions.

There are a few things to keep in mind here.

1. The un-deprecation of the tags U+E0020..U+E007E *is* part of the UCD for Unicode 8.0. The change has already taken place in the revised beta files now posted (see PropList.txt), and will be part of the 8.0 release next month.

2. UTR #51, while scheduled to come out at the same time as the Unicode 8.0 release, is a UTR and is not formally either a part of the Unicode Standard per se, nor a formal part of the Unicode 8.0 release.

3. As per the minutes, when the approved version of UTR #51 is first published, more or less simultaneously with the Unicode 8.0 release (and explaining other aspects of emoji related to the release, such as the use of emoji modifiers), it will *not* yet contain the flag-tag discussion and mechanism.

4. Once the PRI is up, it will be used as the basis for the next proposed update of UTR #51. And the review of that proposed update and publication of the *subsequent* revision of UTR #51 need not wait for the next Unicode release (9.0 in summer, 2016). So at that point, the flag-tag mechanism will be available for use *with* Unicode 8.0 -- it just won't be a formal part of the release per se.

Clear?

--Ken

On 5/27/2015 10:49 AM, Doug Ewell wrote:

> ... shouldn't we be seeing that PRI real soon now?
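Ken's point 1 can be checked directly against the beta data files. A minimal sketch, assuming a locally downloaded copy of PropList.txt from the UCD (data lines there have the form "codepoint(s) ; property # comment"):

    # Sketch: check whether the tag characters U+E0020..U+E007E are still
    # listed with the Deprecated property in a local copy of PropList.txt.

    def deprecated_ranges(path="PropList.txt"):
        ranges = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#")[0].strip()   # drop trailing comments
                if ";" not in line:
                    continue
                fields, prop = line.split(";")[0], line.split(";")[1]
                if prop.strip() != "Deprecated":
                    continue
                lo, _, hi = fields.strip().partition("..")
                ranges.append((int(lo, 16), int(hi or lo, 16)))
        return ranges

    deprecated = deprecated_ranges()
    still_deprecated = all(any(lo <= cp <= hi for lo, hi in deprecated)
                           for cp in range(0xE0020, 0xE007F))
    print("tag characters still deprecated:", still_deprecated)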
From doug at ewellic.org  Wed May 27 14:06:26 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 27 May 2015 12:06:26 -0700
Subject: RE: Tag characters
Message-ID: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net>

Ken Whistler wrote:

> Read on in the minutes to the next day. 143-C27 and related actions.

Ah. Thank you. Now I understand what Steven meant by "read the minutes," too.

That's the problem with reading individual items in meeting minutes: each item is a snapshot in time, and the next day of the meeting might have brought no change, or a big change.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From mark at macchiato.com  Wed May 27 14:10:53 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 27 May 2015 21:10:53 +0200
Subject: Re: FYI: The world's languages, in 7 maps and charts
In-Reply-To: References: <281809194-1431470843-cardhu_decombobulator_blackberry.rim.net-1396854058-@b13.c4.bise6.blackberry> <000401d08d5e$a811de90$f8359bb0$@gmail.com>
Message-ID:

I think it gives a misleading picture to only include mother-language speakers, rather than all languages (at a reasonable level of fluency). Every Swiss German is fluent in High German.

Part of the problem is that it is very hard to get good data on the multiple languages that people speak (a huge number of people are fluent in more than one) and on the level of fluency in each. That alone makes it difficult to do accurate representations. That level of accuracy may not be necessary to get a general picture, but when the map purports to go into great detail...

Mark

*« Il meglio è l'inimico del bene »*

On Wed, May 27, 2015 at 4:59 PM, Denis Jacquerye wrote:

> The data used to build the infographic comes from Ethnologue.com.
> http://www.ethnologue.com/language/deu does not indicate the Standard
> German L1 population in Austria and gives a population of 727,000 Standard
> German L1 speakers in Switzerland (the difference is counted as Swiss
> German L1 speakers).

From jimbreen at gmail.com  Wed May 27 18:15:28 2015
From: jimbreen at gmail.com (Jim Breen)
Date: Thu, 28 May 2015 09:15:28 +1000
Subject: Re: FYI: The world's languages, in 7 maps and charts
Message-ID:

"Mark Davis" wrote:

>> Hmmm. How accurate can it be? They forgot Austria, and got Switzerland
>> wrong by almost a power of 10.

I was a little surprised to see only 15.6 million Australians speak English, which led me to wonder what the other 8 million of us speak. I see that the Ethnologue site they used quotes the 2006 Australian census as saying the population was 15.6 million.
I can't imagine where they got that, as that census reported the population as being just under 20 million. The 2011 census recorded the population at 21.7 million. I guess if they are prone to using inaccurate data from old sources, it explains some of the other oddities in that map.

Jim Breen

--
Jim Breen
Adjunct Snr Research Fellow, Japanese Studies Centre, Monash University

From mark at kli.org  Wed May 27 18:41:39 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 27 May 2015 19:41:39 -0400
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost>
Message-ID: <55665633.8040503@kli.org>

I think I've figured out the philosophy WJGO is trying to follow here. "We should have a way to encode graphics in Unicode." "We should have a way to encode programming instructions in Unicode." How about "We should have a way to encode sound-waves in Unicode"? Or "We should have a way to encode *moving* graphics, maybe with sound, in Unicode"? Now, he didn't say the last two, in fairness to him. But I think that's the thinking.

WJGO, not *everything* computers do has to be part of Unicode. Doing so essentially makes *everything* that wants to support "Unicode" have to be... well, pretty much *everything* all other computers are. We have graphics formats that encode graphics; they're *good* at it. They're made for it. We have sound formats for encoding sounds. We have various bytecodes for programming -- different ones, written by different people, that do things in different ways, because one size does not fit all. Unicode can't be the one size. It was never intended to. Don't make Unicode into an operating system, or worse, THE operating system. It's a character encoding. For encoding characters.

~mark

On 05/27/2015 12:26 PM, William_J_G Overington wrote:

> This document suggests a way to use the method of a base character
> together with tag characters to produce a graphic. The approach is
> theoretical and has not, at this time, been tried in practice.

From srl at icu-project.org  Wed May 27 22:04:21 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Wed, 27 May 2015 22:04:21 -0500
Subject: Re: Tag characters
In-Reply-To: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net> References: <20150527120626.665a7a7059d7ee80bb4d670165c8327d.ff6d41f607.wbe@email03.secureserver.net>
Message-ID: <5AC7D996-8BA2-4F31-9BD8-5B8B18026C96@icu-project.org>

Thanks, Ken; and yes, Doug: http://www.unicode.org/L2/L2015/15107.htm#143-C27 was the reference I was looking for when I wrote my too-brief reply earlier. My apologies.

S

Sent from our iPhone.

> On May 27, 2015, at 2:06 PM, Doug Ewell wrote:
>
> Ah. Thank you. Now I understand what Steven meant by "read the minutes,"
> too.
From wjgo_10009 at btinternet.com  Thu May 28 06:50:09 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 28 May 2015 12:50:09 +0100 (BST)
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <55665633.8040503@kli.org> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org>
Message-ID: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>

Responding to Mark E. Shoulson:

The big advantage of this new format is that the result is an unambiguous Unicode plain text file, and it could be placed within a file of plain text without having to make the whole document a markup file in some format. Plain text is the key advantage.

The following may be useful as a guide to the original problem that I am trying to solve.

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

I tried to apply the brilliant new "base character followed by tag characters" format to the problem.

In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.

William Overington

28 May 2015

From idou747 at gmail.com  Wed May 27 23:48:23 2015
From: idou747 at gmail.com (Chris)
Date: Thu, 28 May 2015 14:48:23 +1000
Subject: Arrow dingbats
Message-ID:

Unicode has the arrow dingbats ⬅⬆⬇⬈⬉⬊⬋ in the range 2B05, with names like "LEFTWARDS BLACK ARROW". Conspicuously missing is the right arrow.

The closest one can find is 27A1 "BLACK RIGHT ARROW" ➡. But everywhere I can see that has this arrow, it looks a lot different to the other arrows, with a narrower body and head.

Whose fault is this, and who will fix it?

From doug at ewellic.org  Thu May 28 09:53:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 07:53:42 -0700
Subject: "Unicode of Death"
Message-ID: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>

Unicode is in the news today as some folks with waaay too much time on their hands have discovered a string consisting of Latin, Arabic, Devanagari, and CJK characters that crashes Apple devices when it appears as a pop-up message.

Although most people seem to identify it correctly as a CoreText bug, there are a handful, as you might expect, who attribute it to some shady weirdness in Unicode itself. My favorite quote from a Reddit user was this:

"Every character you use has a unicode value which tells your phone what to display. One of the unicode values is actually never-ending and so when the phone tries to read it it goes into an infinite loop which crashes it."

I've read TUS Chapter 4 and UTR #23 and I still can't find the "never-ending" Unicode property.

Perhaps astonishingly to some, the string displays fine on all my Windows devices. Not all apps get the directionality right, but no crashes.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Thu May 28 10:03:41 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 08:03:41 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net>

Chris wrote:

> The closest one can find is 27A1 "BLACK RIGHT ARROW" ➡. But everywhere
> I can see that has this arrow, it looks a lot different to the other
> arrows, with a narrower body and head.
>
> Whose fault is this, and who will fix it?

U+2B95 RIGHTWARDS BLACK ARROW ⮕ might be a better fit.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
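For reference, the formal UCD names of the arrows under discussion can be listed programmatically. A small sketch (it requires a Python build whose unicodedata tables are at least Unicode 7.0, since U+2B95 was only added then); the name asymmetry between 27A1 and 2B95 is exactly the point of the thread:

    import unicodedata

    # The black arrows discussed in this thread: the U+2B05..U+2B07 set,
    # the dingbat at U+27A1, and the Unicode 7.0 addition U+2B95.
    for cp in (0x2B05, 0x2B06, 0x2B07, 0x27A1, 0x2B95):
        print(f"U+{cp:04X}  {chr(cp)}  {unicodedata.name(chr(cp))}")

    # U+2B05  ⬅  LEFTWARDS BLACK ARROW
    # U+2B06  ⬆  UPWARDS BLACK ARROW
    # U+2B07  ⬇  DOWNWARDS BLACK ARROW
    # U+27A1  ➡  BLACK RIGHTWARDS ARROW
    # U+2B95  ⮕  RIGHTWARDS BLACK ARROW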
From boldewyn at gmail.com  Thu May 28 10:34:53 2015
From: boldewyn at gmail.com (Manuel Strehl)
Date: Thu, 28 May 2015 17:34:53 +0200
Subject: Re: Arrow dingbats
In-Reply-To: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net> References: <20150528080341.665a7a7059d7ee80bb4d670165c8327d.212293419c.wbe@email03.secureserver.net>
Message-ID:

Interesting! Out of curiosity: how come this was recognized in Unicode 7? Is that documented anywhere?

2015-05-28 17:03 GMT+02:00 Doug Ewell :

> U+2B95 RIGHTWARDS BLACK ARROW ⮕ might be a better fit.

From timothy at greenwood.name  Thu May 28 10:47:10 2015
From: timothy at greenwood.name (Tim Greenwood)
Date: Thu, 28 May 2015 15:47:10 +0000
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net> References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

Must be that same evil Unicode consortium that is destroying civilization by inventing emoji. The Guardian article has been edited since yesterday, when it did actually claim that Unicode invented all emoji.

http://gu.com/p/4997q

On Thu, May 28, 2015 at 11:04 AM Doug Ewell wrote:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.
From shervinafshar at gmail.com  Thu May 28 11:06:01 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 09:06:01 -0700
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net> References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.

We should be thankful to those folks with "waaay too much time on their hands" for discovering these for us all.

> Although most people seem to identify it correctly as a CoreText bug,

Any good technical write-up about this?

– Shervin

From verdy_p at wanadoo.fr  Thu May 28 11:12:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 18:12:25 +0200
Subject: Re: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID:

There's no advantage, because what you want to create is effectively another markup language with its own syntax, but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, and still leaving out many things that are specific needs for images (e.g. colorimetry profiles, aspect ratio of pixels in bitmaps, undesired effects that must be controlled, such as moiré artefacts).

You don't need new characters to create a markup language and its syntax. Today the world goes very well with HTML(5), which is now the best markup language for documents (including for inserting embedded images that don't require any external request, or embedding special effects on images, such as animation or dynamic layouts for adapting the document to the rendering device, with the help of CSS and JavaScript, which are also embeddable).

At least with HTML5 they don't try to reinvent the image formats, and there's ample space for supporting multiple image formats tuned for specific needs (e.g.
JPEG, PNG, GIF, SVG, TIFF...), including animation and video, and synchronization of images and audio in time for videos, or with user interactions. These are designed separately and benefit from patient research made over a long time (your desired format, still undocumented, is largely below the level needed for images, independently of the markup syntax you want to create to support them, and independently of the fact that you also want to encode these syntactic elements with new characters, something that is absolutely not needed for any markup language).

In summary, you are reinventing the wheel.

2015-05-28 13:50 GMT+02:00 William_J_G Overington :

> The big advantage of this new format is that the result is an unambiguous
> Unicode plain text file, and it could be placed within a file of plain text
> without having to make the whole document a markup file in some format.

From doug at ewellic.org  Thu May 28 11:16:31 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 09:16:31 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528091631.665a7a7059d7ee80bb4d670165c8327d.c67f478dad.wbe@email03.secureserver.net>

Manuel Strehl wrote:

> Interesting! Out of curiosity: How come this was recognized in Unicode
> 7? Is that documented anywhere?

NamesList.txt contains this entry for the left arrow:

2B05	LEFTWARDS BLACK ARROW
	x (black rightwards arrow - 27A1)
	x (rightwards black arrow - 2B95)

I don't know how U+2B95 came to be encoded in 7.0 when all of the similar U+2B0x arrows had been in place since 4.0. Presumably, before then, it was felt that U+27A1 was an appropriate fit, though as Chris idou747 pointed out, not all fonts show perfect symmetry here.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From doug at ewellic.org  Thu May 28 11:18:16 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 09:18:16 -0700
Subject: RE: "Unicode of Death"
Message-ID: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net>

Shervin Afshar wrote:

> Any good technical write-up about this?

Haven't seen one yet. Just a lot of "OMG, look at this" so far.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From shervinafshar at gmail.com  Thu May 28 12:13:16 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 10:13:16 -0700
Subject: Re: "Unicode of Death"
In-Reply-To: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net> References: <20150528091816.665a7a7059d7ee80bb4d670165c8327d.955113905b.wbe@email03.secureserver.net>
Message-ID:

I'm no iOS dev, but it seems like CoreText is trying[1] to truncate text for SpringBoard (to shorten it with ellipses to fit the notification box) and it crashes and burns with a segmentation fault[2].

FWIW, Reddit abides[3][4] and reacts with "Unicode Suppressor"[5]... heh... as if!

[1]: http://pastebin.com/cQyQE7Ws
[2]: http://stackoverflow.com/questions/12601286/i-am-getting-a-lot-of-sigsegv-exception-in-my-ios-app-crash-report-and-that-too
[3]: https://www.reddit.com/r/apple/comments/37e8c1/malicious_text_message/crm4h4x
[4]: http://www.reddit.com/r/iphone/comments/37eaxs/um_can_someone_explain_this_phenomenon/crm3adg
[5]: https://www.myrepospace.com/profile/effective/688319/Unicode_Suppresor

– Shervin
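The general hazard behind a bug like this is truncating a string at an arbitrary code point, which can cut a combining sequence in half. A minimal sketch of a safer ellipsis truncation (illustrative only, not Apple's code; a real implementation should truncate at UAX #29 grapheme cluster boundaries, for example with ICU's BreakIterator):

    import unicodedata

    def truncate(text, limit):
        """Truncate to at most `limit` code points plus an ellipsis,
        backing up so no combining marks are stranded at the cut."""
        if len(text) <= limit:
            return text
        cut = limit
        # General category 'M*' = combining marks; don't separate them
        # from their base character.
        while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
            cut -= 1
        return text[:cut] + "\u2026"

    s = "ni\u0303o"        # "niño" spelled with a combining tilde
    print(truncate(s, 2))  # "n…", not "ni…" plus a stranded tilde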
From andrewcwest at gmail.com  Thu May 28 14:13:02 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 28 May 2015 20:13:02 +0100
Subject: Re: Arrow dingbats
Message-ID:

On 28 May 2015 at 05:48, Chris wrote:

> Unicode has the arrow dingbats ⬅⬆⬇⬈⬉⬊⬋ in the range 2B05, with names
> like "LEFTWARDS BLACK ARROW". Conspicuously missing is the right arrow.
>
> Whose fault is this,

The three left/up/downwards black arrows were added at the request of North Korea, so I guess you can blame Kim Jong-Il for the missing rightwards arrow... perhaps the North Korean army never went to the right.

> and who will fix it?

It was fixed in Unicode 7.0 last year with the addition of U+2B95 RIGHTWARDS BLACK ARROW. Of course, it may not be fixed for you and other users unless you have a font installed that supports all the arrows in a consistent style.

I don't know why the character was added in 7.0, but it may have been prompted by the same question as yours that was asked on this list in 2013.

Andrew

From verdy_p at wanadoo.fr  Thu May 28 14:46:55 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 21:46:55 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe)?

I'm looking for the symbol itself, not the color, or the form of the sign.

For example, blue pistes in Europe are signaled by a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such a "black" diamond in Unicode.

But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image?).

From shervinafshar at gmail.com  Thu May 28 14:59:55 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 12:59:55 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Single and double diamond?
https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg
http://1.bp.blogspot.com/_2Rc9ifOGLYg/TO5fF0XNTSI/AAAAAAAAIxE/RJPvVDD6gLM/s1600/caution-double-black-diamond.jpg
http://thumbs.dreamstime.com/z/double-black-diamond-sign-legend-ski-slopes-map-40955860.jpg

– Shervin

On Thu, May 28, 2015 at 12:46 PM, Philippe Verdy wrote:

> Is there a symbol that can represent the "Bunny hill" symbol used in North
> America and some other American territories with mountains, to designate
> the ski pistes open to novice skiers (those pistes are signaled with green
> signs in Europe)?

From leoboiko at namakajiri.net  Thu May 28 15:02:07 2015
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Thu, 28 May 2015 17:02:07 -0300
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING UPWARD POINTING TRIANGLE, and pretend the triangle is a hill: 🐇⃤

If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW CAPPED MOUNTAIN. Or anything else.

2015-05-28 16:46 GMT-03:00 Philippe Verdy :

> But I can't find an equivalent to the American "Bunny hill" signal,
> equivalent to green pistes in Europe (this is a problem for webpages
> related to skiing: do we have to embed an image?).

From Shawn.Steele at microsoft.com  Thu May 28 15:04:11 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 20:04:11 +0000
Subject: RE: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

So is double black diamond a separate symbol? Or just two of the black diamond?

And Blue-Black?

I'm drawing a blank on a specific bunny sign; in my experience those are usually just green.

Aren't there a lot of cartography symbols for various systems that aren't present in Unicode?
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices

Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe)?

From verdy_p at wanadoo.fr  Thu May 28 15:03:43 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:03:43 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Well, also these symbols, if you want (these are not really "diamonds"), but the wordpress page forgets the "bunny hill". It starts only with the green circle (in fact a black disc colored in green), which maps to blue pistes in Europe.

2015-05-28 21:59 GMT+02:00 Shervin Afshar :

> Single and double diamond?
>
> https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg
From verdy_p at wanadoo.fr  Thu May 28 15:10:23 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:10:23 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

A single "black diamond" symbol would be sufficient, I think (in fact a black square rotated 45°, not the same as the symbol from card decks, which typically has borders rounded inward).

The effective color does not really matter here; it can be generated by styling the text, something necessary anyway with the European piste colors, which don't use any specific symbol, but signs that are most frequently circular, or sometimes shaped as squares or "diamonds". So for the "black diamond" it just means that this is a symbol fully filled with the text color (like other Unicode characters named "BLACK").

2015-05-28 22:04 GMT+02:00 Shawn Steele :

> So is double black diamond a separate symbol? Or just two of the black
> diamond?

From verdy_p at wanadoo.fr  Thu May 28 15:11:26 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:11:26 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Very poor suggestion, I think. This is a single symbol by itself.

2015-05-28 22:02 GMT+02:00 Leonardo Boiko :

> You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING
> UPWARD POINTING TRIANGLE, and pretend the triangle is a hill: 🐇⃤
From shervinafshar at gmail.com  Thu May 28 15:11:12 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 13:11:12 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Well... to pick the nit, these shapes are rhombi, known colloquially as "diamonds".

So what's the symbol for "bunny hill" in Europe?

– Shervin

On Thu, May 28, 2015 at 1:03 PM, Philippe Verdy wrote:

> Well, also these symbols, if you want (these are not really "diamonds"),
> but the wordpress page forgets the "bunny hill".

From verdy_p at wanadoo.fr  Thu May 28 15:16:01 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 22:16:01 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

I said it: there's no symbol in Europe for pistes, just colors. The American "Bunny hill" maps to "green" pistes in Europe (the European piste colors are used also for drawing their ways on maps, not just found in signage).
Piste signs are typically all the same shape in the same station (most often discs), and the text on them (if present) shows the name or number of the piste in the station, or just an arrow showing the direction to follow.

2015-05-28 22:11 GMT+02:00 Shervin Afshar :

> Well... to pick the nit, these shapes are rhombi, known colloquially as
> "diamonds".
>
> So what's the symbol for "bunny hill" in Europe?

From shervinafshar at gmail.com  Thu May 28 15:25:02 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 13:25:02 -0700
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

Makes sense. But it doesn't seem like we need any new symbols. I think one of these should do for hard and extra-hard slopes:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aname%3D%2FDIAMOND%2F%3A%5D&g=

Also, I'm not at all against making use of the actual 🐇 we have. I will not hold my breath for a combining rabbit symbol, though.

– Shervin

On Thu, May 28, 2015 at 1:16 PM, Philippe Verdy wrote:

> I said it: there's no symbol in Europe for pistes, just colors. The
> American "Bunny hill" maps to "green" pistes in Europe.
>> >> So what's the symbol for "bunny hill" in Europe? >> >> ? Shervin >> >> On Thu, May 28, 2015 at 1:03 PM, Philippe Verdy >> wrote: >> >>> Well also these symbols, if you want (these are not really "diamonds"), >>> but the wordpress page forgets the "bunny hill". It starts only with the >>> green circle (in fact a black disc colored in green) which maps to blue >>> pistes in Europe. >>> >>> 2015-05-28 21:59 GMT+02:00 Shervin Afshar : >>> >>>> Single and double diamond? >>>> >>>> https://bbliss176.files.wordpress.com/2011/02/symbols2_jpg.jpg >>>> >>>> http://1.bp.blogspot.com/_2Rc9ifOGLYg/TO5fF0XNTSI/AAAAAAAAIxE/RJPvVDD6gLM/s1600/caution-double-black-diamond.jpg >>>> >>>> http://thumbs.dreamstime.com/z/double-black-diamond-sign-legend-ski-slopes-map-40955860.jpg >>>> >>>> >>>> ? Shervin >>>> >>>> On Thu, May 28, 2015 at 12:46 PM, Philippe Verdy >>>> wrote: >>>> >>>>> Is there a symbol that can represent the "Bunny hill" symbol used in >>>>> North America and some other American territories with mountains, to >>>>> designate the ski pistes open to novice skiers (those pistes are signaled >>>>> with green signs in Europe). >>>>> >>>>> I'm looking for the symbol itself, not the color, or the form of the >>>>> sign. >>>>> >>>>> For example blue pistes in Europe are designed with a green circle in >>>>> America, but we have a symbol for the circle; red pistes in Europe are >>>>> signaled by a blue square in America, but we have a symbol for the square; >>>>> black pistes in Europe are signaled by a black diamond in America, but we >>>>> also have such "black" diamond in Unicode. >>>>> >>>>> But I can't find an equivalent to the American "Bunny hill" signal, >>>>> equivalent to green pistes in Europe (this is a problem for webpages >>>>> related to skiing: do we have to embed an image ?). >>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f407.png Type: image/png Size: 1902 bytes Desc: not available URL: From leoboiko at namakajiri.net Thu May 28 15:33:40 2015 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 28 May 2015 17:33:40 -0300 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: Message-ID: Serious question: Has someone discussed a generic combining mechanism? I mean, characters with an effect like "combine the last two". Say, '!' + '?' + COMBINING OVERLAY = '?'. '!' + '!' + COMBINING SIDE BY SIDE = '?', and so on. Similar in spirit to the Ideographic Description Characters, but meant to actually tell the rendering system to combine stuff. 2015-05-28 17:25 GMT-03:00 Shervin Afshar : > Makes sense. But it doesn't seem like we need any new symbols. I think one > of these should do for hard and extra-hard slopes: > > > http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3Aname%3D%2FDIAMOND%2F%3A%5D&g= > > Also, I'm not at all against making use of the actual [image: ??]we have. > I will not hold my breath for a combining rabbit symbol though. > > ? Shervin > > On Thu, May 28, 2015 at 1:16 PM, Philippe Verdy > wrote: > >> I saif it: there's no symbol in Europe for pistes, just colors. The >> American "Bunny hill" maps to "green" pistes in Europe. >> (the European piste colors are used also for drawing their ways on maps, >> not just found in signages). 
From doug at ewellic.org  Thu May 28 15:44:22 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 13:44:22 -0700
Subject: RE: Arrow dingbats
Message-ID: <20150528134422.665a7a7059d7ee80bb4d670165c8327d.cf04b7950e.wbe@email03.secureserver.net>

Andrew West wrote:

> I don't know why the character was added in 7.0, but it may have been
> prompted by the same question as yours that was asked on this list in
> 2013.

And the answer, from Michel Suignard in http://www.unicode.org/mail-arch/unicode-ml/y2013-m10/0079.html :

> Rejoice!
> Added in 2B95 in Unicode 7.0
>
> (was added when the Wingdings set was added with Amendment 1 of
> 10646:2012, part of the target set for 7.0)

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From doug at ewellic.org  Thu May 28 15:59:03 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 28 May 2015 13:59:03 -0700
Subject: RE: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>

http://www.signsofthemountains.com/what-do-the-symbols-on-ski-trail-signs-mean-d/
http://news.outdoortechnology.com/2015/02/04/ski-slope-rating-symbols-mean-really-mean/

Looks like a green circle is the symbol for a beginner slope. (The first link also shows that "piste" is the European word for what we call a trail, run, or slope.) There is no difference between a "bunny slope" and a "beginner" or "novice" slope.

Unicode has some suitable filled circles (particularly U+2B24 and U+25CF), and it has a green apple, heart, and book, but as yet no green circle.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From verdy_p at wanadoo.fr  Thu May 28 16:00:29 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:00:29 +0200
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
Message-ID:

What you'd like is in fact similar to the zero-width joiner, between two combining sequences, to make them overlap. A sort of "negative-width" joiner that we could call an "overlay joiner". So '!' + OVERLAY JOINER + '?' = '‽'.

But in legacy charsets, this role was encoded as a BACKSPACE control (it was used to produce combining accents as well, by combining a letter and a *spacing* accent), and I think it is still a solution for the same problem without needing a new character. So '!' + BACKSPACE + '?' = '‽'.

2015-05-28 22:33 GMT+02:00 Leonardo Boiko :

> Serious question: has someone discussed a generic combining mechanism? I
> mean, characters with an effect like "combine the last two".
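The legacy overstrike convention is easy to illustrate. A hypothetical post-processor sketch (the overstrike table here is invented for the example, with only the '!' over '?' pair mapped to U+203D INTERROBANG):

    # Legacy line-printer convention: X BACKSPACE Y meant "print Y over X".
    # This hypothetical post-processor folds known overstrike pairs into
    # single Unicode characters.
    OVERSTRIKES = {frozenset("!?"): "\u203D"}   # '!' over '?' = INTERROBANG

    def resolve_overstrikes(text):
        out = []
        i = 0
        while i < len(text):
            if text[i] == "\b" and out and i + 1 < len(text):
                base = out.pop()
                pair = frozenset((base, text[i + 1]))
                out.append(OVERSTRIKES.get(pair, base))
                i += 2
            else:
                out.append(text[i])
                i += 1
        return "".join(out)

    print(resolve_overstrikes("!\b?"))   # prints: ‽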
From billposer2 at gmail.com Thu May 28 16:01:55 2015
From: billposer2 at gmail.com (Bill Poser)
Date: Thu, 28 May 2015 14:01:55 -0700
Subject: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

No doubt the evil Unicode Consortium is in league with the Trilateral
Commission, the Elders of Zion, and the folks at NASA who faked the moon
landing.... :)

On Thu, May 28, 2015 at 7:53 AM, Doug Ewell wrote:

> Unicode is in the news today as some folks with waaay too much time on
> their hands have discovered a string consisting of Latin, Arabic,
> Devanagari, and CJK characters that crashes Apple devices when it
> appears as a pop-up message.
>
> Although most people seem to identify it correctly as a CoreText bug,
> there are a handful, as you might expect, who attribute it to some shady
> weirdness in Unicode itself. My favorite quote from a Reddit user was
> this:
>
> "Every character you use has a unicode value which tells your phone what
> to display. One of the unicode values is actually never-ending and so
> when the phone tries to read it it goes into an infinite loop which
> crashes it."
>
> I've read TUS Chapter 4 and UTR #23 and I still can't find the
> "never-ending" Unicode property.
>
> Perhaps astonishingly to some, the string displays fine on all my
> Windows devices. Not all apps get the directionality right, but no
> crashes.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From Shawn.Steele at microsoft.com Thu May 28 16:07:11 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:07:11 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <55677762.3060805@oracle.com>
References: <55677762.3060805@oracle.com>
Message-ID:

I'm wondering if it's a regional thing; I haven't seen it, at least in
the mostly-western parts of North America. An east coast thing?

From: Jim Melton [mailto:jim.melton at oracle.com]
Sent: Thursday, May 28, 2015 1:16 PM
To: Shawn Steele
Cc: verdy_p at wanadoo.fr; unicode Unicode Discussion
Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices

I no longer ski, but I did so for many years, mostly (but not exclusively)
in the western United States. I never encountered, at any USA ski
hill/mountain/resort, a special symbol for "bunny hills", which are
typically represented by the green circle meaning "beginner". That's
anecdotal evidence at best, but my observations cover numerous skiing
sites. I have encountered such a symbol in Europe and in New Zealand, but
not in the USA. (I have not had the pleasure of skiing in Canada and am
thus unable to speak about ski areas in that country.)

The double black diamond would appear to be a unique symbol worthy of
encoding, simply because the only valid typographical representation (in
the USA) is two single black diamonds stacked one above the other and
touching at the points.

Hope this helps,
Jim

On 5/28/2015 2:04 PM, Shawn Steele wrote:

So is double black diamond a separate symbol? Or just two of the black
diamond?

And Blue-Black?

I'm drawing a blank on a specific bunny sign; in my experience those are
usually just green.

Aren't there a lot of cartography symbols for various systems that aren't
present in Unicode?

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: Thursday, May 28, 2015 12:47 PM
To: unicode Unicode Discussion
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices

Is there a symbol that can represent the "Bunny hill" symbol used in
North America and some other American territories with mountains, to
designate the ski pistes open to novice skiers (those pistes are signaled
with green signs in Europe)?

I'm looking for the symbol itself, not the color, or the form of the sign.

For example, blue pistes in Europe are designated with a green circle in
America, but we have a symbol for the circle; red pistes in Europe are
signaled by a blue square in America, but we have a symbol for the square;
black pistes in Europe are signaled by a black diamond in America, but we
also have such a "black" diamond in Unicode.

But I can't find an equivalent to the American "Bunny hill" signal,
equivalent to green pistes in Europe (this is a problem for webpages
related to skiing: do we have to embed an image?).

--
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)    Phone: +1.801.942.0144
Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG    Fax  : +1.801.942.3345
Oracle Corporation       Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive     Alternate email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com
========================================================================
= Facts are facts.  But any opinions expressed are the opinions       =
= only of myself and may or may not reflect the opinions of anybody   =
= else with whom I may or may not have discussed the issues at hand.  =
========================================================================

From idou747 at gmail.com Thu May 28 16:08:24 2015
From: idou747 at gmail.com (Chris)
Date: Fri, 29 May 2015 07:08:24 +1000
Subject: Arrow dingbats
In-Reply-To:
References:
Message-ID:

So it sounds like 27A1 came first. Then 2B05 etc. was added to complete
the set with 27A1, except that it didn't complete the set, because nobody
aligned the glyphs. Then they added U+2B95 in a second attempt to complete
the set? (Why not just fix the old arrow?) Except that nobody seems to
have U+2B95 aligned either. On unicode-table.com it looks totally
different, and Mac doesn't even have it.

Is there any hope this will actually fix it? Has the Unicode Consortium
made it clear to one and all that U+2B95 is supposed to align?

> On 29 May 2015, at 5:13 am, Andrew West wrote:
>
> On 28 May 2015 at 05:48, Chris wrote:
>>
>> Unicode has the arrow dingbats in the range 2B05 with names like
>> 'LEFTWARDS BLACK ARROW'; conspicuously missing is the right arrow.
>>
>> But everywhere I can see that has this arrow, it looks a lot different
>> to the other arrows, with a narrower body and head.
>>
>> Whose fault is this,
>
> The three left/up/downwards black arrows were added at the request of
> North Korea, so I guess you can blame Kim Jong-Il for the missing
> rightwards arrow ... perhaps the North Korean army never went to the
> right.
>
>> and who will fix it?
>
> It was fixed in Unicode 7.0 last year with the addition of U+2B95
> RIGHTWARDS BLACK ARROW. Of course, it may not be fixed for you and
> other users unless you have a font installed that supports all the
> arrows in a consistent style.
>
> I don't know why the character was added in 7.0, but it may have been
> prompted by the same question as yours that was asked on this list in
> 2013.
>
> Andrew

From verdy_p at wanadoo.fr Thu May 28 16:11:49 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:11:49 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <55677762.3060805@oracle.com>
References: <55677762.3060805@oracle.com>
Message-ID:

Some documentation also suggests that the two diamonds are not stacked one
above the other, but placed horizontally. It's a good point for using only
one symbol, encoding it twice in plain text if needed.
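That plain-text approach is easy to demonstrate with existing characters.
In the Python sketch below, the specific code-point choices are
illustrative only (Doug Ewell's U+2B24 would serve as well as U+25CF for
the circle); the double black diamond is simply the single diamond
encoded twice:

    # Plain-text stand-ins for the North American trail-rating symbols;
    # the color (green/blue/black) is left to styling, per this thread.
    RATINGS = {
        "beginner":     "\u25cf",        # U+25CF BLACK CIRCLE
        "intermediate": "\u25a0",        # U+25A0 BLACK SQUARE
        "advanced":     "\u25c6",        # U+25C6 BLACK DIAMOND
        "expert":       "\u25c6\u25c6",  # the diamond, encoded twice
    }
    for level, symbol in RATINGS.items():
        print(f"{symbol}\t{level}")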
From Shawn.Steele at microsoft.com Thu May 28 16:15:13 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:15:13 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

I'm used to them being next to each other. So the entire discussion seems
to be about how to encode a concept vs. how to get the shape you want with
existing code points. If you just want the perfect shape, then maybe an
SVG is a better choice. If we're talking about describing ski-run
difficulty levels in plain text, then the hodge-podge of glyphs being
offered in this thread seems kinda hacky to me.

-Shawn

From verdy_p at wanadoo.fr Thu May 28 16:16:35 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:16:35 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

The "green" physical color does not need encoding. A black disc is enough,
just like the black square and the black diamond/rhombus; the rest is
styling. There's also the orange oval (horizontal) used for free-ride
areas in America. (In Europe there is no symbol, but the yellow color is
used for some authorized "free-ride" pistes in Switzerland; in France,
free-riding is strictly regulated, and there's no signage, as these areas
are not open to the general public: they are too risky, and such signs
could bring too many skiers to dangerous areas without proper training
and equipment.)
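The division of labor Philippe describes, where the character supplies
only the shape and the styling supplies the color, would look something
like this on a skiing web page. A minimal Python sketch; the HTML and the
class name are invented for illustration:

    # Emit a trail-rating marker for a web page: the character data is just
    # a black disc; the green comes from CSS.
    def rating_marker(symbol: str, color: str, label: str) -> str:
        return f'<span class="piste" style="color:{color}">{symbol}</span> {label}'

    print(rating_marker("\u25cf", "green", "beginner (bunny hill)"))
    # prints: <span class="piste" style="color:green">●</span> beginner (bunny hill)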
From shervinafshar at gmail.com Thu May 28 16:20:17 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 14:20:17 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Since the double diamond has map and map-legend usage, it might be a good
idea to have it encoded separately. I know that I'm stating the obvious
here, but the important point is doing the research and showing that it
has widespread usage.

– Shervin
From verdy_p at wanadoo.fr Thu May 28 16:26:00 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 28 May 2015 23:26:00 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

2015-05-28 22:59 GMT+02:00 Doug Ewell :

> Looks like a green circle is the symbol for a beginner slope. (The first
> link also shows that "piste" is the European word for what we call a
> trail, run, or slope.) There is no difference between a "bunny slope"
> and a "beginner" or "novice" slope.

The difference is obvious in Europe, where the "novice" difficulty is
marked by green pistes (slopes below 30%, or almost flat) and the
"beginner/moderate" difficulty is marked by blue pistes (slopes of about
30-35%).

Even America must have this "novice" difficulty, with areas mostly used by
young children (with their parents not skiing but following them on foot,
and a restriction on speed); these areas are protected so that other
skiers will not pass through them. In fact, if you remain in these novice
areas you cannot reach any speed that could cause dangerous collisions:
you have to "push" to advance, otherwise you'll slow down naturally and
stop on the snow.

These areas can be used by walkers, and by hikers on snowshoes
("raquettes").

From lang.support at gmail.com Thu May 28 16:36:32 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 29 May 2015 07:36:32 +1000
Subject: "Unicode of Death"
In-Reply-To:
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

Not the first time Unicode crashes things. There was the Google Chrome bug
on OS X that crashed the tab for any Syriac text.

A.

--
Andrew Cunningham
Project Manager, Research and Development (Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
       lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

From Shawn.Steele at microsoft.com Thu May 28 16:44:58 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 21:44:58 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

Typically we have "slow" zones, which include both "novice" areas and
congested areas. Additionally, the "novice" part of a slope often has a
rope fence delineating it from the rest of the slope. However, on the
maps, etc., it's usually just off to the side of a green run and doesn't
have a special symbol.
From leob at mailcom.com Thu May 28 16:56:39 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Thu, 28 May 2015 14:56:39 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Being used in maps and map legends is not a sufficient condition for
encoding a symbol. If it were, all symbols used in physical maps would
have been encoded, including each and every mineral and rare metal.

Leo

From verdy_p at wanadoo.fr Thu May 28 17:00:32 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 00:00:32 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

The ropes (or other barriers) are also present in Europe, but those areas
are considered true "pistes" by themselves, even if they are relatively
short. In frequent cases they are connected upward to a blue piste (not
for novices), but "slow down" warnings are displayed on them, and the
regulations require taking care of every skier that could be in front of
you. Various tools are used to force skiers to slow down, including
forcing them to slalom between barriers, including flat sections or
sections going upward, and adding a large rest area around the
interconnection.

The European green pistes for novices are also relatively well separated
from the blue pistes (used by all other skiers and interconnected with
more difficult ones: red and black): if there's a blue piste, it will most
often run parallel, separated physically by barriers. This limits the
number of intersections or the need for interconnections (the only
intersection is then at the station itself, in a crowded area near the
equipment that brings skiers to the upper part of the piste).

But my initial question was about the symbol that I have seen (partly)
documented, without an actual image, for ski stations in the US. Maybe the
"bunny hill" symbol is specific to one station and not used elsewhere, or
there are other similar symbols used locally. I wonder if this is not
simply the symbol/logo of a local ski school...
From Shawn.Steele at microsoft.com Thu May 28 17:06:59 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 28 May 2015 22:06:59 +0000
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <20150528135903.665a7a7059d7ee80bb4d670165c8327d.67f49d0e57.wbe@email03.secureserver.net>
Message-ID:

What is the image? Curiosity killed the bunny. I expect that it's limited
to a single ski area or maybe a region.

From verdy_p at wanadoo.fr Thu May 28 17:07:13 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 00:07:13 +0200
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To:
References: <55677762.3060805@oracle.com>
Message-ID:

Not just maps, but documentation. Ski resorts deliver a lot of
documentation, including material explaining safety rules or promoting
their equipment. And the symbols are used on signs (the pistes themselves
are not colored; the snow is still white!). In fact, maps are the least
common use of these symbols (there are far fewer maps available), and
skiers don't have to follow a map when they practice their sport; they
follow the signs. You'll find a large map display only in stations, and
rough maps in documentation that don't show many of the details seen on
the terrain (details that constantly vary across the seasons or with the
weather conditions, so a map will not really help). It's more important
to train people in the signage they'll encounter.
From michel at suignard.com Thu May 28 17:08:09 2015
From: michel at suignard.com (Michel Suignard)
Date: Thu, 28 May 2015 22:08:09 +0000
Subject: Arrow dingbats
In-Reply-To:
References:
Message-ID:

Wingdings added way more arrows; check the 1F800-1F8FF Supplemental
Arrows-C block. In the process, many unifications happened with existing
arrows, resulting, among other things, in the addition of 2B95 and the
re-use, in the context of Wingdings, of many already encoded characters. I
wrote various documents while working on Wingdings, posted on the UTC web
site, that explain the rationale in more detail.

Obviously, when working with a posteriori unification, we sometimes have
to adjust the glyphs in the charts slightly to make the set consistent.
For example, we may use Wingdings glyphs for some characters that were
encoded before we added Wingdings. If you look at the chart page for the
block 2B00-2BFF, it is totally obvious how the set in 2B05-2B0D and 2B95
go together, and there are cross references in the names list to make that
explicit.

Glyph consistency is something I take very seriously when creating charts,
because so many people look at the chart glyphs as the reference, and
given the various sources it is not a simple matter. I use a complex mix
of fonts to get where we are now. By no means does unicode-table.com
represent a reference for these matters. How the characters get
implemented in various platforms and fonts is beyond my control, but at
least I work on having a decent reference in the official Unicode PDF
charts (and 10646).

Michel

From shervinafshar at gmail.com Thu May 28 17:21:41 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Thu, 28 May 2015 15:21:41 -0700
Subject: Encoding map symbols (was: Re: "Bunny hill" symbol...)
Message-ID:

Sufficiency of conditions for encoding is decided on a case-by-case basis
by the UTC. According to the existing criteria for encoding symbols, being
a symbol used in maps and map legends contributes to several of the
criteria in that document and strengthens the case for acceptance. Maybe
all symbols used in physical maps in the world *could* be encoded, if a
strong, compelling case can be presented for them to be used in text
environments. The fact that widely used map symbols have not been encoded
so far does not mean that a strong case cannot be made for encoding them.

Personally speaking, I'm currently researching a proposal for encoding
some of the USGS symbols as well as some other general map symbols.

– Shervin

On Thu, May 28, 2015 at 2:56 PM, Leo Broukhis wrote:

> Being used in maps and map legends is not a sufficient condition for
> encoding a symbol. If it were, all symbols used in physical maps would
> have been encoded, including each and every mineral and rare metal.
-- Shervin

On Thu, May 28, 2015 at 2:56 PM, Leo Broukhis wrote:
> Being used in maps and map legends is not a sufficient condition for
> encoding a symbol. If it were, all symbols used in physical maps would
> have been encoded, including each and every mineral and rare metal.
>
> Leo
>
> On Thu, May 28, 2015 at 2:20 PM, Shervin Afshar wrote:
> > Since the double-diamond has map and map legend usage, it might be a good
> > idea to have it encoded separately. I know that I'm stating the obvious
> > here, but the important point is doing the research and showing that it
> > has widespread usage.
> >
> > -- Shervin
> >
> > On Thu, May 28, 2015 at 2:15 PM, Shawn Steele <Shawn.Steele at microsoft.com> wrote:
> >> I'm used to them being next to each other. So the entire discussion
> >> seems to be about how to encode a concept vs how to get the shape you
> >> want with existing code points. If you just want the perfect shape,
> >> then maybe an svg is a better choice. If we're talking about describing
> >> ski-run difficulty levels in plain text, then the hodge-podge of glyphs
> >> being offered in this thread seems kinda hacky to me.
> >>
> >> -Shawn
> >>
> >> From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy
> >> Sent: Thursday, May 28, 2015 2:12 PM
> >> To: Jim Melton
> >> Cc: Shawn Steele; unicode Unicode Discussion
> >> Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
> >>
> >> Some documentation also suggests that the two diamonds are not stacked
> >> one above the other, but placed horizontally. It's a good point for
> >> using only one symbol, encoding it twice in plain text if needed.
> >>
> >> 2015-05-28 22:15 GMT+02:00 Jim Melton:
> >> I no longer ski, but I did so for many years, mostly (but not
> >> exclusively) in the western United States. I never encountered, at any
> >> USA ski hill/mountain/resort, a special symbol for "bunny hills", which
> >> are typically represented by the green circle meaning "beginner".
> >> That's anecdotal evidence at best, but my observations cover numerous
> >> skiing sites. I have encountered such a symbol in Europe and in New
> >> Zealand, but not in the USA. (I have not had the pleasure of skiing in
> >> Canada and am thus unable to speak about ski areas in that country.)
> >>
> >> The double black diamond would appear to be a unique symbol worthy of
> >> encoding, simply because the only valid typographical representation
> >> (in the USA) is two single black diamonds stacked one above the other
> >> and touching at the points.
> >>
> >> Hope this helps,
> >> Jim
> >>
> >> On 5/28/2015 2:04 PM, Shawn Steele wrote:
> >> So is double black diamond a separate symbol? Or just two of the black
> >> diamond?
> >>
> >> And Blue-Black?
> >>
> >> I'm drawing a blank on a specific bunny sign; in my experience those
> >> are usually just green.
> >>
> >> Aren't there a lot of cartography symbols for various systems that
> >> aren't present in Unicode?
> >> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at kli.org  Thu May 28 18:42:14 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Thu, 28 May 2015 19:42:14 -0400
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID: <5567A7D6.6060102@kli.org>

As was pointed out to me, essentially what you are saying is that you reject my premise that one size does not fit all. You would prefer *everything* be in plain text, "so you wouldn't have to use other formats for it." You're essentially converting plain text into THE format for everything. But it isn't suited for that. If you really believe one size should fit all in this way, I think the problem is that pretty much all of the rest of the computer science community doesn't agree with you. Sorry.

~mark

On 05/28/2015 07:50 AM, William_J_G Overington wrote:
> Responding to Mark E. Shoulson:
>
> The big advantage of this new format is that the result is an unambiguous Unicode plain text file and could be placed within a file of plain text without having to make the whole document a markup file to some format. Plain text is the key advantage.
>
> The following may be useful as a guide to the original problem that I am trying to solve.
>
> http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term
>
> I tried to apply the brilliant new "base character followed by tag characters" format to the problem.
>
> In the future, maybe Serif DrawPlus will have the ability to export a picture to this new format.
>
> William Overington
>
> 28 May 2015

From idou747 at gmail.com  Thu May 28 21:37:25 2015
From: idou747 at gmail.com (John)
Date: Thu, 28 May 2015 19:37:25 -0700 (PDT)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <5567A7D6.6060102@kli.org>
References: <5567A7D6.6060102@kli.org>
Message-ID: <1432867044809.9dc7c15b@Nodemailer>

"Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)..."

If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way? Part of the reason, at least, for having any code system rather than just pixels and images is to efficiently and consistently encode data. Unicode has private use ranges of codes. I can see an argument that it would be desirable to be able to send someone text with private use ranges and have the header define some default renderings. I'm not sure that replacing a document of 100,000 characters with 100,000 embedded html5 images would achieve that.

Mark E. Shoulson wrote:
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at att.net  Thu May 28 22:14:19 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 28 May 2015 20:14:19 -0700
Subject: Arrow dingbats
In-Reply-To: 
References: 
Message-ID: <5567D98B.4080006@att.net>

Michel Suignard (editor of ISO/IEC 10646) responded to these questions, but let me augment his response with some more detailed history here. (Pardon the length of the reply, but these things tend never to be as simple as people assume and hope they are.)

On 5/28/2015 2:08 PM, Chris wrote:
> So it sounds like 27a1 came first. Then 2b05 etc was added to complete
> the set with 27a1, except that it didn't complete the set because
> nobody aligned the glyphs. Then they added U+2B95 in a 2nd attempt to
> complete the set? (Why not just fix the old arrow?)

O.k. That is *roughly* correct, but only very roughly.

U+27A1 BLACK RIGHTWARDS ARROW

That *did* come first. It has a Unicode Age=V1_1, dating back to 1993 in the standard. (Actually, its Unicode history goes back even further, but 1993 is enough for this discussion.) U+27A1 was part of the set of dingbats encoded for compatibility with the ITC Zapf Dingbats series 100, which saw widespread early commercial implementation on PostScript printers and was widely used as a font encoding back in the 80's and early 90's.

An important thing to note about the Zapf Dingbat arrows (go look at the Unicode code chart for the 27XX block) is that almost all of those arrows are exclusively right-facing:

http://www.unicode.org/charts/PDF/U2700.pdf

It was assumed at the time that in actual implementations that used these arrows in documents, they would be used by PostScript drivers that had arbitrary scale and rotate functions that would allow, among other things, the rotation of an arrow to display in any orientation. The Unicode *character* encoding of these was, rather, intended as a code point compatibility mapping that would enable Unicode mapping of documents that had used font-encoded Zapf dingbats simply as symbolic "blorts" in text. This compatibility issue explains why, back in 1993, the whole set of Dingbat arrows was not elaborated into character-encoded rotational sets of symbols (i.e. rightwards, leftwards, upwards, downwards, ...).

U+2B05 LEFTWARDS BLACK ARROW

That one (and the near-complete rotational set of similar black arrows at U+2B05..U+2B0D) have a Unicode Age=V4_0 (2003). Andrew West was correct in identifying the source of these. They were brought to SC2/WG2 and proposed for encoding by the DPRK, back in 2001, for compatibility with a North Korean standard. See page 5 of the pdf in:

http://www.unicode.org/L2/L2001/01349-N2374-DPRK-AddSymbols.pdf

That is the proximate source of these "black arrows" in the Unicode Standard (along with the white versions at U+2B00..U+2B04). The glyphs that were used for these arrows in Unicode 4.0 are also derived from that source. However, the fact that WG2 N2374 (i.e., the DPRK) did not ask for also encoding a separate "RIGHTWARDS BLACK ARROW" indicates that they considered the existing U+27A1 BLACK RIGHTWARDS ARROW to suffice for mapping to their standard.

The fact that "nobody aligned the glyphs" in 2003, when these were published, was partly because: a) the glyphs were inherited from the proposal document and then ISO ballot documents, and nobody commented on or required them to be changed in ballot comments, and b) nobody much cared, because these were compatibility additions for a DPRK standard, and weren't mapped to any commercial sets at the time, anyway.
The glyphs for U+2B05..U+2B0D remained unchanged in the standard from Unicode 4.0 through Unicode 6.3. (Again, because nobody had any strong reason to do otherwise.) And that explains why, as implementations of the Unicode 4.0 (and later) repertoire came to be more widely supported in fonts, the glyphs for U+2B05 tended to have a relatively narrow arrow shaft that matched the Unicode charts.

The unification of the rotational set U+2B05..U+2B0D with the existing ITC Zapf Dingbat U+27A1 was *implicit* in the encoding, but was not explicitly called out by anything other than a note in the names list for the 2BXX block that pointed to the 27XX block for "Other white and black arrows to complete this set". In practice, most people just put glyphs in fonts that matched the code charts.

U+2B95 RIGHTWARDS BLACK ARROW

This one has a Unicode Age=V7_0 (2014). It was added as a result of a complete re-rationalization of all of the arrow symbols in the standard, required, as Michel Suignard noted, to deal with the addition of compatibility characters to cover the multitude of arrow symbols in the Wingding sets. If you want to see the explicit rationale and the point at which this happened, see page 21 in the pdf of:

http://www.unicode.org/L2/L2012/12130-n4239.pdf

That was the disposition of comments for PDAM 1.2 to ISO/IEC 10646 3rd edition. And the relevant note from the editor is:

"To complete the set of BLACK ARROW in 2B05..2B0D a new character is added: 2B95 RIGHTWARDS BLACK ARROW (The character 27A1 BLACK RIGHTWARDS ARROW in the dingbat block is not an appropriate match for the other 9 characters)."

This happened in the context of mapping against multiple Wingding arrow shapes, which were at the time being added to the standard in explicit rotational sets. Doing this consistently required a rationalization of the shapes and aspects of the white and black arrows in the 2BXX block. And the explicit changes that ended up in the Unicode code charts can be traced back to the following repertoire chart:

http://www.unicode.org/L2/L2012/12128-n4244.pdf

See pages 36 and 49 of the pdf. Page 49, in particular, shows explicitly what Michel pointed out: the addition of the new character 2B95 was deliberately aligned with the glyph changes for 2B05, etc.

So now, finally, to your question: "Why not fix the old arrow?" Well, Michel explained that in WG2 N4239. If you are going to map an entire set to Wingdings (as opposed to a then decade-old proposal document from the DPRK), it makes sense to use appropriate glyphs for that, in the context of all the other additions. But it is *not* appropriate to retroactively pick out the old ITC Zapf Dingbats series 100 glyph (from amongst a set of others with very explicit shapes) and change *that* glyph just to make the rotational set complete. Hence the addition of U+2B95 as the best solution for Unicode 7.0.

> Except that nobody seems to have U+2B95 aligned either.

It takes a while for font implementations to catch up with the standard. The glyphs for U+2B05..U+2B0D have been in fonts for some time now, and the multitude of arrow additions for Unicode 7.0 are relatively new and not yet fully supported in many fonts. *When* a font adjusts for the addition of the new sets of arrows, however, it *should* take into account the explicit glyph updates for U+2B00..U+2B0D, which were clearly intentional, as part of all this work on the arrows to cover Wingdings.
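If you want to check what your own environment knows about the completed set, a quick sketch (Python, assuming a build whose unicodedata tables are at least Unicode 7.0) lists the rotational set next to the old dingbat:

    import unicodedata

    # The Unicode 4.0 rotational set of white and black arrows, the
    # Unicode 7.0 completion, and the original 1993 Zapf dingbat.
    for cp in list(range(0x2B00, 0x2B0E)) + [0x2B95, 0x27A1]:
        print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))

A build with older tables will raise ValueError for U+2B95, which is itself a quick way to tell whether the 7.0 addition is present on your system.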
> On unicode-table.com it looks totally different,

You cannot depend on unicode-table.com for definitive information about glyphs. That site is not coordinated with or sanctioned by the Unicode Consortium. If you want definitive information about encoding and current representative glyphs for each character, please go instead to:

http://www.unicode.org/charts/

> and Mac doesn't even have it.

Implementations may well lag in the addition of new sets of symbols from Unicode 7.0.

> Is there any hope this will actually fix it?

Yes.

> Has the unicode consortium made it clear to one and all that U+2B95 is supposed to align?

Yes. (See above.)

--Ken

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmus-inc at ix.netcom.com  Fri May 29 00:46:58 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 28 May 2015 22:46:58 -0700
Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: 
References: <55677762.3060805@oracle.com>
Message-ID: <5567FD52.6020007@ix.netcom.com>

On 5/28/2015 2:15 PM, Shawn Steele wrote:
> I'm used to them being next to each other. So the entire discussion
> seems to be about how to encode a concept vs how to get the shape you
> want with existing code points. If you just want the perfect shape,
> then maybe an svg is a better choice. If we're talking about
> describing ski-run difficulty levels in plain text, then the
> hodge-podge of glyphs being offered in this thread seems kinda hacky
> to me.
>
> -Shawn

*Symbols have a rather different relation between identity and collection of typical shapes than letters.*

For symbols, the way they are re-used in different conventions is different as well.

For letters, in many scripts, what matters is that a) they represent a member of an alphabet (a subset of a script), and b) readers and writers can agree *which* member of the alphabet is intended (identity). This identity selection is the sum total of the "semantics" of the character, when it comes to letters.

Some symbols, like the integral signs, are closely tied to a well-defined notation, which in turn governs the acceptability of the range of visual representations. For general symbols you quickly get to the situation where the shape *is* the identity.

For geometric shapes, you can't really predict how they are going to be used and in which conventions. (That is true for the more generically shaped punctuation marks as well, like the period.) Because you can't predict the use to be made of them, what you need to guarantee the writer (author) is that the shape he or she sees is what the reader will see, so that the author can make the determination that the symbol represents the notational element, or the concept, that was intended.

That means you really need to approach the encoding of symbols differently from letters, where the latter have a well-established "identity" and the only task for a visual representation is to give enough unambiguous detail so as to be able to select that identity from a restricted set. (Hence the wide range of wonderfully whimsical decorative fonts.)

It's useless to treat some "concept" as the functional equivalent of a letter's membership in an alphabet. Unlike the case of writing systems, neither authors nor readers have the same kind of prior agreement on how much you can vary a shape and still refer to the same concept. (Obviously, even among symbols there is some variation in this regard.)
As a result, you simply need to allow the encoding to become more shape based, so that authors can create documents that do not have to rely on a missing agreement with readers about which other shapes may or may not be substituted successfully without affecting the semantics (not of the code point, but of the text).

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jknappen at web.de  Fri May 29 02:32:30 2015
From: jknappen at web.de (Jörg Knappen)
Date: Fri, 29 May 2015 09:32:30 +0200
Subject: Aw: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices
In-Reply-To: 
References: <55677762.3060805@oracle.com>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From wjgo_10009 at btinternet.com  Fri May 29 03:38:19 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 29 May 2015 09:38:19 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <5567A7D6.6060102@kli.org>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org>
Message-ID: <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>

Responding to Mark E. Shoulson:

> As was pointed out to me, essentially what you are saying is you reject my premise that one size does not fit all.

Well, I do not know where that came from, but no, I do not reject that premise. There is plain text, there is HTML, there is XML.

HTML is good for web pages.

Plain text is, amongst other applications, good for text messages.

The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text.

I have not purported that it become the only format for transmitting images.

> You would prefer *everything* be in plain text, "so you wouldn't have to use other formats for it." You're essentially converting plain text into THE format for everything.

No. Use the best format for the task that is being carried out. I am enthusiastic that as much as possible can be done in open source formats, rather than an end user of computing equipment needing to rely on expensive proprietary software packages with proprietary file formats that cannot be accessed without expensive software.

> If you really believe one size should fit all in this way, ...

But I don't.

Just because I opine that plain text is best for some applications, and I have suggested a format that would allow a graphic to be included directly in a plain text file, does not mean that I opine that everything should be plain text.

For example, I use HTML files, gif files, png files, pdf files, wav files, and TTF files as appropriate.

http://www.users.globalnet.co.uk/~ngo/library.htm
http://www.users.globalnet.co.uk/~ngo/spec0001.htm
http://www.users.globalnet.co.uk/~ngo/song1018.htm
http://www.users.globalnet.co.uk/~ngo/song1021.htm

I have embedded a wav file in a pdf and published the result on the web.

http://www.users.globalnet.co.uk/~ngo/the_mobile_art_shop.pdf

Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting?

What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice? How would that be done otherwise than by the format that I am suggesting?
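To make the shape of such a sequence concrete: the TAG characters U+E0020..U+E007E shadow printable ASCII, and U+E007F is CANCEL TAG, so a base character followed by a tagged payload can be built mechanically. The sketch below (Python, with the base character and the payload chosen arbitrarily for illustration) only shows the code points involved; nothing in Unicode 8.0 assigns any image semantics to such a sequence.

    # Build a "base character + tag characters" sequence, terminated by
    # CANCEL TAG. The payload here is illustrative only.
    def tag_sequence(base, payload):
        assert all(0x20 <= ord(c) <= 0x7E for c in payload)
        return base + "".join(chr(0xE0000 + ord(c)) for c in payload) + "\U000E007F"

    seq = tag_sequence("\U0001F4BC", "demo")  # U+1F4BC is an arbitrary base
    print(["U+%04X" % ord(c) for c in seq])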
William Overington

29 May 2015

From andrewcwest at gmail.com  Fri May 29 05:30:40 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Fri, 29 May 2015 11:30:40 +0100
Subject: KPS 9566 mappings (was Re: Arrow dingbats)
Message-ID: 

As someone who supports opening of KPS 9566 encoded files in my software (BabelPad), I am interested in those characters proposed by the DPRK (http://std.dkuug.dk/jtc1/sc2/wg2/Docs/n2374.pdf) that were not accepted for encoding but which are still in the latest version of the DPRK standard, KPS 9566-2012(?). Red Star OS 3.0 Unicode-maps most of them to the PUA, which is not satisfactory in most cases.

LEFTWARDS SCISSORS = KPS 9566-2012 ACD5

There are five scissors characters at 2700..2704, but they are all right-facing. I think it would not be unreasonable to encode a left-facing scissors character for compatibility with KPS 9566. Alternatively, standardized variants for left-facing and right-facing scissors could be defined for all of 2700..2704, but that might open a nasty precedent that we come to regret, so I would prefer simply encoding a single left-facing scissors character.

CIRCLED UPWARD INDICATION = KPS 9566-2012 ACD4

This could be represented as U+1F446 WHITE UP POINTING BACKHAND INDEX + U+20DD COMBINING ENCLOSING CIRCLE.

WHITE UP-POINTING TRIANGLE WITH BLACK TRIANGLE = KPS 9566-2012 A2F1
WHITE UP-POINTING TRIANGLE WITH HORIZONTAL FILL = KPS 9566-2012 A2F2
WHITE UP-POINTING TRIANGLE WITH UPPER LEFT TO LOWER RIGHT FILL = KPS 9566-2012 A2F3
WHITE UP-POINTING TRIANGLE WITH UPPER RIGHT TO LOWER LEFT FILL = KPS 9566-2012 A2F4

I don't know why these were not accepted for encoding. As far as I can tell, they cannot be represented by any current Unicode character, and I think it would be reasonable to encode them for compatibility with KPS 9566.

RIGHT PARENTHESIS WITH FULL STOP = KPS 9566-2012 A1DC
RIGHT DOUBLE ANGLE BRACKET WITH FULL STOP = KPS 9566-2012 A1DD

I understand why these were not accepted for encoding, but the precedent of U+2047 DOUBLE QUESTION MARK, U+2048 QUESTION EXCLAMATION MARK, and U+2049 EXCLAMATION QUESTION MARK -- which I believe were encoded because they are used in vertically oriented Mongolian text, and it is problematic to embed ?? etc. horizontally in vertical text -- suggests that it may be appropriate to encode these two characters for compatibility with KPS 9566.

VULGAR FRACTION ONE HALF WITH HORIZONTAL BAR = KPS 9566-2012 A7FA
VULGAR FRACTION ONE THIRD WITH HORIZONTAL BAR = KPS 9566-2012 A7FB
VULGAR FRACTION TWO THIRDS WITH HORIZONTAL BAR = KPS 9566-2012 A7FC
VULGAR FRACTION ONE QUARTER WITH HORIZONTAL BAR = KPS 9566-2012 A7FD
VULGAR FRACTION THREE QUARTERS WITH HORIZONTAL BAR = KPS 9566-2012 A7FE

These contrast with KPS 9566 A7CA..A7CE, which are vulgar fractions with a diagonal bar. The issue of distinguishing between a horizontal and a diagonal fraction slash is not restricted to North Korea, and I think that there is an argument to be made for defining standardized variants for all vulgar fraction characters to specify a glyph with either a horizontal bar or a diagonal bar.

HAMMER AND SICKLE AND BRUSH
CIRCLED HAMMER AND SICKLE AND BRUSH

I assume that there is no appetite to encode these symbols for the Workers' Party of Korea, and so mapping them to the PUA is appropriate.

There is also the proposed VERTICAL TILDE character, which was not accepted for encoding but which Red Star OS 3.0 Unicode-maps to U+2E2F VERTICAL TILDE, added in Unicode 5.1 for Cyrillic transliteration. This mapping does not seem wholly satisfactory to me, and I wonder whether it would not be better to simply encode a PRESENTATION FORM FOR VERTICAL TILDE at FE1A.
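For a converter that has to do something with these code points today, the provisional choices discussed above end up looking like the following sketch (Python; the mapping table reflects this message's suggestions, not settled assignments, and the PUA fallback code point is purely hypothetical):

    # Provisional KPS 9566-2012 fallbacks, per the discussion above.
    KPS_FALLBACK = {
        0xACD4: "\U0001F446\u20DD",  # CIRCLED UPWARD INDICATION as a sequence
        0xACD5: "\u2702",            # LEFTWARDS SCISSORS: nearest (right-facing) match
        0xA1DC: ").",                # RIGHT PARENTHESIS WITH FULL STOP, decomposed
    }

    def map_kps(cp):
        # Unmapped code points go to a private use character here, standing
        # in for the Red Star OS 3.0 PUA mappings mentioned above; the
        # specific PUA assignment is hypothetical.
        return KPS_FALLBACK.get(cp, "\uE000")

    print(map_kps(0xACD5))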
Andrew

From alolita.sharma at gmail.com  Fri May 29 10:51:50 2015
From: alolita.sharma at gmail.com (Alolita Sharma)
Date: Fri, 29 May 2015 08:51:50 -0700
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

Seems like we may see a temporary fix for iOS.

http://www.businessinsider.com/apple-issues-temporary-siri-workaround-iphone-crash-unicode-text-message-bug-2015-5

Best,
Alolita

On Thu, May 28, 2015 at 2:36 PM, Andrew Cunningham wrote:
> Not the first time unicode crashes things. There was the google chrome bug
> on osx that crashed the tab for any syriac text.
>
> A.
>
> On Friday, 29 May 2015, Bill Poser wrote:
> > No doubt the evil Unicode Consortium is in league with the Trilateral
> > Commission, the Elders of Zion, and the folks at NASA who faked the moon
> > landing.... :)
> >
> > On Thu, May 28, 2015 at 7:53 AM, Doug Ewell wrote:
> >> Unicode is in the news today as some folks with waaay too much time on
> >> their hands have discovered a string consisting of Latin, Arabic,
> >> Devanagari, and CJK characters that crashes Apple devices when it
> >> appears as a pop-up message.
> >>
> >> Although most people seem to identify it correctly as a CoreText bug,
> >> there are a handful, as you might expect, who attribute it to some shady
> >> weirdness in Unicode itself. My favorite quote from a Reddit user was
> >> this:
> >>
> >> "Every character you use has a unicode value which tells your phone what
> >> to display. One of the unicode values is actually never-ending and so
> >> when the phone tries to read it it goes into an infinite loop which
> >> crashes it."
> >>
> >> I've read TUS Chapter 4 and UTR #23 and I still can't find the
> >> "never-ending" Unicode property.
> >>
> >> Perhaps astonishingly to some, the string displays fine on all my
> >> Windows devices. Not all apps get the directionality right, but no
> >> crashes.
> >>
> >> --
> >> Doug Ewell | http://ewellic.org | Thornton, CO ????
>
> --
> Andrew Cunningham
> [...]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From leob at mailcom.com  Fri May 29 11:09:47 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Fri, 29 May 2015 09:09:47 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org> <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost>
Message-ID: 

> The format that I am suggesting would allow the image for a non-standard emoji character to be included in a text message, with the image located at the correct place in the text.
A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode?

Leo

On Fri, May 29, 2015 at 1:38 AM, William_J_G Overington wrote:
> [...]

From shervinafshar at gmail.com  Fri May 29 11:16:50 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Fri, 29 May 2015 09:16:50 -0700
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

> Ask Siri to "read unread messages."

Siri saves the day :).

-- Shervin

On Fri, May 29, 2015 at 8:51 AM, Alolita Sharma wrote:
> Seems like we may see a temporary fix for iOS.
>
> http://www.businessinsider.com/apple-issues-temporary-siri-workaround-iphone-crash-unicode-text-message-bug-2015-5
>
> Best,
> Alolita
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wjgo_10009 at btinternet.com  Fri May 29 11:31:00 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 29 May 2015 17:31:00 +0100 (BST)
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: 
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost>
Message-ID: <33086681.59649.1432917060255.JavaMail.defaultUser@defaultHost>

Responding to Philippe Verdy:

> There's no advantage because what you want to create is effectively another markup language with its own syntax (but requiring new obscure characters that most applications and users will not be able to interpret and render correctly in the way intended by you, ...

Well, if the format became accepted as part of Unicode, then appropriate applications could well be produced that would interpret the format and display an image in the desired place.

> ... and with still many things you have forgotten about the specific needs for images (e.g. colorimetry profiles, aspect ratio of pixels with bitmaps, undesired effects that must be controlled such as "moiré" artefacts).

The format is at present just a basic suggestion. Rather than just stating what you consider I have forgotten and dismissing the format, how about joining in the progress and specifying what you consider needs adding to the format, and perhaps suggesting how to add in that functionality in the style that the format uses.

> You don't need new characters to create a markup language and its syntax.
> Today the world goes very well with HTML(5) which is now the best markup
> language for documents (including for inserting embedded images that don't
> require any external request, or embedding special effects on images, such
> as animation or dynamic layouts for adapting the document to the rendering
> device, with the help of CSS and Javascript that are also embeddable).

The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here.

Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting?

What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice? How would that be done otherwise than by the format that I am suggesting?

> At least with HTML5 they don't try to reinvent the image formats, and there's ample space for supporting multiple image formats tuned for specific needs (e.g. JPEG, PNG, GIF, SVG, TIFF...), including animation and video, and synchronization of images and audio in time for videos, or with user interactions. They are designed separately and benefit from patient research made over many years (your desired format, still undocumented, is largely under the level needed for images, independently of the markup syntax you want to create to support them, and independently of the fact that you also want to encode these syntactic elements with new characters, something that is absolutely not needed for any markup language).

Well, it is undocumented apart from posts in this thread because I have put forward the format for discussion. A pdf document for consideration by the Unicode Technical Committee could be produced and submitted if there is interest in the format, the content of the pdf document perhaps including suggestions from this thread, if any such suggestions are forthcoming.

> In summary, you are reinventing the wheel.

Well, this is progress, producing an additional format for expressing an image for application in various specific specialised circumstances.

William Overington

29 May 2015

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org  Fri May 29 13:12:46 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 29 May 2015 11:12:46 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
Message-ID: <20150529111246.665a7a7059d7ee80bb4d670165c8327d.10c55b41ea.wbe@email03.secureserver.net>

William_J_G Overington wrote:

>> There's no advantage because what you want to create is effectively
>> another markup language with its own syntax (but requiring new
>> obscure characters that most applications and users will not be able
>> to interpret and render correctly in the way intended by you, ...
>
> Well, if the format became accepted as part of Unicode then
> appropriate applications could well be produced that would interpret
> the format and display an image in the desired place.

I think this cuts to the heart of what people have been trying to say all along. Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
From verdy_p at wanadoo.fr  Fri May 29 14:07:45 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 21:07:45 +0200
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <1432867044809.9dc7c15b@Nodemailer>
References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer>
Message-ID: 

2015-05-29 4:37 GMT+02:00 John:
> "Today the world goes very well with HTML(5) which is now the best markup
> language for documents (including for inserting embedded images that don't
> require any external request)..."
>
> If I had a large document that reused a particular character thousands of
> times, would this HTML markup require embedding that character thousands
> of times, or could I define the character once at the beginning of the
> sequence, and then refer back to it in a space-efficient way?

HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.

You may also use PUAs for the same purpose (however, I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would use the SVG font format, which is valid in CSS, for defining a collection of glyphs). If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet.

With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs. It's up to you to decide whether the document must be standalone (embedding everything it needs) or may use external references for missing definitions; HTML allows both (and SVG as well, when it contains plain-text elements).
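As a small runnable illustration of the define-once, reuse-many-times point (a sketch only: it uses an internal entity in an XML serialization, which is where that mechanism actually lives; in plain HTML5 you would reach for a CSS class instead, as described above):

    # "Define once, refer back cheaply": an internal XML entity, expanded
    # by Python's stock parser. The entity value is just a placeholder.
    import xml.etree.ElementTree as ET

    doc = '<!DOCTYPE d [<!ENTITY glyph "[custom glyph]">]><d>&glyph;&glyph;&glyph;</d>'
    print(ET.fromstring(doc).text)  # -> [custom glyph][custom glyph][custom glyph]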
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr  Fri May 29 15:23:22 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 29 May 2015 22:23:22 +0200
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

2015-05-28 23:36 GMT+02:00 Andrew Cunningham:
> Not the first time unicode crashes things. There was the google chrome bug
> on osx that crashed the tab for any syriac text.

"Unicode crashes things"? Unicode has nothing to do with those crashes, which are caused by bugs in applications that make incorrect assumptions (in fact, not even related to the characters themselves, but to the supposed behavior of the layout engine). Programmers and designers, for example, VERY frequently forget the constraints for RTL languages and make incorrect assumptions about left and right sides when sizing objects, or they don't expect that the cursor will advance backward and forget that some measurements can be negative: if they use this negative value to compute the size of a bitmap rendering surface, they'll get an out-of-memory condition and unchecked null pointers returned, and then they will crash, having assumed the buffer was effectively allocated.

These are the same kind of bugs as the too-common buffer overruns with unchecked assumptions: the code is kept because "it works as is" in their limited immediate tests.

Producing full-coverage tests is a difficult and lengthy task that programmers do not always have the time to do, when they are urged to produce a workable solution for some clients and then given no time to improve the code before the same code is distributed to a wider range of clients.

Commercial staff do that frequently; they can't even read the technical limitations even when they are documented by programmers... In addition, the commercial staff like selling software that will cause customers to ask for support... which will be billed! After that, programmers are overwhelmed by bug reports and support requests, and have even less time to design the other things that they are working on and still have to produce. QA tools may help programmers in this case by providing statistics about the effective costs of producing new software with better quality, and the cost of supporting it when it contains too many bugs: commercial teams like those statistics because they can convert them to costs, commercial margins, and billing rates. (When such QA tools are not used, programmers will rapidly leave; they are fed up by the growing pressure to do ever more in the same time, with a growing number of "urgent" support requests.)
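To make that failure mode concrete, a minimal sketch (hypothetical layout code in Python, not any real engine):

    # Hypothetical layout code: an RTL run can yield a negative total
    # advance; an unchecked width then poisons the allocation size, and
    # the crash happens far from the actual mistake.
    def allocate_surface(advances, height):
        width = sum(advances)          # may be negative for an RTL run
        if width <= 0 or height <= 0:  # the check that buggy code omits
            raise ValueError("invalid surface size %d x %d" % (width, height))
        return bytearray(width * height)

    allocate_surface([5, 7, 4], 16)    # fine
    # allocate_surface([5, -12, 4], 16) would raise instead of crashing later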
Those that say "Unicode crashes things" do the same thing: they make broad unchecked assumptions about how things are really made or how things are actually working.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lang.support at gmail.com  Fri May 29 18:20:08 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 30 May 2015 09:20:08 +1000
Subject: "Unicode of Death"
In-Reply-To: 
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: 

Geez Philippe,

It was tongue in cheek.

A.

On Saturday, 30 May 2015, Philippe Verdy wrote:
> [...]

--
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From c933103 at gmail.com  Fri May 29 19:20:42 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Sat, 30 May 2015 08:20:42 +0800
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: 
References: 
Message-ID: 

Hello, I am new to this mailing list and have some questions about Unicode that I am looking for answers to, or guidance on. Can anyone provide me some information regarding any of the questions below, or point out where I should look for answers instead?

1. I have seen a chinese character ??? in a Vietnamese dictionary, NHAT DUNG THUONG DAM DICTIONARY, which is digitized at http://www.nomfoundation.org/common/show.php?detail=2117 , and I have also checked the Unihan database, which does not include this character. I then read http://unicode.org/pending/proposals.html , which lists the requirements and processes needed to propose a new character to Unicode, and which points to mailing lists for help.

So, a.) In http://www.unicode.org/alloc/Pipeline.html , it shows that CJK Extension E and F have already been accepted, but where can I check those proposals to see if the character is in them or not?
And b.) It says that to propose a new character, the proposal must include information about someone who would agree to provide a computer font for publishing the standard. Does that mean I have to provide info about someone who is anticipated to agree to do so, or do I need to contact them for their agreement first? And does that mean I can just put info of someone who is making a free full-coverage Unicode CJK font into the proposal?

And c.) Just like question (b), do "names and addresses of appropriate contacts within national body or user organizations" represent the Vietnamese government in this case?

2. Are combining characters like U+20DD intended to work with all different types of characters, or is it some problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear to be separate in most fonts I use, but if I change the Hiragana Yu into a conventional = sign or some Latin character, most fonts are at least somehow able to put them together. Or, is there any better/alternative representation in Unicode that can show Japanese hiragana yu in a circle?

3. From what I read, Unicode records different regional glyphs for a single character. Is there a character in Vietnamese chu nom that is also present in other languages (Chinese, Japanese, Vietnamese), but has some special feature in the glyph that makes it different from all the other variants, so that if the computer system displays that character, I can immediately tell it is displaying the character as Vietnamese chu nom rather than as a character of the other languages? Furthermore, is simply using 'vi' in CSS's lang parameter sufficient to force browsers to show the chu nom glyph instead of other glyphs, or is something like vi-Nom or vi-Hani or Han-Nom needed? (This part is less directly related to Unicode, so I don't know if this is a suitable place to ask; please tell me if it is not.)

4. In CJK Symbols and Punctuation, the proper name mark and the book name mark are not included. While there are characters like U+2584, U+FE33, U+FE4F, and U+FE34 in Unicode that are more or less a representation of the two symbols, they do not appear below or to the left of typed characters when text flow is horizontal/vertical; instead, they occupy their own space, which makes them of little use in daily life. And while the proper name mark and book name mark can be represented by text editing software and CSS, those representations are not ideal, and the marks do match the "Criteria for Encoding Symbols". Is it possible to make a new Unicode symbol, or change some current symbol into one that could appear in a suitable place relative to other characters when typed? And a property of the symbol is that when used in a case like ???? in which ?? and ?? are two different proper names (place names), an underline should go below them without any separation between the characters ? and ? or ? and ? (when text is written horizontally), but at the same time the underline should not be linked between ? and ?, as ? is the end of the first place name while ? is the start of the other.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From kenwhistler at att.net  Fri May 29 20:50:28 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 29 May 2015 18:50:28 -0700
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: 
References: 
Message-ID: <55691764.4030802@att.net>

On 5/29/2015 5:20 PM, gfb hjjhjh wrote:
> 1. I have seen a chinese character ??? in a Vietnamese dictionary, NHAT DUNG THUONG DAM DICTIONARY.
>
> So, a.) In http://www.unicode.org/alloc/Pipeline.html , it shows that CJK Extension E and F have already been accepted, but where can I check those proposals to see if the character is in them or not?

For Extension E, you can check the following code chart:

http://www.unicode.org/charts/PDF/Unicode-8.0/U80-2B820.pdf

See: U+2C89A..U+2C931 (pp. 54-56 of the pdf) for the relevant radical (#149). But I don't see that character in the list of Extension E characters.

Extension F is harder to track down, because it has not yet been approved by the UTC, and comes in two pieces, with different progression so far in the ISO committee. Perhaps somebody on this list who has better access to the relevant documents can let you know whether ??? can be found in those sets.

> And b.) It says that to propose a new character, the proposal must include information about someone who would agree to provide a computer font for publishing the standard. Does that mean I have to provide info about someone who is anticipated to agree to do so, or do I need to contact them for their agreement first? And does that mean I can just put info of someone who is making a free full-coverage Unicode CJK font into the proposal?

It would require (eventually) provision of a font with correct display of just the character proposed -- but in the case of CJK additions, these first go through a process of collection and review by the Ideographic Rapporteur Group. The best thing to do is to work with a national body concerned with CJK characters and ensure that they include this character on their list of submissions for IRG review.

> And c.) Just like question (b), do "names and addresses of appropriate contacts within national body or user organizations" represent the Vietnamese government in this case?

If the character has not been submitted to the IRG for review, it would probably be best to work through the Vietnamese national standards body. Again, people on this list may be able to provide you the correct contact information for them.

> 2. Are combining characters like U+20DD intended to work with all different types of characters, or is it some problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear to be separate in most fonts I use, but if I change the Hiragana Yu into a conventional = sign or some Latin character, most fonts are at least somehow able to put them together. Or, is there any better/alternative representation in Unicode that can show Japanese hiragana yu in a circle?

Combining enclosing marks in principle could work with most characters, but in practice most arbitrary combinations do not work very well, because they would require very complicated font support.
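The sequence itself is easy to produce and inspect; whether it actually renders as a circled ゆ is purely a question of font support. A quick sketch (Python, any version with reasonably current Unicode tables):

    import unicodedata

    # HIRAGANA LETTER YU followed by COMBINING ENCLOSING CIRCLE; the
    # sequence is well formed even when fonts fail to stack the pair.
    for ch in "\u3086\u20DD":
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))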
Is it > possible to make a new unicode symbol, or change some current symbol > into one that could appear in suitable place of other characters when > typed? And a property of the symbol is that when used in case like ? > ??? which ?? and ?? are two different proper name (place name), > so an underline should go below them without any separation between > the character ?and? or ?and? (when text are written horizontally), > but at the same time the underline should not be linked between ? and > ? as ? is the end of first place name while ? is the start of the > other. > What you are talking about is, indeed, best handled by text styling attributes, rather than by individual character encoding. These are various CJK-specific underlining styles (for horizontal text layout) or sidelining styles (for vertical text layout). It is precisely because these require highlighting for ranges of characters (without breaks) that this kind of text decoration is handled best by style attributes (or markup), rather than by individual combining symbols. The characters U+FE33, U+FE34, U+FE4F (but not U+2584) are compatibility characters only for mapping to old Chinese standards that had individual characters encoded for these underlining or sidelining text highlights, but which required specialized text layout programs to make any use of them. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpsuzuki at hiroshima-u.ac.jp Sat May 30 00:46:21 2015 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Sat, 30 May 2015 14:46:21 +0900 Subject: ["Unicode"] Re: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55691764.4030802@att.net> References: <55691764.4030802@att.net> Message-ID: <55694EAD.6030604@hiroshima-u.ac.jp> Hi, Please let me ask a slightly off-topic question, ? = ??? (not ???) is coded at U+46E9. Of course, the unification between ? vs ? is not applied basically, so the separated encoding of ??? would be reasonable (if there is a requirement), but I want to know whether Vietnamese user community distinguishes ??? and ??? semantically. Do you know anything? Regards, mpsuzuki Ken Whistler wrote: > > > On 5/29/2015 5:20 PM, gfb hjjhjh wrote: >> >> 1. I have seen a chinese character ??? from a Vietnamese dictionary >> NHAT DUNG THUONG DAM DICTIONARY* * >> > >> So, a.) In http://www.unicode.org/alloc/Pipeline.html , it show that >> CJK Extension E and F have already been accepted, but where can I >> check those proposals to see if the xharacter is in them or not? >> > > For Extension E, you can check the following code chart: > > http://www.unicode.org/charts/PDF/Unicode-8.0/U80-2B820.pdf > > See: U+2C89A..U+2C931 (pp. 54-56 of the pdf) for the relevant > radical (#149). But I don't see that character in the list of > Extension E characters. > > Extension F is harder to track down, because it has not yet been > approved by the UTC, and comes in two pieces, with different > progression so far in the ISO committee. Perhaps somebody on this list > who has better access to the relevant documents can let you > know whether ??? can be found in those sets. > >> and b.) 
it say to propose a new character, the proposal must include >> information about someone who would agree to provide a computer font >> for publishing the standard, do that mean i have to provide info about >> someone who is anticipated to agree on doing so or do i need to >> contact them for their agreement first, and does that mean I can just >> put info of someone who are making free full unicode CJK coverage font >> into the proposal?, >> > > It would require (eventually) provision of a font with correct display > of just the character proposed -- but in the case of CJK additions, these > first go through a process of collection and review by the Ideographic > Rapporteur Group. The best thing to do is to work with a national > body concerned with CJK characters and ensure that they include > this character on their list of submissions for IRG review. > >> and c.) just like the question (b), do "names and addresses of >> appropriate contacts within national body or user organizations" >> represent Vietnamese government in this case? >> > > If the character has not been submitted to the IRG for review, it would > probably be best to work through the Vietnamese national standards > body. Again, people on this list may be able to provide you the > correct contact information for them. > >> 2. Is combined characters like U+20DD intended to work with all >> different type of characters, or is it some problem related to >> implementation ? as I when i write ?? (Japanese Hiragana Letter Yu + >> Combining Enclosing Circle) appear to be separate on most font I use, >> but if I change the Hiragana Yu into a conventional = sign or some >> latin character, most fonts are at least somehow able to put them >> together. Or, is there any better/alternative representation in >> unicode that can show japanese hiragana yu in a circle? >> > > Combining enclosing marks in principle could work with most characters, > but in practice most arbitrary combinations do not work very well, > because they would require very complicated font support. > >> 4.In CJK Symbols and Punctuation, Proper name mark and Book name mark >> are not included. While there are charactera like U+2584, U+FE33, >> U+FE4F, and U+FE34 in unicode that is more or less a representation >> for the two symbol, they do not appear below or on the left of typed >> characters when text flow is horizontal/vertical, and instead, they >> occupy their own space which make them having little use in daily >> life, and while the proper name mark and book name mark can >> represented by text editing softwares and css but those representation >> are not ideal and they do match "Criteria for Encoding Symbols". Is it >> possible to make a new unicode symbol, or change some current symbol >> into one that could appear in suitable place of other characters when >> typed? And a property of the symbol is that when used in case like ? >> ??? which ?? and ?? are two different proper name (place name), >> so an underline should go below them without any separation between >> the character ?and? or ?and? (when text are written horizontally), >> but at the same time the underline should not be linked between ? and >> ? as ? is the end of first place name while ? is the start of the >> other. >> > > What you are talking about is, indeed, best handled by text styling > attributes, > rather than by individual character encoding. These are various CJK-specific > underlining styles (for horizontal text layout) or sidelining styles (for > vertical text layout). 
It is precisely because these require > highlighting for > ranges of characters (without breaks) that this kind of text decoration is > handled best by style attributes (or markup), rather than by individual > combining symbols. > > The characters U+FE33, U+FE34, U+FE4F (but not U+2584) are compatibility > characters only for mapping to old Chinese standards that had individual > characters encoded for these underlining or sidelining text highlights, > but which required specialized text layout programs to make any use > of them. > > --Ken > From wjgo_10009 at btinternet.com Sat May 30 03:47:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 30 May 2015 09:47:05 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> Responding to Doug Ewell: > I think this cuts to the heart of what people have been trying to say all along. > Historically, Unicode was not meant to be the means by which brand new ideas are run up the proverbial flagpole to see if they will gain traction. History is interesting and can be a good guide, yet many things that are an accepted part of Unicode today started as new ideas that gained traction and became implemented. So history should not be allowed to be a reason to restrict progress. For example, there was the extension from 1 plane to 17 planes. There was the introduction of emoji support. There was the introduction of the policy of colour sometimes being a recorded property rather than having just the original monochrome recording policy. There has been the change of encoding policy that facilitated the introduction of the Indian Rupee character into Unicode and ISO/IEC 10646 far more quickly than had been thought possible, so that the encoding was ready for use when needed. There has been the recent encoding policy change regarding encoding of pure electronic use items taking place without (extensive prior use using a Private Use Area encoding), such as the encoding of the UNICORN FACE. There is the recent change to the deprecation status of most of the tag characters and the acceptance of the base character followed by tag characters technique so as to allow the specifying of a larger collection of particular flags. ---- The two questions that I asked in my response to a post by Mark E. Shoulson are relevant here. Suppose that a plain text file is to include just one non-standard emoji graphic. How would that be done otherwise than by the format that I am suggesting? What if there were three such non-standard emoji graphics needed in the plain text file, the second graphic being used twice. How would that be done otherwise than by the format that I am suggesting? William Overington 30 May 2015 From andrewcwest at gmail.com Sat May 30 04:19:03 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sat, 30 May 2015 10:19:03 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55691764.4030802@att.net> References: <55691764.4030802@att.net> Message-ID: On 30 May 2015 at 02:50, Ken Whistler wrote: > > 1. I have seen a chinese character ??? from a Vietnamese dictionary NHAT > DUNG THUONG DAM DICTIONARY > > Extension F is harder to track down, because it has not yet been > approved by the UTC, and comes in two pieces, with different > progression so far in the ISO committee. Perhaps somebody on this list > who has better access to the relevant documents can let you > know whether ??? can be found in those sets. 
It's not in my lists of F1 and F2 characters. > 2. Is combined characters like U+20DD intended to work with all different > type of characters, or is it some problem related to implementation ? as I > when i write ?? (Japanese Hiragana Letter Yu + Combining Enclosing Circle) > appear to be separate on most font I use, but if I change the Hiragana Yu > into a conventional = sign or some latin character, most fonts are at least > somehow able to put them together. Or, is there any better/alternative > representation in unicode that can show japanese hiragana yu in a circle? > > Combining enclosing marks in principle could work with most characters, > but in practice most arbitrary combinations do not work very well, > because they would require very complicated font support. It's not that complicated, but I think most fonts don't support arbitrary combinations with combining enclosing circle because there is little or no demand for them. BabelStone Han displays Japanese Hiragana Letter Yu + Combining Enclosing Circle quite well, but on the other hand it does not work so well with CJK ideographs, and fails with Latin letters and punctuation. ? > 4.In CJK Symbols and Punctuation, Proper name mark and Book name mark are > not included. While there are charactera like U+2584, U+FE33, U+FE4F, and > U+FE34 in unicode that is more or less a representation for the two symbol, > they do not appear below or on the left of typed characters when text flow > is horizontal/vertical, and instead, they occupy their own space which make > them having little use in daily life, and while the proper name mark and > book name mark can represented by text editing softwares and css but those > representation are not ideal and they do match "Criteria for Encoding > Symbols". Is it possible to make a new unicode symbol, or change some > current symbol into one that could appear in suitable place of other > characters when typed? And a property of the symbol is that when used in > case like ???? which ?? and ?? are two different proper name (place name), > so an underline should go below them without any separation between the > character ?and? or ?and? (when text are written horizontally), but at the > same time the underline should not be linked between ? and ? as ? is the end > of first place name while ? is the start of the other. > > > What you are talking about is, indeed, best handled by text styling > attributes, rather than by individual character encoding. I agree. However, if you really do want to represent underlining of proper names at the character encoding level, then you would have to do something like put U+0332 Combining Low Line after each character to be underlined, and select a font that supports Combining Low Line with CJK ideographs. BabelStone Han supports this low-level method of underlining CJK ideographs, but if you want a space in the underlining between 美國 and 紐約 you would have to insert a very thin space (U+200A Hair Space in this example) between the characters. ? Andrew -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: MeiguoNiuyue.png Type: image/png Size: 27233 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: circled yu.png Type: image/png Size: 26781 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Sat May 30 04:22:34 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 30 May 2015 10:22:34 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <55665633.8040503@kli.org> <22772865.27600.1432813809365.JavaMail.defaultUser@defaultHost> <5567A7D6.6060102@kli.org> <211801.9901.1432888699177.JavaMail.defaultUser@defaultHost> Message-ID: <22703395.8755.1432977754153.JavaMail.defaultUser@defaultHost> Responding to Leo Broukhis: > A more common occurrence is the need to include a non-standard character in a text message, be it a ski piste symbol or an obscure CJK ideogram. Have you thought of embedding TrueType in Unicode? Not congruently so, yet, in effect, yes, as I have considered including individual OpenType-compatible glyphs in a base character followed by tag characters format. OpenType is a development from TrueType that can achieve more than can TrueType on its own. There is a little about this in the last two paragraphs of the following post. http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html There would need to be a few additions to make it work effectively: for example, a value for each of advance width, ascent maximum, descent maximum and fontunits per em. William Overington 30 May 2015 From idou747 at gmail.com Sat May 30 09:14:05 2015 From: idou747 at gmail.com (John) Date: Sat, 30 May 2015 07:14:05 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: Message-ID: <1432995244747.7fa720f5@Nodemailer> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure that what we are looking for here is static documents requiring a full programming language. But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full blown html. So every Java Swing component, every Apple gui component, every .NET component, every windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action, now the universal text format is HTML. But in this new world where anywhere that previously you could input text, you can now input full blown html, does that actually make sense? Does it make sense that you can, for example, put full blown HTML inside an H1 tag in html itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document? I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full blown rendering engines. It would be more likely to be something like the Unicode group. And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of unicode characters that should be rendered as such? Or would it be text that happens to contain greater than symbol, I, M and G? It would have to be the former I guess, and thereby there would no longer be a unicode symbol for the mathematical greater than symbol. Rather there would be a unicode symbol for opening a HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than.
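To illustrate with a contrived sketch (nothing here beyond the standard HTML entity rules, not any specific proposal):

    rendered text:  3 > 2, and <b> opens a tag
    stored markup:  3 &gt; 2, and &lt;b&gt; opens a tag

Every component that touches the text would have to agree on that escaping, everywhere, forever.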
Do we want HTML to be so pervasive? Not sure it deserves that. And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular embedded images as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what HTML elements in what particular circumstances constitute a "character". I guess in summary, yes, we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5. On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy , wrote: 2015-05-29 4:37 GMT+02:00 John : "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)". If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles, just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content. You may also use PUAs for the same purpose (however I have not seen how CSS allows to style individual characters in text elements as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, but would use the SVG font format which is valid in CSS, for defining a collection of glyphs). If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such approach, you don't even need to use classes on elements, you use plain-text with very compact PUAs (it's up to you to decide if the document must be standalone (embedding everything it needs) or must use external references for missing definitions, HTML allows both (and SVG as well when it contains plain-text elements). -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Sat May 30 13:50:21 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 30 May 2015 20:50:21 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> References: <1573163.7044.1432975625923.JavaMail.defaultUser@defaultHost> Message-ID: 2015-05-30 10:47 GMT+02:00 William_J_G Overington : > Responding to Doug Ewell: > > > I think this cuts to the heart of what people have been trying to say > all along. > > > Historically, Unicode was not meant to be the means by which brand new > ideas are run up the proverbial flagpole to see if they will gain traction. > > History is interesting and can be a good guide, yet many things that are > an accepted part of Unicode today started as new ideas that gained traction > and became implemented. So history should not be allowed to be a reason to > restrict progress. > > For example, there was the extension from 1 plane to 17 planes. > Actually this was a restriction of the UCS to *only* 17 planes. Before that the UCS contained 31-bit code points, i.e. 32768 planes! If you're speaking about the old Unicode 1.0, it was then still not the UCS, and it was incompatible with the UCS in many important parts; the initial target of Unicode was only to have an "industry standard" immediately usable between a few software providers (Unicode 1.0 was then not an international standard, forget it!). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat May 30 16:56:26 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 30 May 2015 23:56:26 +0200 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: <55677762.3060805@oracle.com> Message-ID: But observations show that the vertical stacking is not universal. Horizontal stacking is also used in direction signs. My opinion is that they are just two separate "diamonds" and not a single symbol. Quite equivalent to the situation with the classification of hotels with stars (generally aligned horizontally but not always, we can see them also arranged vertically, or on two rows 1+1, 1+2 or 2+1 or 2+3 or 3+2...) I don't think the exact layout of individual symbols (diamond, star, ...) is semantically significant, only their number is important (and the fact they are grouped together on the same medium with the same foreground/background colors or texturing and the same sizes). 2015-05-29 9:32 GMT+02:00 "Jörg Knappen" : > From the description of the symbol it looks like a geometric shape. I > think it is worth to be encoded as a geometric shape (TWO BLACK DIAMONDS > VERTICALLY STACKED or something like this) with a note * bunny hill. It may > have (or find in future) other uses. > > --Jörg Knappen > > *Gesendet:* Donnerstag, 28. Mai 2015 um 23:20 Uhr > *Von:* "Shervin Afshar" > *An:* "Shawn Steele" > *Cc:* "verdy_p at wanadoo.fr" , "unicode Unicode > Discussion" , "Jim Melton" > *Betreff:* Re: "Bunny hill" symbol, used in America for signaling ski > pistes for novices > Since the double-diamond has map and map legend usage, it might be a > good idea to have it encoded separately. I know that I'm stating the > obvious here, but the important point is doing the research and showing > that it has widespread usage. > > – Shervin > > On Thu, May 28, 2015 at 2:15 PM, Shawn Steele > wrote: >> >> I'm used to them being next to each other.
So the entire discussion >> seems to be about how to encode a concept vs how to get the shape you want >> with existing code points. If you just want the perfect shape, then maybe >> an svg is a better choice. If we?re talking about describing ski-run >> difficulty levels in plain-text, then the hodge-podge of glyphs being >> offered in this thread seems kinda hacky to me. >> >> >> >> -Shawn >> >> >> >> *From:* verdyp at gmail.com [mailto:verdyp at gmail.com] *On Behalf Of *Philippe >> Verdy >> *Sent:* Thursday, May 28, 2015 2:12 PM >> *To:* Jim Melton >> *Cc:* Shawn Steele; unicode Unicode Discussion >> *Subject:* Re: "Bunny hill" symbol, used in America for signaling ski >> pistes for novices >> >> >> >> Some documentations also suggest that the two diamonds are not stacked >> one above the other, but horizontally. It's a good point for using only one >> symbol, encoding it twice in plain-text if needed. >> >> >> >> 2015-05-28 22:15 GMT+02:00 Jim Melton : >> >> I no longer ski, but I did so for many years, mostly (but not >> exclusively) in the western United States. I never encountered, at any USA >> ski hill/mountain/resort, a special symbol for "bunny hills", which are >> typically represented by the green circle meaning "beginner". That's >> anecdotal evidence at best, but my observations cover numerous skiing >> sites. I have encountered such a symbol in Europe and in New Zealand, but >> not in the USA. (I have not had the pleasure of skiing in Canada and am >> thus unable to speak about ski areas in that country.) >> >> The double black diamond would appear to be a unique symbol worthy of >> encoding, simply because the only valid typographical representation (in >> the USA) is two single black diamonds stacked one above the other and >> touching at the points. >> >> Hope this helps, >> Jim >> >> >> On 5/28/2015 2:04 PM, Shawn Steele wrote: >> >> So is double black diamond a separate symbol? Or just two of the black >> diamond? >> >> >> >> And Blue-Black? >> >> >> >> I?m drawing a blank on a specific bunny sign, in my experience those are >> usually just green. >> >> >> >> Aren?t there a lot of cartography symbols for various systems that aren?t >> present in Unicode? >> >> >> >> *From:* Unicode [mailto:unicode-bounces at unicode.org >> ] *On Behalf Of *Philippe Verdy >> *Sent:* Thursday, May 28, 2015 12:47 PM >> *To:* unicode Unicode Discussion >> *Subject:* "Bunny hill" symbol, used in America for signaling ski pistes >> for novices >> >> >> >> Is there a symbol that can represent the "Bunny hill" symbol used in >> North America and some other American territories with mountains, to >> designate the ski pistes open to novice skiers (those pistes are signaled >> with green signs in Europe). >> >> >> >> I'm looking for the symbol itself, not the color, or the form of the sign. >> >> >> >> For example blue pistes in Europe are designed with a green circle in >> America, but we have a symbol for the circle; red pistes in Europe are >> signaled by a blue square in America, but we have a symbol for the square; >> black pistes in Europe are signaled by a black diamond in America, but we >> also have such "black" diamond in Unicode. >> >> >> >> But I can't find an equivalent to the American "Bunny hill" signal, >> equivalent to green pistes in Europe (this is a problem for webpages >> related to skiing: do we have to embed an image ?). 
>> >> >> >> >> >> -- >> >> ======================================================================== >> >> Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 >> >> Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345 >> >> Oracle Corporation Oracle Email: jim dot melton at oracle dot com >> >> 1930 Viscounti Drive Alternate email: jim dot melton at acm dot org >> >> Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com >> >> ======================================================================== >> >> = Facts are facts. But any opinions expressed are the opinions = >> >> = only of myself and may or may not reflect the opinions of anybody = >> >> = else with whom I may or may not have discussed the issues at hand. = >> >> ======================================================================== >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat May 30 18:21:44 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 30 May 2015 16:21:44 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Note: Everything below is my personal opinion and does not represent any official Unicode Consortium or UTC position. William_J_G Overington wrote: >> Historically, Unicode was not meant to be the means by which brand >> new ideas are run up the proverbial flagpole to see if they will gain >> traction. > > History is interesting and can be a good guide, yet many things that > are an accepted part of Unicode today started as new ideas that gained > traction and became implemented. So history should not be allowed to > be a reason to restrict progress. I used "historically" to distinguish between the pre- and post-Emoji Revolution eras. There have clearly been changes recently, but there is still at least a minimal expectation that proposed characters will fulfill a demonstrated need. I'm not seeing any truly novel, untested ideas in the list below that Unicode implemented purely on speculation. > For example, there was the extension from 1 plane to 17 planes. That was an architectural extension, brought about by the realization that 64K code points wasn't enough for even the original scope. There's no comparison. > There was the introduction of emoji support. Emoji proponents would argue that "emoji support" began in 1.0 with the inclusion of various dingbats. But even emoji are arguably "characters" in some sense. They aren't a mini-language used to define images pixel by pixel. > There was the introduction of the policy of colour sometimes being a > recorded property rather than having just the original monochrome > recording policy. There isn't any such policy. There is a variation selector to suggest that the rendering engine show certain characters in "emoji style" instead of "text style," and there are characters with colors in their names, but there is no policy that specific colors are "recorded" as part of the encoding. YELLOW HEART could conformantly appear in any color. > There has been the change of encoding policy that facilitated the > introduction of the Indian Rupee character into Unicode and ISO/IEC > 10646 far more quickly than had been thought possible, so that the > encoding was ready for use when needed. That's not a change to what types of things get encoded. 
It's a procedural change, one which I would agree has been applied with increasing creativity. > There has been the recent encoding policy change regarding encoding of > pure electronic use items taking place without (extensive prior use > using a Private Use Area encoding), such as the encoding of the > UNICORN FACE. This is probably your best analogy. People like Asmus have addressed it, saying it's not reasonable to expect users to adopt PUA solutions and wait for them to catch on. > There is the recent change to the deprecation status of most of the > tag characters and the acceptance of the base character followed by > tag characters technique so as to allow the specifying of a larger > collection of particular flags. There must have been a great wailing and gnashing of teeth over that decision. So many statements were made over the years about the basic evilness of tag characters. But the concept of representing flags was already agreed upon as a "compatibility" measure, and the Regional Indicator Symbols solution was a compromise that allowed expansion beyond the 10 flags that Japanese telcos chose to include. RIS were an architectural decision. The tag solution (to be fully outlined in a future PRI) was another architectural decision. Neither (I believe) is analogous to a scope decision to start encoding different types of non-character things as if they were characters, and as I have said before, assigning a glyph to a thing that isn't a character doesn't make it one. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From Shawn.Steele at microsoft.com Sat May 30 18:34:38 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sat, 30 May 2015 23:34:38 +0000 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: <55677762.3060805@oracle.com> Message-ID: I guess it depends on what you're representing. If it is the concept of "double black", then maybe a separate symbol and the "font" or other selectors determine if it's vertically or horizontally rendered. From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: Saturday, May 30, 2015 2:56 PM To: Jörg Knappen Cc: Shervin Afshar; unicode Unicode Discussion Subject: Re: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices But observations show that the vertical stacking is not universal. Horizontal stacking is also used in direction signs. My opinion is that they are just two separate "diamonds" and not a single symbol. Quite equivalent to the situation with the classification of hotels with stars (generally aligned horizontally but not always, we can see them also arranged vertically, or on two rows 1+1, 1+2 or 2+1 or 2+3 or 3+2...) I don't think the exact layout of individual symbols (diamond, star, ...) is semantically significant, only their number is important (and the fact they are grouped together on the same medium with the same foreground/background colors or texturing and the same sizes). 2015-05-29 9:32 GMT+02:00 "Jörg Knappen" >: From the description of the symbol it looks like a geometric shape. I think it is worth to be encoded as a geometric shape (TWO BLACK DIAMONDS VERTICALLY STACKED or something like this) with a note * bunny hill. It may have (or find in future) other uses. --Jörg Knappen Gesendet: Donnerstag, 28.
Mai 2015 um 23:20 Uhr Von: "Shervin Afshar" > An: "Shawn Steele" > Cc: "verdy_p at wanadoo.fr" >, "unicode Unicode Discussion" >, "Jim Melton" > Betreff: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices Since the double-diamond has map and map legend usage, it might be a good idea to have it encoded separately. I know that I'm stating the obvious here, but the important point is doing the research and showing that it has widespread usage. ? Shervin On Thu, May 28, 2015 at 2:15 PM, Shawn Steele > wrote: I?m used to them being next to each other. So the entire discussion seems to be about how to encode a concept vs how to get the shape you want with existing code points. If you just want the perfect shape, then maybe an svg is a better choice. If we?re talking about describing ski-run difficulty levels in plain-text, then the hodge-podge of glyphs being offered in this thread seems kinda hacky to me. -Shawn From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Thursday, May 28, 2015 2:12 PM To: Jim Melton Cc: Shawn Steele; unicode Unicode Discussion Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices Some documentations also suggest that the two diamonds are not stacked one above the other, but horizontally. It's a good point for using only one symbol, encoding it twice in plain-text if needed. 2015-05-28 22:15 GMT+02:00 Jim Melton >: I no longer ski, but I did so for many years, mostly (but not exclusively) in the western United States. I never encountered, at any USA ski hill/mountain/resort, a special symbol for "bunny hills", which are typically represented by the green circle meaning "beginner". That's anecdotal evidence at best, but my observations cover numerous skiing sites. I have encountered such a symbol in Europe and in New Zealand, but not in the USA. (I have not had the pleasure of skiing in Canada and am thus unable to speak about ski areas in that country.) The double black diamond would appear to be a unique symbol worthy of encoding, simply because the only valid typographical representation (in the USA) is two single black diamonds stacked one above the other and touching at the points. Hope this helps, Jim On 5/28/2015 2:04 PM, Shawn Steele wrote: So is double black diamond a separate symbol? Or just two of the black diamond? And Blue-Black? I?m drawing a blank on a specific bunny sign, in my experience those are usually just green. Aren?t there a lot of cartography symbols for various systems that aren?t present in Unicode? From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: Thursday, May 28, 2015 12:47 PM To: unicode Unicode Discussion Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe). I'm looking for the symbol itself, not the color, or the form of the sign. For example blue pistes in Europe are designed with a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such "black" diamond in Unicode. 
But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image ?). -- ======================================================================== Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144 Chair, ISO/IEC JTC1/SC32 and W3C XML Query WG Fax : +1.801.942.3345 Oracle Corporation Oracle Email: jim dot melton at oracle dot com 1930 Viscounti Drive Alternate email: jim dot melton at acm dot org Sandy, UT 84093-1063 USA Personal email: SheltieJim at xmission dot com ======================================================================== = Facts are facts. But any opinions expressed are the opinions = = only of myself and may or may not reflect the opinions of anybody = = else with whom I may or may not have discussed the issues at hand. = ======================================================================== -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Sat May 30 19:34:50 2015 From: prosfilaes at gmail.com (David Starner) Date: Sun, 31 May 2015 00:34:50 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> References: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Message-ID: I would say that a system would conform with Unicode in having yellow heart red (in a non-monochrome font) as well as if it made it a cross. Either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant. On 4:25pm, Sat, May 30, 2015 Doug Ewell wrote: > Note: Everything below is my personal opinion and does not represent any > official Unicode Consortium or UTC position. > > William_J_G Overington > wrote: > > >> Historically, Unicode was not meant to be the means by which brand > >> new ideas are run up the proverbial flagpole to see if they will gain > >> traction. > > > > History is interesting and can be a good guide, yet many things that > > are an accepted part of Unicode today started as new ideas that gained > > traction and became implemented. So history should not be allowed to > > be a reason to restrict progress. > > I used "historically" to distinguish between the pre- and post-Emoji > Revolution eras. There have clearly been changes recently, but there is > still at least a minimal expectation that proposed characters will > fulfill a demonstrated need. > > I'm not seeing any truly novel, untested ideas in the list below that > Unicode implemented purely on speculation. > > > For example, there was the extension from 1 plane to 17 planes. > > That was an architectural extension, brought about by the realization > that 64K code points wasn't enough for even the original scope. There's > no comparison. > > > There was the introduction of emoji support. > > Emoji proponents would argue that "emoji support" began in 1.0 with the > inclusion of various dingbats. But even emoji are arguably "characters" > in some sense. They aren't a mini-language used to define images pixel > by pixel. > > > There was the introduction of the policy of colour sometimes being a > > recorded property rather than having just the original monochrome > > recording policy. > > There isn't any such policy. 
There is a variation selector to suggest > that the rendering engine show certain characters in "emoji style" > instead of "text style," and there are characters with colors in their > names, but there is no policy that specific colors are "recorded" as > part of the encoding. YELLOW HEART could conformantly appear in any > color. > > > There has been the change of encoding policy that facilitated the > > introduction of the Indian Rupee character into Unicode and ISO/IEC > > 10646 far more quickly than had been thought possible, so that the > > encoding was ready for use when needed. > > That's not a change to what types of things get encoded. It's a > procedural change, one which I would agree has been applied with > increasing creativity. > > > There has been the recent encoding policy change regarding encoding of > > pure electronic use items taking place without (extensive prior use > > using a Private Use Area encoding), such as the encoding of the > > UNICORN FACE. > > This is probably your best analogy. People like Asmus have addressed it, > saying it's not reasonable to expect users to adopt PUA solutions and > wait for them to catch on. > > > There is the recent change to the deprecation status of most of the > > tag characters and the acceptance of the base character followed by > > tag characters technique so as to allow the specifying of a larger > > collection of particular flags. > > There must have been a great wailing and gnashing of teeth over that > decision. So many statements were made over the years about the basic > evilness of tag characters. > > But the concept of representing flags was already agreed upon as a > "compatibility" measure, and the Regional Indicator Symbols solution was > a compromise that allowed expansion beyond the 10 flags that Japanese > telcos chose to include. RIS were an architectural decision. The tag > solution (to be fully outlined in a future PRI) was another > architectural decision. Neither (I believe) is analogous to a scope > decision to start encoding different types of non-character things as if > they were characters, and as I have said before, assigning a glyph to a > thing that isn't a character doesn't make it one. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shawn.Steele at microsoft.com Sat May 30 22:02:11 2015 From: Shawn.Steele at microsoft.com (Shawn Steele) Date: Sun, 31 May 2015 03:02:11 +0000 Subject: "Bunny hill" symbol, used in America for signaling ski pistes for novices In-Reply-To: References: Message-ID: I?m really curious to see one of these signs. Is it a regional thing? From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Leonardo Boiko Sent: Thursday, May 28, 2015 1:02 PM To: Philippe Verdy Cc: unicode Unicode Discussion Subject: Re: "Bunny hill" symbol, used in America for signaling ski pistes for novices You could use U+1F407 RABBIT combined with U+20E4 COMBINING ENCLOSING UPWARD POINTING TRIANGLE, and pretend the triangle is a hill. ?? ? If only we had a combining rabbit, we could add rabbits to U+1F3D4 SNOW CAPPED MOUNTAIN. Or anything else. 2015-05-28 16:46 GMT-03:00 Philippe Verdy >: Is there a symbol that can represent the "Bunny hill" symbol used in North America and some other American territories with mountains, to designate the ski pistes open to novice skiers (those pistes are signaled with green signs in Europe). 
I'm looking for the symbol itself, not the color, or the form of the sign. For example blue pistes in Europe are designed with a green circle in America, but we have a symbol for the circle; red pistes in Europe are signaled by a blue square in America, but we have a symbol for the square; black pistes in Europe are signaled by a black diamond in America, but we also have such "black" diamond in Unicode. But I can't find an equivalent to the American "Bunny hill" signal, equivalent to green pistes in Europe (this is a problem for webpages related to skiing: do we have to embed an image ?). -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun May 31 03:43:12 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Sun, 31 May 2015 16:43:12 +0800 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: <55694EAD.6030604@hiroshima-u.ac.jp> References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID: Thanks for the answers. As for ??? versus ???: as I don't have much knowledge about Vietnamese, and the character is from chu han instead of chu nom, I don't really know whether there are any semantic differences between the two, but at least the one usage of ??? shown in the word on that dictionary page would be something like "dumb, mute", which was not listed as part of the meaning of the character ? in Wiktionary. And for the proper name mark and book name mark: while I see the point that they would be best achieved via word processor styling or a markup language, is it then a good idea to integrate things similar to a markup language into Unicode, like creating a character ps that indicates the start of a proper name mark and pe for the end of a proper name mark, so that typing psPROPERNAMEpe would result in something similar to PROPERNAME? And when using the workaround suggested by Andrew, yes, the hair space works, but it adds a gap between the characters with a width equal to an 'i'. I have also tried characters like U+200C or U+034F, which do not work. And it seems that BabelStone Han does not support U+1AB6? And is there any vertical edition of the two characters... -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Sun May 31 06:05:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 31 May 2015 04:05:10 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1432995244747.7fa720f5@Nodemailer> References: <1432995244747.7fa720f5@Nodemailer> Message-ID: <556AEAE6.2040203@ix.netcom.com> John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML. But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations". There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it. What people seem to have in mind is something like "inline" text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
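For instance, a fragment like this stays entirely within such a subset (an HTML-flavoured sketch only, with a hypothetical double-diamond.svg; the concrete syntax is beside the point, the feature set is what matters):

    Ski the <b>double black</b> run
    <img src="double-diamond.svg" alt="double black diamond" style="height:1em">
    if you dare.

Everything in it wraps, line breaks and measures like ordinary characters, while an <h1> or a <table> would fall outside the subset.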
With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such inline format. Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure. The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HMTL or whatever else they use for interchange. Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc. one would need a way to manage references to them in the local context. If your skeptical position proves correct in that this is something that turns out to not be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken. A./ On 5/30/2015 7:14 AM, John wrote: > > Hmm, these "once entities" of which you speak, do they require > javascript? Because I'm not sure what we are looking for here is > static documents requiring a full programming language. > > But let's say for a moment that html5 can, or could do the job here. > Then to make the dream come true that you could just cut and paste > text that happened to contain a custom character to somewhere else, > and nothing untoward would happen, would mean that everything in the > computing universe should allow full blown html. So every Java Swing > component, every Apple gui component, every .NET component, every > windows component, every browser, every Android and IOS component > would allow text entry of HTML entities. OK, so let's say everyone > agrees with this course of action, now the universal text format is HTML. > > But in this new world where anywhere that previously you could input > text, you can now input full blown html, does that actually make > sense? Does it make sense that you can for example, put full blown > HTML inside a H1 tag in html itself? That's a lot of recursion going > on there. Or in a MS-Excel cell? Or interspersed in some otherwise > fairly regular text in a Word document? > > I suppose someone could define a strict limited subset of HTML to be > that subset that makes sense in ALL textual situations. That subset > would be something like just defining things that act like characters, > and not like a full blown rendering engine. But who would define that > subset? Not the HTML groups, because their mandate is to define full > blown rendering engines. It would be more likely to be something like > the unicode group. > > And also, in this brave new world where HTML5 is the new standard text > format, what would the binary format of it be? I mean, if I have the > string of unicode characters that should be rendered as such? 
Or would it be text that happens to > contain greater than symbol, I, M and G? It would have to be the > former I guess, and thereby there would no longer be a unicode symbol > for the mathematical greater than symbol. Rather there would be a > unicode symbol for opening a HTML tag, and the text code for greater > than would be > Never again would a computer store > to mean > greater than. Do we want HTML to be so pervasive? Not sure it deserves > that. > > And from a programmers point of view, he wants to be able to iterate > over an array of characters and treat each one the same way, > regardless if it is a custom character or not. Without that kind of > programmatic abstraction, the whole thing can never gain traction. I > don't think fully blown HTML embedded in your text can fulfill that. A > very strictly defined subset, possibly could. Sure HTML5 can RENDER > stuff adquately, if the only aim of the game is provide a correct > rendering. But to be able to actually treat particular images embedded > as characters, and have some programming library see that abstraction > consistently, I'm not sure I'm convinced that is possible. Not without > nailing down exactly what html elements in what particular > circumstances constitute a "character". > > I guess in summary, yes we have the technology already to render > anything. But I don't think the whole standards framework does > anything to allow the computing universe to actually exchange custom > characters as if they were just any other text. Someone would actually > have to work on a standard to do that, not just point to html5. > > > On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy > >, wrote: > > > 2015-05-29 4:37 GMT+02:00 John >: > > "Today the world goes very well with HTML(5) which is now the > bext markup language for document (including for inserting > embedded images that don?t require any external request? > If I had a large document that reused a particular character > thousands of times, would this HTML markup require embedding > that character thousands of times, or could I define the > character once at the beginning of the sequence, and then > refer back to it in a space efficient way? > > > HTML(5) allows defining *once* entities for images that can then > be reused thousands of times without repeting their definition. > You can do this as well with CSS styles, just define a class for a > small element. This element may still be an "image", but the > semantic is carried by the class you assign to it. You are not > required to provide an external source URL for that image if the > CSS style provides the content. > > You may also use PUAs for the same purpose (however I have not > seen how CSS allows to style individual characters in text > elements as these characters are not elements, and there's no > defined selector for pseudo-elements matching a single character). > PUAs are perfectly usable in the situation where you have embedded > a custom font in your document for assigning glyphs to characters > (you can still do that, but I would avoid TrueType/OpenType for > this purpose, but would use the SVG font format which is valid in > CSS, for defining a collection of glyphs). > > If the document is not restricted to be standalone, of course you > can use links to an external shared CSS stylesheet and to this SVG > font referenced by the stylesheet. 
With such approach, you don't > even need to use classes on elements, you use plain-text with very > compact PUAs (it's up to you to decide if the document must be > standalone (embedding everything it needs) or must use external > references for missing definitions, HTML allows both (and SVG as > well when it contains plain-text elements). > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Sun May 31 06:42:41 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 12:42:41 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID: On 31 May 2015 at 09:43, gfb hjjhjh wrote: > > As of ??? versus ???, as I don't have much knowledge about Vietnamese and > the character is from chu han instead of chu nom, I don't really know if > there are any semantic difference between the two, but at least the one > usage of ??? shown in the word on that dictionary page would be something > like "dumb, mute" which were not listed as part of the meaning of the > character ? in wiktionary. The way CJK unification works, you don't need to show that there is a semantic difference between the two forms, just that the form is used in a reputable source. Can you send me off-list a scan of the character from the Vietnamese dictionary you mention? > And for the proper name mark and book name mark, while i see the point that > it wiuld be best achieve via word processor styling or markup language, so > is it a good idea to integrating things similar to markup language into > unicode, like create a character ps that indicate start of proper name mark > and pe for end of proper name mark, then typing psPROPERNAMEpe would result > in something similar to PROPERNAME? I think you can achieve the appropriate styling for web pages using CSS: http://www.w3.org/TR/2013/WD-css-text-decor-3-20130103/#text-decoration-style-property > And if using the work around suggested by Andrew, yes the hair space work > but it a distance between characters a gap with width equal to an 'i'. Have > also tried characters like u+200c or u+034f which does not work. Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent (I don't think it is impossible, but the only way I can think of doing it is rather unpleasant; perhaps other font experts on this list know an easy way of doing it). > and it seem > like babelstone han is not supporting U+1AB6? U+1AB6 is supported in the next release of BabelStone Han (due for release very soon, probably within the next week or two). > and is there any vertical > edition of the two characters... The combining underline and wavy line characters will work OK with a vertically oriented CJK font (they will display on the left). Unfortunately BabelStone does not currently work very well in vertical orientation. Andrew From idou747 at gmail.com Sun May 31 07:33:44 2015 From: idou747 at gmail.com (John) Date: Sun, 31 May 2015 05:33:44 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556AEAE6.2040203@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> Message-ID: <1433075623556.38b645ad@Nodemailer> Yes, Asmus good post. But I don?t really think HTML, even a subset, is really the right solution. 
> And if using the work around suggested by Andrew, yes the hair space works, but it puts between the characters a gap with a width equal to an 'i'. Have also tried characters like U+200C or U+034F, which do not work.

Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent (I don't think it is impossible, but the only way I can think of doing it is rather unpleasant; perhaps other font experts on this list know an easy way of doing it).

> and it seems like BabelStone Han is not supporting U+1AB6?

U+1AB6 is supported in the next release of BabelStone Han (due for release very soon, probably within the next week or two).

> and is there any vertical edition of the two characters...

The combining underline and wavy line characters will work OK with a vertically oriented CJK font (they will display on the left). Unfortunately BabelStone does not currently work very well in vertical orientation.

Andrew

From idou747 at gmail.com Sun May 31 07:33:44 2015 From: idou747 at gmail.com (John) Date: Sun, 31 May 2015 05:33:44 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556AEAE6.2040203@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> Message-ID: <1433075623556.38b645ad@Nodemailer>

Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.

I'm reminded of the design of XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain some unique identifiers of that XML schema, which happen to be URLs. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn't know about that schema, could go to that URL, download the schema, and check that the XML conforms to that schema.

Similarly, imagine a text format that had a header with something like:

\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345

Now all the characters following in the text will interpret characters that start with 12345 with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/? You might find bitmaps, truetype fonts, vector graphics, etc. You might find many, many representations of that character set that your rendering engine could cache for future use. The text format wouldn't be reliant on today's favorite rendering technology, whether bitmap, truetype fonts, or whatever. Right now, if you go to a website that references unicode that your platform doesn't know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see the characters, even if their platform wasn't previously aware of them, and the format would be independent of today's rendering technologies. Let's face it, HTML5 changes every few years, and I don't think anybody wants the fundamental textual representation dependent on an entire layout engine. And also, the whole range of what HTML5 can do, even some subset, is too much information. You don't necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience. Emojis by major messaging platforms. Maybe characters related to specialised domains like, I don't know, mapping or specialised work domains or whatever. But without having to be subservient to the central unicode committee.

As someone who is a keen user of Facebook messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that unicode has defined.

-- Chris
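A rough sketch in Python of how a receiving engine might read such a header; the \uCHARSET syntax, the URL, and the 12345 prefix are all John's invented example, not an existing format:

    import re

    # John's hypothetical declaration: \uCHARSET:<url>,<prefix>
    HEADER = re.compile(r"\\uCHARSET:(?P<url>[^,\s]+),(?P<prefix>\d+)")

    def charset_table(text):
        """Map each declared prefix to the URL where renderings
        (bitmaps, fonts, vector graphics...) could be fetched and cached."""
        return {int(m["prefix"]): m["url"] for m in HEADER.finditer(text)}

    doc = "\\uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345\n..."
    print(charset_table(doc))
    # {12345: 'facebook.com/charsets/pusheen-the-cat-emoji/'}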
On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) wrote:

> John, reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML.
>
> But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations".
>
> There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.
>
> What people seem to have in mind is something like "inline" text. Something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by the general behavior of inline text: a string of it, laid out, must wrap and line break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
>
> With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format.
>
> Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure.
>
> The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange.
>
> Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources - if the inline format allows images, custom fonts, etc. one would need a way to manage references to them in the local context.
>
> If your skeptical position proves correct in that this is something that turns out to not be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken.
>
> A./
>
> On 5/30/2015 7:14 AM, John wrote:
>
>> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure what we are looking for here is static documents requiring a full programming language.
>>
>> But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true that you could just cut and paste text that happened to contain a custom character to somewhere else, and nothing untoward would happen, would mean that everything in the computing universe should allow full blown html. So every Java Swing component, every Apple gui component, every .NET component, every windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action, now the universal text format is HTML.
>>
>> But in this new world where anywhere that previously you could input text, you can now input full blown html, does that actually make sense? Does it make sense that you can, for example, put full blown HTML inside a H1 tag in html itself? That's a lot of recursion going on there. Or in a MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document?
>>
>> I suppose someone could define a strict limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full blown rendering engines. It would be more likely to be something like the unicode group.
>>
>> And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, if I have the string of unicode characters <img ...>, is that an image definition that should be rendered as such? Or would it be text that happens to contain greater than symbol, I, M and G? It would have to be the former I guess, and thereby there would no longer be a unicode symbol for the mathematical greater than symbol. Rather there would be a unicode symbol for opening a HTML tag, and the text code for greater than would be &gt;. Never again would a computer store > to mean greater than. Do we want HTML to be so pervasive? Not sure it deserves that.
>>
>> And from a programmers point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think fully blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular images embedded as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what html elements in what particular circumstances constitute a "character".
>>
>> I guess in summary, yes we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5.
>>
>> On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy wrote:
>>
>> 2015-05-29 4:37 GMT+02:00 John:
>>
>> "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)."
>> If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way?
>>
>> HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantic is carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.
>>
>> You may also use PUAs for the same purpose (however I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would instead use the SVG font format, which is valid in CSS, for defining a collection of glyphs).
>>
>> If the document is not restricted to be standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs (it's up to you to decide if the document must be standalone, embedding everything it needs, or must use external references for missing definitions; HTML allows both, and SVG as well when it contains plain-text elements).

From andrewcwest at gmail.com Sun May 31 07:55:50 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 13:55:50 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> <55694EAD.6030604@hiroshima-u.ac.jp> Message-ID:

On 31 May 2015 at 12:42, Andrew West wrote:

> Even with OpenType it is not easy to contextually create a gap between two combining underlines as the characters are not adjacent [...]

Ignore that, I wasn't thinking straight. It can be done easily using OpenType.

Andrew

From jsbien at mimuw.edu.pl Sun May 31 09:32:36 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 16:32:36 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE Message-ID: <86lhg43ji3.fsf@mimuw.edu.pl>

I'm curious what was the motivation for adding the character to Unicode. I understand the proposal is somewhere in the archives; perhaps it is available on the Internet?

The only usage I'm aware of (with the exception of my own for historical Polish) is that found in Wiktionary: ⱥ is also used for the sign for avo, the small form of Pataca.

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From andrewcwest at gmail.com Sun May 31 09:56:32 2015 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 31 May 2015 15:56:32 +0100 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <86lhg43ji3.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID:

On 31 May 2015 at 15:32, Janusz S. Bień wrote:

> I'm curious what was the motivation for adding the character to Unicode. I understand the proposal is somewhere in the archives; perhaps it is available on the Internet?

Please see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2942.doc.

Andrew

From gansmann at uni-bonn.de Sun May 31 10:01:36 2015 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Sun, 31 May 2015 17:01:36 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <86lhg43ji3.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID:

On Sun, 31 May 2015 16:32:36 +0200, Janusz S. Bień wrote:

> I'm curious what was the motivation for adding the character to Unicode.

According to the Code Chart for Latin Extended-B (http://www.unicode.org/charts/PDF/U0180.pdf), it's used for Sencoten. It was also used in some old Norwegian texts (for a start, see here: http://en.wikipedia.org/wiki/Christian_Kølle).
From jsbien at mimuw.edu.pl Sun May 31 10:03:32 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 17:03:32 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID: <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl>

Quote/Cytat - Andrew West (Sun 31 May 2015 04:56:32 PM CEST):

> Please see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2942.doc.

Thank you very much for your quick answer!

Would you be so kind as to point me to the proposal for the upper case of "A WITH STROKE", or advise me how to look for it in the archive?

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From jsbien at mimuw.edu.pl Sun May 31 10:17:57 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 17:17:57 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: References: <86lhg43ji3.fsf@mimuw.edu.pl> Message-ID: <20150531171757.141310hr7rh5t4px@mail.mimuw.edu.pl>

Quote/Cytat - Gerrit Ansmann (Sun 31 May 2015 05:01:36 PM CEST):

> According to the Code Chart for Latin Extended-B (http://www.unicode.org/charts/PDF/U0180.pdf), it's used for Sencoten. It was also used in some old Norwegian texts (for a start, see here: http://en.wikipedia.org/wiki/Christian_Kølle).

Thank you very much for the link about old Norwegian (I was aware of Sencoten).

Best regards

JSB

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmus-inc at ix.netcom.com Sun May 31 10:50:05 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 31 May 2015 08:50:05 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1433075623556.38b645ad@Nodemailer> References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> Message-ID: <556B2DAD.6050204@ix.netcom.com>

On 5/31/2015 5:33 AM, Chris-as-John wrote:

> Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.

The longer I think about this, what would be needed would be something like an "abstract" format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.

There would still be a place for a character set, that is Unicode, as an efficient way to implement the most basic and most standard features of text contents, but perhaps with some extension mechanism that can handle various extensions.

The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention).

The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).
And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.

And so on.

It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).

By keeping the specification abstract, you could accommodate both SGML-like formats where ascii-string markup is intermixed with the text, as well as pure text buffers with placeholder code points and links to external data.

But, however bored you are with plain Unicode emoji, as long as there isn't an agreed upon common format for rich "inline text", I see very little chance that those cute facebook emoji will do anything other than firmly keep you in that particular ghetto.

A./

> I'm reminded of the design of XML itself [...]
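As a toy illustration of the "inline text" abstraction being discussed (not any existing API): embedded objects behave like characters, and a program can iterate by logical unit without caring which kind each unit is.

    from dataclasses import dataclass

    @dataclass
    class EmbeddedObject:      # a sticker, custom glyph, or one-off image
        source_url: str        # external resource a renderer may fetch
        alt_text: str          # plain-text fallback

    # "Inline text" as a sequence of logical units: each unit is either an
    # ordinary character (str) or an EmbeddedObject.
    def to_plain_text(units):
        return "".join(u if isinstance(u, str) else u.alt_text for u in units)

    msg = ["H", "i", " ", EmbeddedObject("//example.net/cat.svg", "[cat]")]
    print(to_plain_text(msg))  # -> "Hi [cat]"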
From frederic.grosshans at gmail.com Sun May 31 11:20:31 2015 From: frederic.grosshans at gmail.com (Frédéric Grosshans) Date: Sun, 31 May 2015 18:20:31 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> Message-ID: <556B34CF.2040106@gmail.com>

On 31/05/2015 17:03, Janusz S. Bień wrote:

> Would you be so kind as to point me to the proposal for the upper case of "A WITH STROKE", or advise me how to look for it in the archive?
The upper case was introduced for Sencoten, and the proposal is here: http://www.unicode.org/L2/L2004/04170-sencoten.pdf (found by googling sencoten site:unicode.org)

Frédéric

From doug at ewellic.org Sun May 31 11:44:24 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 31 May 2015 10:44:24 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150530162143.665a7a7059d7ee80bb4d670165c8327d.d600649964.wbe@email03.secureserver.net> Message-ID: <4BC2309D56004EFFA43592EBE3248D2E@DougEwell>

David Starner wrote:

> I would say that a system would conform with Unicode in having YELLOW HEART red (in a non-monochrome font) as well as if it made it a cross. Either way it's violating character identity. I'd say that being monochromatic is now like being monospaced; it's suboptimal for a Unicode implementation, but hardly something Unicode can condemn as nonconformant.

This seems fair and sensible. My main point was that being monochromatic (i.e. black) is conformant, and was an attempt to challenge the statement about character color "sometimes being a recorded property." I don't see any Unicode character properties that identify color, only character names, which don't carry property information.

-- Doug Ewell | http://ewellic.org | Thornton, CO

From jsbien at mimuw.edu.pl Sun May 31 13:05:49 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 31 May 2015 20:05:49 +0200 Subject: the usage of LATIN SMALL LETTER A WITH STROKE In-Reply-To: <556B34CF.2040106@gmail.com> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> Message-ID: <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>

Quote/Cytat - Frédéric Grosshans (Sun 31 May 2015 06:20:31 PM CEST):

> The upper case was introduced for Sencoten, and the proposal is here: http://www.unicode.org/L2/L2004/04170-sencoten.pdf

Thank you very much for both pieces of information. The proposal makes me curious about past and present Unicode policy, e.g. whether it would be accepted if submitted now. But this is a completely different question, to which I will perhaps return in the future.

Thanks again to all who responded.

Best regards

Janusz

-- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From verdy_p at wanadoo.fr Sun May 31 16:26:45 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 31 May 2015 23:26:45 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556B2DAD.6050204@ix.netcom.com> References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> Message-ID:

The "abstract format" already exists, also for HTML, with the MIME "charset" extension of the media type "text/plain" (it can also be embedded in a meta tag, where the HTML source file is just stored in a filesystem, so that a webserver can parse it and provide the correct MIME header if the webserver has no repository for metadata and must infer the media type from the file content itself with some guesser).

It also exists in various conventions for source code (recognized by editors such as vi(m) or Emacs), or for Unix shells, using embedded "magic" identifiers near the top of the file.

You can use it to send an identifier for a private charset without having to request a registration of the charset in the IANA database (which is not intended for private encodings). The private charset can be named in a unique way (consider using a private charset name based on a domain name you own, such as "x-www.example.net-mycharset-1" if you own the domain name "example.net"). It will be enough for the initial experimentation for a few years (or more, provided that you renew this domain name). Your charset can contain various definitions: a mapping of your codepoints (including PUAs, or standard codepoints, or "hacked" codepoints if you have no other solution to get the correct character properties working with existing algorithms such as case mappings, collation, or layout behavior in text renderers).

Such a solution would allow a more predictable management of PUAs, by allowing control of their scope of use: they are bound, in some magic header of the document, to a private charset that remains reasonably unique. For example, "x-example.net-mycharset-1" would map to a URL like "//www.example.net/mycharset/1/" containing some schema (it could be the base address of an XML or JSON file, of a web font containing the relevant glyphs, and of a character properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files; the URL is just an identifier and your file can still be used in standalone mode, just like you can parse many standard XML schemas by just recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private "charsets" that can be recognized).

----

Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset (it's not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning where some features of a schema are not fully upward compatible).

Yes, this complicates things a bit, but much less than when using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol defined for such negotiation (TUS says that PUAs are usable and interchangeable under "private mutual agreement" but still provides no scheme for supporting such mutual agreement, and for this reason PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties).

So let's think about securing the use of PUAs with some identification scheme (for plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the "magic" header tricks that have long been used by charset guessers, including for autodetecting UTF-8 encoded files).

This would also solve the chicken-and-egg problem where we need more sources to attest an effective usage before encoding new characters, but developing this usage is extremely difficult (and much slower) with our modern technologies where most documents are now handled numerically (in the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as the Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of Emojis for their instant messaging platforms, with tons of developers working for them to develop a wide range of services around them...)

However, I'm not saying that Unicode should specify how such a private charset containing private assignments could be inserted in headers. I just think that it should describe a mechanism and give examples of how common text formats are already used to convey some "magic" identifiers near the top of the file; then we could describe a service allowing one to locate and retrieve the associated definitions for this identifier, and some interchangeable format for this information.

2015-05-31 17:50 GMT+02:00 Asmus Freytag (t):

> [...]
From idou747 at gmail.com Sun May 31 18:33:49 2015 From: idou747 at gmail.com (Chris) Date: Mon, 1 Jun 2015 09:33:49 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> Message-ID: <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>

Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That's why a standard would be useful. And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn't necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets. And while it is tempting to allow the "container" to define the "header" information, whether the container be HTML defining something in its HEAD tag, or some proprietary format (MS-Word), or whatever, that doesn't really solve anybody's problem in a standard way. For a start, what if you want to copy text to the clipboard? You want the thing receiving it to be able to interpret it in a self-contained way.

The 2 obvious implementations for a standard seem to be:

1) A standard (optional) header. Perhaps if the string starts with a special character, then follows a header defining charsets first. These would allocate character ranges for custom characters, and point to where their renderings can be found. Standard programming libraries on all platforms would invisibly act appropriately on these headers. If you concatenated strings with conflicting namespaces, standard libraries would seamlessly reallocate one of the custom namespaces and merge the headers.

2) Make a new character set, let's call it UTF-64. 32 bits would be allocated for custom character sets. Anybody could apply to a central authority to be allocated a custom id (32 bits = 4 billion ids). A central location, kind of like a domain name system, would map that id to the URL where the canonical definition for that character set is.

The 2nd option has the advantage that the file format is fixed width like normal plain text documents. Concatenating custom character set strings is no issue. The canonical location for a character set isn't forevermore mapped to a particular domain owner. Nothing about the meaning of the characters is defined in the actual bits other than the unique id. The disadvantage is that it needs a central authority to maintain the list of ids and map them to domains.

> On 1 Jun 2015, at 7:26 am, Philippe Verdy wrote:
>
> [...]
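A sketch of the resolution step in the scheme Philippe describes above; the identifier layout and the derived URL follow his x-www.example.net-mycharset-1 example, and all of it is illustrative rather than an existing mechanism:

    # Private charsets already known to the application (e.g. shipped with
    # it, or cached in a local repository folder): identifier -> definitions.
    LOCAL_REPOSITORY = {
        "x-www.example.net-mycharset-1": {"glyphs": "mycharset1.svg"},
    }

    def resolve_private_charset(identifier):
        if identifier in LOCAL_REPOSITORY:
            # Known charset: the URL stays a pure identifier and the
            # document still works in standalone mode.
            return LOCAL_REPOSITORY[identifier]
        # Otherwise derive the definition URL from the identifier:
        # "x-www.example.net-mycharset-1" -> "//www.example.net/mycharset/1/"
        body = identifier.removeprefix("x-")
        domain, _, rest = body.partition("-")
        url = "//" + domain + "/" + rest.replace("-", "/") + "/"
        # A real implementation would fetch the schema, web font and
        # character-property overrides from this URL, then cache them.
        return {"pending": url}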
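And a sketch of Chris's option 2; the name UTF-64, the 32/32 bit split, and the central registry are his hypothetical proposal:

    # One fixed-width 64-bit unit per character: the high 32 bits name the
    # custom character set (an id issued by the central registry; 0 could
    # mean plain Unicode), the low 32 bits are the code point within it.
    def utf64_encode(charset_id, code_point):
        assert 0 <= charset_id < 2**32 and 0 <= code_point < 2**32
        return (charset_id << 32) | code_point

    def utf64_decode(unit):
        return unit >> 32, unit & 0xFFFFFFFF

    unit = utf64_encode(charset_id=12345, code_point=0x42)
    assert utf64_decode(unit) == (12345, 0x42)
    # Concatenation needs no namespace merging: every unit already
    # carries the id of the charset it belongs to.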
> For example, "x-example.net-mycharset-1" would map to a URL like "//www.example.net/mycharset/1/" containing some schema (it could be the base address of an XML or JSON file, and of a web font containing the relevant glyphs, and of a character properties database to override the default ones from the standard: if you already know this private charset in your application, you don't need to download any of these files, the URL is just an identifier and your file can still be used in standalone mode, just like you can parse many standard XML schemas by just recognizing the URLs assigned to the XML namespaces, without even having to find a DTD or XML schema definition from an external resource; if needed, your app can contain a local repository in some cache folder where you can extend the number of private "charsets" that can be recognized).
>
> ----
>
> Full interoperability will still not be possible if you need to mix, in the same document, texts encoded with different private charsets (there's always a risk of collision) without a way to re-encode some of them to a joined charset without the collisions, by inferring a new private charset (it's not impossible to do; after all, this is done already with XML schemas that you can mix together: you just need to rename the XML namespaces, keeping the URLs to which they are bound, when there's a collision on the XML namespace names, a situation that occurs sometimes because of versioning where some features of a schema are not fully upward compatible).
>
> Yes, this complicates things a bit, but much less than when using documents in which PUA assignments are not negotiated at all (even minimally, to make sure they are compatible when mixing sources), and for which there exists for now absolutely no protocol defined for such negotiation (TUS says that PUAs are usable and interchangeable under "private mutual agreement" but still provides no scheme for supporting such mutual agreement, and for this reason PUAs are almost always rejected, and people want true permanent assignments for characters that are very specific, badly documented, or insufficiently known to have reliable permanent properties).
>
> So let's think about securing the use of PUAs with some identification scheme (for plain-text formats, it should just be allowed to negotiate a single charset for the whole, using the "magic" header tricks that have long been used by charset guessers, including for autodetecting UTF-8 encoded files).
>
> This would also solve the chicken-and-egg problem where we need more sources to attest an effective usage before encoding new characters, but developing such usage is extremely difficult (and much slower) in our modern technologies where most documents are now handled digitally (in the past it was possible to create a metal font and use it immediately to start editing books, and there were many more people using handwriting and drawings, so it was much less difficult to invent new characters than it is today, unless you're a big company that has enough resources to develop this usage alone, such as Japanese telcos or Google, Yahoo, Samsung or Microsoft introducing new sets of Emojis for their instant messaging platforms, with tons of developers working for them to develop a wide range of services around it...)
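> To give a rough idea (the field names are invented, nothing here is normative), the definitions resource at "//www.example.net/mycharset/1/" could be a small JSON file such as:
>
>     {
>       "charset": "x-www.example.net-mycharset-1",
>       "glyphs": "font.svg",
>       "properties": {
>         "U+E000": { "category": "So", "line_break": "ID", "east_asian_width": "W" }
>       }
>     }
>
> with the glyphs file and the property overrides resolved relative to that base URL, exactly as described above.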
> However, I'm not saying that Unicode should specify how such a private charset containing private assignments could be inserted in headers (I just think that it should describe a mechanism and give examples of how common text formats are already used to convey some "magic" identifiers near the top of the file; then we could describe a service allowing one to locate and retrieve the associated definitions for this identifier, and some interchangeable format for this information).
>
> 2015-05-31 17:50 GMT+02:00 Asmus Freytag (t):
>
> On 5/31/2015 5:33 AM, Chris-as-John wrote:
>>
>> Yes, Asmus, good post. But I don't really think HTML, even a subset, is really the right solution.
>
> The longer I think about this, what would be needed would be something like an "abstract" format: a specification of the capabilities to be supported and the types of properties needed to support them in an extensible way. HTML and CSS would possibly become an implementation of such a specification.
>
> There would still be a place for a character set, that is, Unicode, as an efficient way to implement the most basic and most standard features of text content, but perhaps with some extension mechanism that can handle various extensions.
>
> The first level of extension is support for recent (or rare) code points in the character set (additional fonts, etc., as you mention).
>
> The next level of extension could be support for collections of custom entities that are not available as character sets (stickers and the like).
>
> And finally, there would have to be a way to deal with "one-offs", such as actual images that do not form categorizable sets, but are used in an ad-hoc manner and behave like custom characters.
>
> And so on.
>
> It should be possible to describe all of this in a way that allows it to be mapped to HTML and CSS or to any other rich text format -- the goal, after all, is to make such "inline text" as widely and effortlessly interchangeable as plain text is today (or at least nearly so).
>
> By keeping the specification abstract, you could accommodate both SGML-like formats where ASCII-string markup is intermixed with the text, as well as pure text buffers with placeholder code points and links to external data.
>
> But, however bored you are with plain Unicode emoji, as long as there isn't an agreed-upon common format for rich "inline text", I see very little chance that those cute Facebook emoji will do anything other than firmly keep you in that particular ghetto.
>
> A./
>
>> I'm reminded of the design of XML itself: it is supposed to start with a header that defines what that XML will conform to. Those definitions contain a unique identifier of that XML schema, which happens to be a URL. The URL is partly just a convenient unique identifier, but also, the XML engine, if it doesn't know about that schema, could go to that URL, download the schema, and check that the XML conforms to that schema.
>>
>> Similarly, imagine a text format that had a header with something like:
>> \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345
>>
>> Now the text that follows will interpret characters that start with 12345 with respect to that character set. What would you find at facebook.com/charsets/pusheen-the-cat-emoji/ ? You might find bitmaps, TrueType fonts, vector graphics, etc. You might find many representations of that character set that your rendering engine could cache for future use.
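>> (Purely hypothetically, a document using that header might then read:
>>
>>     \uCHARSET:facebook.com/charsets/pusheen-the-cat-emoji/,12345
>>     Look at \u{12345,0007} chasing \u{12345,0012}!
>>
>> where \u{12345,0007} stands for character 7 of the custom set registered under the prefix 12345. The escape syntax here is invented; the point is only that the body refers back to a single header.)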
>> The text format wouldn't be reliant on today's favorite rendering technology, whether bitmap, TrueType fonts, or whatever. Right now, if you go to a website that references Unicode that your platform doesn't know about, you see nothing. If a format like this existed, character sets would be infinitely extensible, everybody on earth could see the characters, even if their platform wasn't previously aware of them, and the format would be independent of today's rendering technologies. Let's face it, HTML5 changes every few years, and I don't think anybody wants the fundamental textual representation dependent on an entire layout engine. And also, the whole range of what HTML5 can do, even some subset, is too much information. You don't necessarily want your text to embed the actual character set. Perhaps that might be a useful option, but I think most people would want to uniquely identify the character set, in a way that an engine can download it, but without defining the actual details itself. Of course, certain charsets would probably become pervasive enough that platforms would just include them for convenience: emojis by major messaging platforms, maybe characters related to specialised domains like, I don't know, mapping or specialised work domains or whatever. But without having to be subservient to the central Unicode committee.
>>
>> As someone who is a keen user of Facebook Messenger, and who sees them bring out a new set of emoji almost every week, I think the world will soon be totally bored with the plain basic emoji that Unicode has defined.
>>
>> --
>> Chris
>>
>> On Sun, May 31, 2015 at 9:06 PM, Asmus Freytag (t) wrote:
>>
>> Reading this discussion, I agree with your reductio ad absurdum of infinitely nested HTML.
>>
>> But I think you are onto something with your hypothetical example of the "subset that works in ALL textual situations".
>>
>> There's clearly a use case for something like it, and I believe many people would intuitively agree on a set of features for it.
>>
>> What people seem to have in mind is something like "inline" text: something beyond a mere stream of plain text (with effectively every character rendered visibly), but still limited in important ways by the general behavior of inline text: a string of it, laid out, must wrap and line-break, any objects included in it must behave like characters (albeit of custom width, height and appearance), and so on. Paragraph formatting, stacked layout, header levels and all those good things would not be available.
>>
>> With such a subset clearly defined, many quirky limitations might no longer be necessary; any container that today only takes plain text could be upgraded to take "inline text". I can see some inline containers retaining a nesting limitation, but I could imagine that it is possible to arrive at a consistent definition of such an inline format.
>>
>> Going further, I can't shake the impression that without a clean definition of an inline text format along those lines, any attempts at making stickers and similar solutions "stick" are doomed to failure.
>>
>> The interesting thing in defining such a format is not how to represent it in HTML or CSS syntax, but in describing what feature sets it must (minimally) support. Doing it that way would free existing implementations of rich text to map native formats onto that minimally required subset and to add them to their format translators for HTML or whatever else they use for interchange.
>> Only with a definition can you ever hope to develop a processing model. It won't be as simple as for plain text strings, but it should be able to support common abstractions (like iteration by logical unit). It would have to support the management of external resources -- if the inline format allows images, custom fonts, etc., one would need a way to manage references to them in the local context.
>>
>> If your skeptical position proves correct, in that this is something that turns out not to be tractable, then I think you've provided conclusive proof why stickers won't happen and why encoding emoji was the only sensible decision Unicode could have taken.
>>
>> A./
>>
>> On 5/30/2015 7:14 AM, John wrote:
>>>
>>> Hmm, these "once entities" of which you speak, do they require javascript? Because I'm not sure that what we are looking for here is static documents requiring a full programming language.
>>>
>>> But let's say for a moment that html5 can, or could, do the job here. Then to make the dream come true, that you could just cut and paste text that happened to contain a custom character to somewhere else and nothing untoward would happen, would mean that everything in the computing universe should allow full-blown HTML. So every Java Swing component, every Apple GUI component, every .NET component, every Windows component, every browser, every Android and iOS component would allow text entry of HTML entities. OK, so let's say everyone agrees with this course of action; now the universal text format is HTML.
>>>
>>> But in this new world, where anywhere that previously you could input text you can now input full-blown HTML, does that actually make sense? Does it make sense that you can, for example, put full-blown HTML inside an H1 tag in HTML itself? That's a lot of recursion going on there. Or in an MS-Excel cell? Or interspersed in some otherwise fairly regular text in a Word document?
>>>
>>> I suppose someone could define a strict, limited subset of HTML to be that subset that makes sense in ALL textual situations. That subset would be something like just defining things that act like characters, and not like a full-blown rendering engine. But who would define that subset? Not the HTML groups, because their mandate is to define full-blown rendering engines. It would be more likely to be something like the Unicode group.
>>>
>>> And also, in this brave new world where HTML5 is the new standard text format, what would the binary format of it be? I mean, would I have to write the string of Unicode characters "&gt;" to mean greater-than? Do we want HTML to be so pervasive? Not sure it deserves that.
>>>
>>> And from a programmer's point of view, he wants to be able to iterate over an array of characters and treat each one the same way, regardless of whether it is a custom character or not. Without that kind of programmatic abstraction, the whole thing can never gain traction. I don't think full-blown HTML embedded in your text can fulfill that. A very strictly defined subset possibly could. Sure, HTML5 can RENDER stuff adequately, if the only aim of the game is to provide a correct rendering. But to be able to actually treat particular embedded images as characters, and have some programming library see that abstraction consistently, I'm not sure I'm convinced that is possible. Not without nailing down exactly what HTML elements in what particular circumstances constitute a "character".
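>>> To sketch the kind of abstraction I mean (in made-up Python, with an invented charset id; nothing here is standard):
>>>
>>>     from dataclasses import dataclass
>>>
>>>     @dataclass
>>>     class CustomChar:
>>>         charset: str   # invented identifier, e.g. "example.net/mycharset/1/"
>>>         index: int     # code point within that custom set
>>>
>>>     # Ordinary characters and custom characters are iterated and
>>>     # measured through the same interface, one logical unit at a time.
>>>     def display_width(unit) -> int:
>>>         return 2 if isinstance(unit, CustomChar) else 1
>>>
>>>     text = ["H", "i", " ", CustomChar("example.net/mycharset/1/", 7)]
>>>     print(sum(display_width(u) for u in text))  # prints 5
>>>
>>> That is the kind of consistency a library would need to provide, and I don't see full-blown HTML giving it to you.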
>>> I guess in summary, yes, we have the technology already to render anything. But I don't think the whole standards framework does anything to allow the computing universe to actually exchange custom characters as if they were just any other text. Someone would actually have to work on a standard to do that, not just point to html5.
>>>
>>> On Saturday, 30 May 2015 at 5:08 am, Philippe Verdy wrote:
>>>
>>> 2015-05-29 4:37 GMT+02:00 John:
>>> "Today the world goes very well with HTML(5) which is now the best markup language for documents (including for inserting embedded images that don't require any external request)."
>>> If I had a large document that reused a particular character thousands of times, would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space-efficient way?
>>>
>>> HTML(5) allows defining *once* entities for images that can then be reused thousands of times without repeating their definition. You can do this as well with CSS styles: just define a class for a small element. This element may still be an "image", but the semantics are carried by the class you assign to it. You are not required to provide an external source URL for that image if the CSS style provides the content.
>>>
>>> You may also use PUAs for the same purpose (however, I have not seen how CSS allows styling individual characters in text elements, as these characters are not elements, and there's no defined selector for pseudo-elements matching a single character). PUAs are perfectly usable in the situation where you have embedded a custom font in your document for assigning glyphs to characters (you can still do that, but I would avoid TrueType/OpenType for this purpose, and would instead use the SVG font format, which is valid in CSS, for defining a collection of glyphs).
>>>
>>> If the document is not restricted to being standalone, of course you can use links to an external shared CSS stylesheet and to this SVG font referenced by the stylesheet. With such an approach, you don't even need to use classes on elements; you use plain text with very compact PUAs (it's up to you to decide whether the document must be standalone, embedding everything it needs, or must use external references for missing definitions; HTML allows both, and SVG as well when it contains plain-text elements).

From prosfilaes at gmail.com Sun May 31 20:29:27 2015
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 01 Jun 2015 01:29:27 +0000
Subject: the usage of LATIN SMALL LETTER A WITH STROKE
In-Reply-To: <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
Message-ID:

On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:

> The proposal makes me curious about past and present Unicode policy,
> e.g. would it be accepted if submitted now.

Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.